feat: surface token usage metadata for billing (#907) by yyyyaaa · Pull Request #10 · constructive-io/agentic-kit

yyyyaaa · 2026-05-21T05:56:11Z

Summary

Producer side of the billing/metering hookup tracked in constructive-io/constructive-planning#907. Surfaces prompt_tokens, completion_tokens, total_tokens, and reasoning_tokens on agentic-kit responses, plus a cumulative rollup on the agent loop and React hook.

Field mapping (for the billing consumer)

| Billing field | agentic-kit field |
| --- | --- |
| `prompt_tokens` | `usage.input` |
| `completion_tokens` | `usage.output` |
| `reasoning_tokens` | `usage.reasoning` |
| `total_tokens` | `usage.totalTokens` |
| `cache_read_tokens` | `usage.cacheRead` |
| `cache_write_tokens` | `usage.cacheWrite` |

`reasoning` is a subset of `output`, matching OpenAI's wire contract, where `completion_tokens` already includes reasoning tokens. The invariant `totalTokens = input + output + cacheRead + cacheWrite` is unchanged, so nothing is double-counted. Billing computes pure-completion tokens as `output - reasoning`.
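As a rough illustration of how a billing consumer might translate at the boundary, here is a minimal sketch. Only the field names come from the mapping above; the `BillingRecord` interface and the helper names `toBillingRecord`, `pureCompletionTokens`, and `holdsTotalInvariant` are hypothetical and not part of agentic-kit.

```typescript
// Hypothetical shapes for illustration; only the field names come from the mapping.
interface Usage {
  input: number;
  output: number; // already includes reasoning tokens
  reasoning: number; // subset of output, never added on top of it
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

interface BillingRecord {
  prompt_tokens: number;
  completion_tokens: number;
  reasoning_tokens: number;
  total_tokens: number;
  cache_read_tokens: number;
  cache_write_tokens: number;
}

// Translate the canonical agentic-kit shape into billing field names.
function toBillingRecord(usage: Usage): BillingRecord {
  return {
    prompt_tokens: usage.input,
    completion_tokens: usage.output,
    reasoning_tokens: usage.reasoning,
    total_tokens: usage.totalTokens,
    cache_read_tokens: usage.cacheRead,
    cache_write_tokens: usage.cacheWrite,
  };
}

// Pure-completion tokens: output minus the reasoning subset.
function pureCompletionTokens(usage: Usage): number {
  return usage.output - usage.reasoning;
}

// The stated invariant; reasoning does not appear because it is inside output.
function holdsTotalInvariant(usage: Usage): boolean {
  return (
    usage.totalTokens ===
    usage.input + usage.output + usage.cacheRead + usage.cacheWrite
  );
}
```

A consumer could run `holdsTotalInvariant` as a sanity check before writing a record, which would catch a provider adapter that adds reasoning on top of output.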

Commit structure

Each concern is its own commit so reviewers can land them independently if needed:

feat(agentic-kit) — Add reasoning to Usage, add addUsage/snapshotUsage helpers, drop unused return from calculateUsageCost.
fix(openai) — Stop double-counting reasoning into output, expose usage.reasoning, include cacheWrite in the totalTokens fallback.
fix(ollama) — Actually invoke calculateUsageCost so cost.total populates (was silently zero).
fix(anthropic) — Initialize reasoning: 0 (API does not expose this field).
feat(agent) — Accumulate totalUsage on AgentState, snapshot onto turn_end/agent_end events, reset on prompt(), preserve across continue().
feat(react) — Surface usage: Usage | null on useChat with totalTokens-keyed change-detection guard to avoid no-op re-renders.
docs — Append-only LLM_METADATA_DECISIONS.md recording the design decisions.

Out of scope (explicit)

OpenAI-named alias fields (promptTokens, etc.) — single canonical shape; consumers translate at the boundary.
OpenRouter prompt_tokens_details.cache_write_tokens ingestion — no consumer yet.
Service-tier cost multipliers (flex/priority).
Separate cost rate for reasoning tokens — every provider we ship currently prices reasoning at the output rate.

Test plan

pnpm build succeeds across the workspace (ESM/CJS dual output)
pnpm -r test — new assertions pass, existing usage assertions still pass
Manual: single-turn OpenAI call with a reasoning model — output > 0, reasoning > 0, totalTokens === input + output + cacheRead + cacheWrite
Manual: multi-turn tool-use call — state.totalUsage matches manual sum across turns field-for-field
Manual: Anthropic call with cache-primed prompt — reasoning === 0, cacheRead/cacheWrite non-zero
Manual: Ollama call against a model with cost set on the descriptor — cost.total > 0
Manual: React example app or mocked SSE — useChat().usage populates on agent_end and resets on new prompt()

Add `Usage.reasoning: number` (required, defaults to 0) so billing/metering consumers can distinguish pure-completion tokens from reasoning tokens. Per the OpenAI wire contract, reasoning is a subset of `output` rather than a sibling, so `totalTokens` keeps the existing invariant (input + output + cacheRead + cacheWrite) with no double-counting.

Add two helpers:
- `snapshotUsage(usage)`: two-level shallow copy used by event emits so consumers receive a stable value independent of subsequent accumulation.
- `addUsage(target, delta)`: in-place additive accumulator for cumulative totals across turns. Returns the mutated target for chaining.

Align `calculateUsageCost` to return `void` (matches the per-provider local copies; no caller used the previous `Usage['cost']` return).

Export `ZERO_USAGE` from `tools/test/fixtures.ts` so workspace tests share a single canonical zero-usage literal instead of duplicating it.
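The two helpers described in this commit message might look roughly like the sketch below. The helper names and their described behaviour come from the message; the exact `Usage` and `cost` field layout is an assumption inferred from the rest of this PR, not the actual source.

```typescript
// Assumed shape: the cost sub-fields are inferred, not confirmed by the source.
interface UsageCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

// Two-level shallow copy: the top-level object and the nested cost object
// are both cloned, so later accumulation cannot change the snapshot.
function snapshotUsage(usage: Usage): Usage {
  return { ...usage, cost: { ...usage.cost } };
}

// In-place additive accumulator; returns the mutated target for chaining.
function addUsage(target: Usage, delta: Usage): Usage {
  target.input += delta.input;
  target.output += delta.output;
  target.reasoning += delta.reasoning;
  target.cacheRead += delta.cacheRead;
  target.cacheWrite += delta.cacheWrite;
  target.totalTokens += delta.totalTokens;
  target.cost.input += delta.cost.input;
  target.cost.output += delta.cost.output;
  target.cost.cacheRead += delta.cost.cacheRead;
  target.cost.cacheWrite += delta.cost.cacheWrite;
  target.cost.total += delta.cost.total;
  return target;
}
```

The snapshot matters because `addUsage` mutates its target: an event payload that held a reference to the running total would otherwise change underneath the consumer on the next turn.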

…okens fallback

- Stop double-counting reasoning by setting usage.output to completion_tokens (which already includes reasoning per OpenAI's wire contract)
- Expose reasoning as a separate read-only count on usage.reasoning
- Include cacheWrite in the totalTokens fallback when total_tokens is absent
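The three points above could be sketched as a single extractor. This is not the adapter's actual code: the function name `extractOpenAIUsage` is hypothetical, and two details are assumptions made so the `totalTokens` invariant holds, namely that `input` excludes cached prompt tokens and that `cacheWrite` is 0 for OpenAI responses. The wire field names are the ones OpenAI documents for the Chat Completions usage object.

```typescript
// Subset of the OpenAI Chat Completions usage object.
interface OpenAIWireUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  total_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  completion_tokens_details?: { reasoning_tokens?: number };
}

interface UsageTokens {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Hypothetical extractor illustrating the three fixes.
function extractOpenAIUsage(wire: OpenAIWireUsage): UsageTokens {
  const cacheRead = wire.prompt_tokens_details?.cached_tokens ?? 0;
  // Assumption: this sketch treats cache writes as unreported by OpenAI.
  const cacheWrite = 0;
  // Assumption: input excludes cached tokens so they are not counted twice.
  const input = Math.max(0, (wire.prompt_tokens ?? 0) - cacheRead);
  // completion_tokens already includes reasoning, so nothing is added here.
  const output = wire.completion_tokens ?? 0;
  const reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  // Prefer the wire total; otherwise fall back to the full invariant sum.
  const totalTokens =
    wire.total_tokens ?? (input + output + cacheRead + cacheWrite);
  return { input, output, reasoning, cacheRead, cacheWrite, totalTokens };
}
```

With these assumptions, the fallback total equals `prompt_tokens + completion_tokens`, which is what the wire `total_tokens` reports when it is present.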

The Ollama adapter previously assigned input/output/totalTokens but never ran the cost schedule, leaving cost.total at zero even when the model descriptor defined per-token rates. Apply the local calculateUsageCost helper after token assignment so the same Usage invariants hold across providers.
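To make the fix concrete, here is a minimal sketch of a cost helper in the spirit of `calculateUsageCost`. The `void` return and the `cost.total = sum(cost.*)` relationship come from this PR; the `CostRates` shape, the parameter order, and the assumption that rates are expressed per million tokens are hypothetical.

```typescript
// Assumption: rates are per one million tokens in the model's billing currency.
interface CostRates {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  cost: {
    input: number;
    output: number;
    cacheRead: number;
    cacheWrite: number;
    total: number;
  };
}

// Mutates usage.cost in place and returns void, matching the aligned signature.
function calculateUsageCost(rates: CostRates, usage: Usage): void {
  const perMillion = 1e6;
  usage.cost.input = (rates.input * usage.input) / perMillion;
  usage.cost.output = (rates.output * usage.output) / perMillion;
  usage.cost.cacheRead = (rates.cacheRead * usage.cacheRead) / perMillion;
  usage.cost.cacheWrite = (rates.cacheWrite * usage.cacheWrite) / perMillion;
  // total is always the sum of the component costs.
  usage.cost.total =
    usage.cost.input +
    usage.cost.output +
    usage.cost.cacheRead +
    usage.cost.cacheWrite;
}
```

The bug described above corresponds to assigning the token counts but never making this call, which leaves every `cost` field at its zero initial value.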

The Anthropic API does not expose a reasoning-token count even when extended thinking is enabled; thinking tokens are folded into output_tokens on the server side. Initialize usage.reasoning to 0 so the field is always present, and add a regression guard so it is not later populated from a hallucinated payload field.

…ent_end

Add totalUsage to AgentState and accumulate per-message Usage (tokens and cost.*) as each turn completes. Snapshot the rolling total onto turn_end and agent_end events so consumers can read a cumulative figure without re-walking messages[]. Reset on prompt() (matching stepCount semantics), preserve across continue().
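The reset and preserve semantics can be illustrated with a small stand-alone tracker. The class `UsageTracker` and its method names are hypothetical stand-ins for the agent loop, and the `Usage` shape is reduced to token counts to keep the sketch short.

```typescript
// Reduced, hypothetical Usage shape (token counts only).
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

const zeroUsage = (): Usage => ({
  input: 0,
  output: 0,
  reasoning: 0,
  cacheRead: 0,
  cacheWrite: 0,
  totalTokens: 0,
});

// Hypothetical stand-in for the part of the agent that owns totalUsage.
class UsageTracker {
  totalUsage: Usage = zeroUsage();

  // A new prompt() starts a fresh run, so the running total resets.
  onPrompt(): void {
    this.totalUsage = zeroUsage();
  }

  // continue() resumes the same run, so the running total is preserved.
  onContinue(): void {
    // intentionally empty
  }

  // Accumulate one turn and return a copy suitable for a turn_end payload.
  onTurnEnd(turn: Usage): Usage {
    const t = this.totalUsage;
    t.input += turn.input;
    t.output += turn.output;
    t.reasoning += turn.reasoning;
    t.cacheRead += turn.cacheRead;
    t.cacheWrite += turn.cacheWrite;
    t.totalTokens += turn.totalTokens;
    return { ...t };
  }
}
```

Returning a copy from `onTurnEnd` mirrors the snapshot behaviour: an event emitted for turn 1 keeps its figures even after turn 2 is accumulated.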

… guard

Surface the cumulative Usage from agent_end/turn_end as a top-level useChat field, null before the first event. Guard the setter on totalTokens so a re-render does not fire when nothing changed. Reset to null on every new prompt(), mirroring the agent-side reset.
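One way to picture the change-detection guard is as a pure state-transition function. The name `nextUsageState` is hypothetical and the hook's real implementation is not shown in this PR text; the sketch relies only on the fact that a React state setter skips the re-render when the updater returns the identical previous value.

```typescript
// Reduced, hypothetical Usage shape; only totalTokens matters for the guard.
interface Usage {
  input: number;
  output: number;
  totalTokens: number;
}

// Returns the previous reference when totalTokens is unchanged, so a caller
// such as setUsage(prev => nextUsageState(prev, incoming)) becomes a no-op.
function nextUsageState(prev: Usage | null, incoming: Usage): Usage | null {
  if (prev !== null && prev.totalTokens === incoming.totalTokens) {
    return prev;
  }
  return incoming;
}

// A new prompt() clears the surfaced usage, mirroring the agent-side reset.
function resetUsageState(): Usage | null {
  return null;
}
```

Keying the guard on `totalTokens` alone is cheap, and because token counts only grow within a run, an unchanged total implies the other counts are unchanged too.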

Capture the design decisions for the token usage metadata work that landed in this branch (reasoning-as-subset, no provider-named aliases, cumulative usage location, shared cost helper, etc.) as an append-only log next to REDESIGN_DECISIONS.md.

createModel spread builtIn.compat first and this.compat second, so the adapter's generic default (maxTokensField: 'max_tokens') silently clobbered the model-specific override set by the built-in entry ('max_completion_tokens' for reasoning-capable models). As a result, every reasoning model loaded via createModel sent the wrong field name and OpenAI returned 400 "Unsupported parameter: 'max_tokens'". The same bug applied to headers.

The mock-mode unit tests did not catch it because the mocked fetch never validated the request body; the live smoke test caught it on the first real call.

Swap the order to adapter defaults → built-in catalog → caller overrides, so the most specific source wins. Adds two regression tests.
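The precedence bug comes down to object-spread order, where later sources win. The sketch below contrasts the two orders; the `Compat` shape and the function names are hypothetical, and only the `maxTokensField` values are taken from the commit message.

```typescript
// Hypothetical compat shape; only maxTokensField is named in the source.
interface Compat {
  maxTokensField?: string;
  supportsReasoning?: boolean;
}

// Buggy order: the adapter's generic default is spread last and wins.
function resolveCompatBuggy(builtIn: Compat, adapterDefaults: Compat): Compat {
  return { ...builtIn, ...adapterDefaults };
}

// Fixed order: adapter defaults, then built-in catalog, then caller overrides,
// so the most specific source wins.
function resolveCompat(
  adapterDefaults: Compat,
  builtIn: Compat,
  overrides: Compat = {},
): Compat {
  return { ...adapterDefaults, ...builtIn, ...overrides };
}

const adapterDefaults: Compat = { maxTokensField: "max_tokens" };
const builtInReasoningModel: Compat = {
  maxTokensField: "max_completion_tokens",
  supportsReasoning: true,
};

// The buggy merge sends max_tokens; the fixed merge keeps the model-specific field.
const buggy = resolveCompatBuggy(builtInReasoningModel, adapterDefaults);
const fixed = resolveCompat(adapterDefaults, builtInReasoningModel);
```

One caveat with spread-based merging: a later source that carries a key explicitly set to `undefined` still overwrites earlier values, so override objects should omit keys they do not intend to set.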

Empirical-verification pass for the LLM-call metadata work (issue #907). Until now every assertion rested on mocked SSE streams; this adds live provider eval suites that hit real endpoints and verify three load-bearing claims against actual wire payloads:

- OpenAI: `completion_tokens` already includes `reasoning_tokens` (so `output = completion_tokens`, no carve-out)
- OpenAI: `prompt_tokens_details.cached_tokens` populates on ≥1024-token prefix matches and surfaces as `usage.cacheRead`
- Ollama: thinking content has no associated token count; `usage.reasoning` must stay 0 even on a thinking-on Qwen3 turn

The reasoning-subset claim is the one that drove removing `+ reasoningTokens` from the OpenAI output extractor; live verification confirms the wire shape matches our assumption against `gpt-5.4-nano`.

Infrastructure:

- `tools/test/load-env.js` walks up to find a workspace `.env`; silent if absent so CI is unaffected
- `tools/test/live.ts` provides `liveDescribe`/`requireEnv`/`suiteLevel` helpers (OpenAI/Ollama/Agent suites use the equivalent inline pattern for full TypeScript inference; the helpers are exported for future use)
- Each suite is gated by `<NAMESPACE>_LIVE_SUITE=smoke|extended`; runner scripts set `*_LIVE_READY=1` which un-ignores the live test files and disables the global `fetch` mock in jest setup
- New root scripts: `test:live:openai{,:smoke,:extended}`, `test:live:agent{,:smoke,:extended}`
- Ollama smoke runner unchanged; ollama.live.test.ts extended with a new `Ollama live token-usage audit` describe block (4 tests)
- `.gitignore` updated to cover `.env`/`.env.local` (secrets-leak gap)

Suites are excluded from default `pnpm test` via `testPathIgnorePatterns`, require explicit env vars, and never run in CI. See `LLM_METADATA_DECISIONS.md` #19 for the rationale and #20 for the adapter→builtin→override precedence bug the live tests uncovered.

Verified locally:

- `pnpm test:live:openai:extended`: 5/5 pass
- `pnpm test:live:ollama:extended`: 12/12 pass
- `pnpm test:live:agent:extended`: 3/3 pass
- `pnpm test` (default): 109/109 pass with no live env vars set
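The `<NAMESPACE>_LIVE_SUITE=smoke|extended` gating described in this commit message could be implemented along these lines. The helper name `suiteLevel` appears in the message, but its real signature is not shown, so the signature here is a guess; `shouldRun` and the rule that `extended` also runs smoke-level tests are assumptions made for illustration. The environment is passed in as a plain object so the sketch does not depend on Node typings.

```typescript
type SuiteLevel = "smoke" | "extended";
type Env = Record<string, string | undefined>;

// Hypothetical signature: read <NAMESPACE>_LIVE_SUITE and accept only the two
// documented values; anything else (including unset) disables the suite.
function suiteLevel(env: Env, namespace: string): SuiteLevel | null {
  const raw = env[`${namespace}_LIVE_SUITE`];
  return raw === "smoke" || raw === "extended" ? raw : null;
}

// Assumption: an extended run is a superset of a smoke run.
function shouldRun(active: SuiteLevel | null, required: SuiteLevel): boolean {
  if (active === null) return false;
  return active === "extended" || required === "smoke";
}
```

In a Jest file, the result of `shouldRun` could select between `describe` and `describe.skip`, so an unset variable leaves the live block skipped rather than failing.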

yyyyaaa · 2026-05-21T09:34:43Z

Live-provider eval hardening pass

Pushed two follow-up commits:

ff81bc8 — fix(openai): restore adapter → built-in → override precedence in createModel. The original spread order silently clobbered model-specific compat (notably maxTokensField: 'max_completion_tokens' for reasoning models), causing OpenAI to return 400 for every reasoning model. Mocked unit tests didn't catch it because the mock fetch never validated the body. Live smoke caught it on the first real call. Adds two regression unit tests.
5118c06 — test(live): opt-in provider eval suites for OpenAI / Ollama / Agent. Excluded from default pnpm test, gated by *_LIVE_SUITE=smoke|extended env vars, never run in CI.

Claims now empirically verified

| Claim | Verified |
| --- | --- |
| OpenAI: `output = completion_tokens` (reasoning_tokens are a subset, not additive) | ✅ `gpt-5.4-nano` w/ `reasoning_effort: low` |
| OpenAI: `cacheRead` populates from `prompt_tokens_details.cached_tokens` on ≥1024-token prefix match | ✅ `gpt-5.4-nano` |
| OpenAI: `totalTokens` matches wire `total_tokens` | ✅ `gpt-5.4-nano` |
| Cost rates apply (`cost.total = sum(cost.*)`) | ✅ OpenAI + Ollama |
| Ollama: thinking content has no token count; `usage.reasoning` stays 0 | ✅ `qwen3:0.6b` w/ `think: true` |
| Thinking events round-trip end-to-end | ✅ `qwen3:0.6b` |
| Cumulative `state.totalUsage` equals field-wise sum across turns | ✅ via `Agent` on `gpt-5.4-nano` |

Local results

pnpm test (default, no env): 109/109 pass — live tests correctly excluded
pnpm test:live:openai:extended: 5/5 pass
pnpm test:live:ollama:extended: 12/12 pass
pnpm test:live:agent:extended: 3/3 pass

Decision log updates

LLM_METADATA_DECISIONS.md #19: live eval gating policy
LLM_METADATA_DECISIONS.md #20: compat precedence rule

Deferred (known gap)

Anthropic live eval: same metadata surface as OpenAI/Ollama, same correctness risk. Not in scope for this pass; the shared infra (tools/test/load-env.js, tools/test/live.ts, runner script template) is provider-neutral and ready to extend.

yyyyaaa added 9 commits May 21, 2026 12:53

pyramation marked this pull request as ready for review May 21, 2026 20:54

pyramation merged commit 79067c6 into main May 21, 2026
12 checks passed

pyramation merged 9 commits into main from worktree-issue-907-llm-call-metadata
