feat: surface token usage metadata for billing (#907) #10

Merged
pyramation merged 9 commits into main from worktree-issue-907-llm-call-metadata
May 21, 2026
yyyyaaa commented May 21, 2026

Summary

Producer side of the billing/metering hookup tracked in constructive-io/constructive-planning#907. Surfaces prompt_tokens, completion_tokens, total_tokens, and reasoning_tokens on agentic-kit responses, plus a cumulative rollup on the agent loop and React hook.

Field mapping (for the billing consumer)

| Billing field | agentic-kit field |
| --- | --- |
| prompt_tokens | usage.input |
| completion_tokens | usage.output |
| reasoning_tokens | usage.reasoning |
| total_tokens | usage.totalTokens |
| cache_read_tokens | usage.cacheRead |
| cache_write_tokens | usage.cacheWrite |

reasoning is a subset of output (matches OpenAI's wire contract — completion_tokens already includes reasoning). totalTokens = input + output + cacheRead + cacheWrite — invariant unchanged, no double-counting. Billing computes pure-completion as output - reasoning.
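
As an illustration of the mapping and invariant above, here is a minimal sketch of how a billing consumer might translate at the boundary. `BillingUsage`, `toBillingUsage`, `holdsTotalInvariant`, and the derived `pure_completion_tokens` field are hypothetical names introduced for this example; only the `usage.*` field names come from this PR, and cost fields are omitted.

```typescript
// Token fields of the canonical usage shape described in this PR (cost omitted).
interface Usage {
  input: number;
  output: number; // already includes reasoning tokens
  reasoning: number; // subset of output, never added on top of it
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Hypothetical billing-side record; names follow the mapping table.
interface BillingUsage {
  prompt_tokens: number;
  completion_tokens: number;
  reasoning_tokens: number;
  total_tokens: number;
  cache_read_tokens: number;
  cache_write_tokens: number;
  pure_completion_tokens: number; // hypothetical derived field: output - reasoning
}

function toBillingUsage(usage: Usage): BillingUsage {
  return {
    prompt_tokens: usage.input,
    completion_tokens: usage.output,
    reasoning_tokens: usage.reasoning,
    total_tokens: usage.totalTokens,
    cache_read_tokens: usage.cacheRead,
    cache_write_tokens: usage.cacheWrite,
    pure_completion_tokens: usage.output - usage.reasoning,
  };
}

// Checks totalTokens = input + output + cacheRead + cacheWrite (reasoning is not a term).
function holdsTotalInvariant(usage: Usage): boolean {
  return (
    usage.totalTokens ===
    usage.input + usage.output + usage.cacheRead + usage.cacheWrite
  );
}
```

Because reasoning never appears as its own term in the sum, a consumer that adds `reasoning_tokens` to `completion_tokens` would double-count.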

Commit structure

Each concern is its own commit so reviewers can land them independently if needed:

  1. feat(agentic-kit) — Add reasoning to Usage, add addUsage/snapshotUsage helpers, drop unused return from calculateUsageCost.
  2. fix(openai) — Stop double-counting reasoning into output, expose usage.reasoning, include cacheWrite in the totalTokens fallback.
  3. fix(ollama) — Actually invoke calculateUsageCost so cost.total populates (was silently zero).
  4. fix(anthropic) — Initialize reasoning: 0 (API does not expose this field).
  5. feat(agent) — Accumulate totalUsage on AgentState, snapshot onto turn_end/agent_end events, reset on prompt(), preserve across continue().
  6. feat(react) — Surface usage: Usage | null on useChat with totalTokens-keyed change-detection guard to avoid no-op re-renders.
  7. docs — Append-only LLM_METADATA_DECISIONS.md recording the design decisions.

Out of scope (explicit)

  • OpenAI-named alias fields (promptTokens, etc.) — single canonical shape; consumers translate at the boundary.
  • OpenRouter prompt_tokens_details.cache_write_tokens ingestion — no consumer yet.
  • Service-tier cost multipliers (flex/priority).
  • Separate cost rate for reasoning tokens — every provider we ship currently prices reasoning at the output rate.

Test plan

  • pnpm build succeeds across the workspace (ESM/CJS dual output)
  • pnpm -r test — new assertions pass, existing usage assertions still pass
  • Manual: single-turn OpenAI call with a reasoning model — output > 0, reasoning > 0, totalTokens === input + output + cacheRead + cacheWrite
  • Manual: multi-turn tool-use call — state.totalUsage matches manual sum across turns field-for-field
  • Manual: Anthropic call with cache-primed prompt — reasoning === 0, cacheRead/cacheWrite non-zero
  • Manual: Ollama call against a model with cost set on the descriptor — cost.total > 0
  • Manual: React example app or mocked SSE — useChat().usage populates on agent_end and resets on new prompt()

yyyyaaa added 9 commits May 21, 2026 12:53
Add `Usage.reasoning: number` (required, defaults to 0) so billing/metering
consumers can distinguish pure-completion tokens from reasoning tokens. Per
the OpenAI wire contract, reasoning is a subset of `output` rather than a
sibling, so `totalTokens` keeps the existing invariant
(input + output + cacheRead + cacheWrite) — no double-counting.

Add two helpers:
- `snapshotUsage(usage)` — two-level shallow copy used by event emits so
  consumers receive a stable value independent of subsequent accumulation.
- `addUsage(target, delta)` — in-place additive accumulator for cumulative
  totals across turns. Returns the mutated target for chaining.

Align `calculateUsageCost` to return `void` (matches the per-provider local
copies; no caller used the previous `Usage['cost']` return).

Export `ZERO_USAGE` from `tools/test/fixtures.ts` so workspace tests share
a single canonical zero-usage literal instead of duplicating it.
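
For readers who want a concrete picture of the two helpers described in this commit, here is a rough sketch. The `cost` sub-fields other than `total` and the `zeroUsage()` factory are assumptions made for the example; the real definitions live in agentic-kit and may differ.

```typescript
// Assumed cost shape: one entry per token bucket plus a total.
interface UsageCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

// Hypothetical factory returning a fresh all-zero Usage.
function zeroUsage(): Usage {
  return {
    input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
    cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
  };
}

// Two-level shallow copy: the top-level object and the nested cost object are both new.
function snapshotUsage(usage: Usage): Usage {
  return { ...usage, cost: { ...usage.cost } };
}

// In-place accumulator: mutates target and returns it so calls can be chained.
function addUsage(target: Usage, delta: Usage): Usage {
  target.input += delta.input;
  target.output += delta.output;
  target.reasoning += delta.reasoning;
  target.cacheRead += delta.cacheRead;
  target.cacheWrite += delta.cacheWrite;
  target.totalTokens += delta.totalTokens;
  target.cost.input += delta.cost.input;
  target.cost.output += delta.cost.output;
  target.cost.cacheRead += delta.cost.cacheRead;
  target.cost.cacheWrite += delta.cost.cacheWrite;
  target.cost.total += delta.cost.total;
  return target;
}
```

The copy needs two levels because a one-level spread would share the nested `cost` object with the accumulator, so later turns would silently change values already handed to event consumers.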
…okens fallback

- Stop double-counting reasoning by setting usage.output to completion_tokens
  (which already includes reasoning per OpenAI's wire contract)
- Expose reasoning as a separate read-only count on usage.reasoning
- Include cacheWrite in the totalTokens fallback when total_tokens is absent
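
The extraction described in this commit might look roughly like the sketch below. The wire field names are those of the OpenAI usage object; `extractUsage` and `TokenUsage` are hypothetical names. Two assumptions are baked in: `input` is taken as `prompt_tokens` minus cached tokens (which is what the totalTokens invariant implies when `total_tokens` equals prompt plus completion), and `cacheWrite` is 0 because no write count is ingested here.

```typescript
// Subset of the OpenAI usage payload relevant to this sketch.
interface OpenAIWireUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  completion_tokens_details?: { reasoning_tokens?: number };
}

interface TokenUsage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

function extractUsage(wire: OpenAIWireUsage): TokenUsage {
  const cacheRead = wire.prompt_tokens_details?.cached_tokens ?? 0;
  const cacheWrite = 0; // assumption: not reported on this path
  const input = wire.prompt_tokens - cacheRead; // assumption: uncached prompt tokens
  // completion_tokens already includes reasoning, so nothing is added on top.
  const output = wire.completion_tokens;
  const reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  // Fallback when total_tokens is absent: sum every bucket, cacheWrite included.
  const totalTokens =
    wire.total_tokens ?? (input + output + cacheRead + cacheWrite);
  return { input, output, reasoning, cacheRead, cacheWrite, totalTokens };
}
```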

The Ollama adapter previously assigned input/output/totalTokens but never
ran the cost schedule, leaving cost.total at zero even when the model
descriptor defined per-token rates. Apply the local calculateUsageCost
helper after token assignment so the same Usage invariants hold across
providers.
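
To make the missing step concrete, here is a small sketch of a cost helper of the kind this commit invokes. The per-million-token rate unit, the `ModelRates` shape, and the parameter order are assumptions for illustration; only the name `calculateUsageCost`, its `void` return, and `cost.total` come from this PR.

```typescript
// Assumed rate table: currency units per one million tokens.
interface ModelRates {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface CostedUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  cost: { input: number; output: number; cacheRead: number; cacheWrite: number; total: number };
}

// Fills usage.cost in place and returns nothing, matching the void signature noted in this PR.
function calculateUsageCost(rates: ModelRates, usage: CostedUsage): void {
  usage.cost.input = (usage.input * rates.input) / 1_000_000;
  usage.cost.output = (usage.output * rates.output) / 1_000_000;
  usage.cost.cacheRead = (usage.cacheRead * rates.cacheRead) / 1_000_000;
  usage.cost.cacheWrite = (usage.cacheWrite * rates.cacheWrite) / 1_000_000;
  usage.cost.total =
    usage.cost.input + usage.cost.output + usage.cost.cacheRead + usage.cost.cacheWrite;
}
```

If the adapter assigns token counts but skips this call, every `cost` field keeps its initial zero, which is the silent failure the commit fixes.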

The Anthropic API does not expose a reasoning-token count even when
extended thinking is enabled; thinking cost is folded into output_tokens
server-side. Initialize usage.reasoning to 0 so the field is always
present, and add a regression guard so we do not later populate it from a
hallucinated payload field.
…ent_end

Add totalUsage to AgentState and accumulate per-message Usage (tokens and
cost.*) as each turn completes. Snapshot the rolling total onto turn_end
and agent_end events so consumers can read a cumulative figure without
re-walking messages[]. Reset on prompt() (matching stepCount semantics),
preserve across continue().
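
The lifecycle described in this commit can be sketched as follows. `AgentSketch` and `completeTurn` are hypothetical stand-ins for the real agent loop, and cost accumulation is left out to keep the example short; the reset-on-`prompt()` and preserve-on-`continue()` behaviour is what the commit message states.

```typescript
interface TokenTotals {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

const zeroTotals = (): TokenTotals => ({
  input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
});

class AgentSketch {
  state = { totalUsage: zeroTotals(), stepCount: 0 };

  // A new prompt starts a fresh run: cumulative usage resets alongside stepCount.
  prompt(): void {
    this.state.totalUsage = zeroTotals();
    this.state.stepCount = 0;
  }

  // Continuing an existing run keeps the totals accumulated so far.
  continue(): void {
    // intentionally leaves this.state.totalUsage untouched
  }

  // Hypothetical hook called once per completed turn with that turn's usage.
  completeTurn(usage: TokenTotals): { type: 'turn_end'; usage: TokenTotals } {
    const total = this.state.totalUsage;
    total.input += usage.input;
    total.output += usage.output;
    total.reasoning += usage.reasoning;
    total.cacheRead += usage.cacheRead;
    total.cacheWrite += usage.cacheWrite;
    total.totalTokens += usage.totalTokens;
    this.state.stepCount += 1;
    // Emit a copy so later accumulation cannot change an already-delivered event.
    return { type: 'turn_end', usage: { ...total } };
  }
}
```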
… guard

Surface the cumulative Usage from agent_end/turn_end as a top-level
useChat field, null before the first event. Guard the setter on
totalTokens so re-render does not fire when nothing changed. Reset to
null on every new prompt(), mirroring the agent-side reset.
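
One way the totalTokens guard could work is shown below as a pure function, so it can be read without a React setup. `guardUsageUpdate` is a hypothetical name; in the hook it would presumably be passed to the state setter as a functional updater, relying on React skipping a re-render when the updater returns the previous value.

```typescript
interface HasTotalTokens {
  totalTokens: number;
}

// Returns the previous reference when totalTokens is unchanged, otherwise the incoming value.
// Intended use (assumption): setUsage((prev) => guardUsageUpdate(prev, event.usage))
function guardUsageUpdate<T extends HasTotalTokens>(prev: T | null, next: T): T {
  if (prev !== null && prev.totalTokens === next.totalTokens) {
    return prev; // same reference, so the state is considered unchanged
  }
  return next;
}
```

Keying on `totalTokens` alone is cheap, and it is sufficient as long as every real change to usage also changes the total, which holds when totals only ever grow within a run.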

Capture the design decisions for the token usage metadata work that
landed in this branch (reasoning-as-subset, no provider-named aliases,
cumulative usage location, shared cost helper, etc.) as an append-only
log next to REDESIGN_DECISIONS.md.

The merge order in createModel spread builtIn.compat first then this.compat,
so the adapter's generic default (maxTokensField: 'max_tokens') silently
clobbered the model-specific override that the built-in entry sets
('max_completion_tokens' for reasoning-capable models). Result: every
reasoning model loaded via createModel sent the wrong field name and OpenAI
returned 400 "Unsupported parameter: 'max_tokens'". Same bug applied to
headers. The mock-mode unit tests didn't catch it because the mocked fetch
never validated the request body — the live smoke test caught it on the
first real call.

Swap to: adapter defaults → built-in catalog → caller overrides, so the
most specific source wins. Adds two regression tests.
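
The precedence fix comes down to spread order, which the following sketch tries to show. `Compat` and `mergeCompat` are simplified, hypothetical stand-ins for whatever createModel actually uses; only `maxTokensField`, its two values, and `headers` are taken from the commit message.

```typescript
interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
  headers?: Record<string, string>;
}

// Later spreads win, so the order is: adapter defaults, built-in catalog, caller overrides.
function mergeCompat(adapterDefaults: Compat, builtIn: Compat, overrides: Compat): Compat {
  return {
    ...adapterDefaults,
    ...builtIn,
    ...overrides,
    // Headers are merged key by key with the same precedence instead of replaced wholesale.
    headers: { ...adapterDefaults.headers, ...builtIn.headers, ...overrides.headers },
  };
}

// The buggy order for comparison: the generic adapter default is spread last and wins.
function mergeCompatBuggy(adapterDefaults: Compat, builtIn: Compat): Compat {
  return { ...builtIn, ...adapterDefaults };
}
```

With an adapter default of `max_tokens` and a built-in entry of `max_completion_tokens`, the first function keeps the model-specific value while the second reproduces the clobbering described above.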

Empirical-verification pass for the LLM-call metadata work (issue #907).
Until now every assertion rested on mocked SSE streams; this adds live
provider eval suites that hit real endpoints and verify three load-bearing
claims against actual wire payloads:

- OpenAI: `completion_tokens` already includes `reasoning_tokens` (so
  `output = completion_tokens`, no carve-out)
- OpenAI: `prompt_tokens_details.cached_tokens` populates on ≥1024-token
  prefix matches and surfaces as `usage.cacheRead`
- Ollama: thinking content has no associated token count; `usage.reasoning`
  must stay 0 even on a thinking-on Qwen3 turn

The reasoning-subset claim is the one that drove removing `+ reasoningTokens`
from the OpenAI output extractor — live verification confirms the wire shape
matches our assumption against `gpt-5.4-nano`.

Infrastructure:
- `tools/test/load-env.js` walks up to find a workspace `.env`; silent if
  absent so CI is unaffected
- `tools/test/live.ts` provides `liveDescribe`/`requireEnv`/`suiteLevel`
  helpers (OpenAI/Ollama/Agent suites use the equivalent inline pattern
  for full TypeScript inference; the helpers are exported for future use)
- Each suite is gated by `<NAMESPACE>_LIVE_SUITE=smoke|extended`; runner
  scripts set `*_LIVE_READY=1` which un-ignores the live test files and
  disables the global `fetch` mock in jest setup
- New root scripts: `test:live:openai{,:smoke,:extended}`,
  `test:live:agent{,:smoke,:extended}`
- Ollama smoke runner unchanged; ollama.live.test.ts extended with a new
  `Ollama live token-usage audit` describe block (4 tests)
- `.gitignore` updated to cover `.env`/`.env.local` (secrets-leak gap)

Suites are excluded from default `pnpm test` via `testPathIgnorePatterns`,
require explicit env vars, and never run in CI. See
`LLM_METADATA_DECISIONS.md` #19 for the rationale and #20 for the
adapter→builtin→override precedence bug the live tests uncovered.
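
A gating helper along the lines described above might look like this sketch. `suiteLevelFor` and `shouldRunLive` are hypothetical names, not the helpers exported from `tools/test/live.ts`, and the rule that an `extended` run also includes `smoke` tests is an assumption.

```typescript
type SuiteLevel = 'off' | 'smoke' | 'extended';

// Reads <NAMESPACE>_LIVE_SUITE from the given environment; anything unrecognised means off.
function suiteLevelFor(
  namespace: string,
  env: Record<string, string | undefined>,
): SuiteLevel {
  const raw = env[`${namespace}_LIVE_SUITE`];
  return raw === 'smoke' || raw === 'extended' ? raw : 'off';
}

// Decides whether a test tagged with a required level should run at the active level.
function shouldRunLive(required: 'smoke' | 'extended', active: SuiteLevel): boolean {
  if (active === 'off') return false;
  // Assumption: an extended run also executes the smoke tests.
  return required === 'smoke' || active === 'extended';
}

// In a Jest file this could select the block runner, for example:
//   const level = suiteLevelFor('OPENAI', process.env);
//   const maybeDescribe = shouldRunLive('smoke', level) ? describe : describe.skip;
```

Defaulting to `off` for a missing or misspelled value keeps a plain `pnpm test` from ever reaching a real endpoint.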

Verified locally:
- `pnpm test:live:openai:extended`: 5/5 pass
- `pnpm test:live:ollama:extended`: 12/12 pass
- `pnpm test:live:agent:extended`: 3/3 pass
- `pnpm test` (default): 109/109 pass with no live env vars set

yyyyaaa commented May 21, 2026

Live-provider eval hardening pass

Pushed two follow-up commits:

  • ff81bc8 fix(openai): restore adapter → built-in → override precedence in createModel. The original spread order silently clobbered model-specific compat (notably maxTokensField: 'max_completion_tokens' for reasoning models), causing OpenAI to return 400 for every reasoning model. Mocked unit tests didn't catch it because the mock fetch never validated the body. Live smoke caught it on the first real call. Adds two regression unit tests.
  • 5118c06 test(live): opt-in provider eval suites for OpenAI / Ollama / Agent. Excluded from default pnpm test, gated by *_LIVE_SUITE=smoke|extended env vars, never run in CI.

Claims now empirically verified

| Claim | Verified |
| --- | --- |
| OpenAI: output = completion_tokens (reasoning_tokens are a subset, not additive) | gpt-5.4-nano w/ reasoning_effort: low |
| OpenAI: cacheRead populates from prompt_tokens_details.cached_tokens on ≥1024-token prefix match | gpt-5.4-nano |
| OpenAI: totalTokens matches wire total_tokens | gpt-5.4-nano |
| Cost rates apply (cost.total = sum(cost.*)) | ✅ OpenAI + Ollama |
| Ollama: thinking content has no token count; usage.reasoning stays 0 | qwen3:0.6b w/ think: true |
| Thinking events round-trip end-to-end | qwen3:0.6b |
| Cumulative state.totalUsage equals field-wise sum across turns | ✅ via Agent on gpt-5.4-nano |

Local results

  • pnpm test (default, no env): 109/109 pass — live tests correctly excluded
  • pnpm test:live:openai:extended: 5/5 pass
  • pnpm test:live:ollama:extended: 12/12 pass
  • pnpm test:live:agent:extended: 3/3 pass

Decision log updates

  • LLM_METADATA_DECISIONS.md #19: live eval gating policy
  • LLM_METADATA_DECISIONS.md #20: compat precedence rule

Deferred (known gap)

  • Anthropic live eval: same metadata surface as OpenAI/Ollama, same correctness risk. Not in scope for this pass; the shared infra (tools/test/load-env.js, tools/test/live.ts, runner script template) is provider-neutral and ready to extend.

pyramation marked this pull request as ready for review May 21, 2026 20:54
pyramation merged commit 79067c6 into main May 21, 2026
12 checks passed