constructive-io · pyramation · May 21, 2026
diff --git a/.gitignore b/.gitignore
@@ -4,4 +4,8 @@
 **/yarn-error.log
 lerna-debug.log
 **/src/*.js
-**/src/*.d.ts
+**/src/*.d.ts
+.env
+.env.local
+**/.env
+**/.env.local
diff --git a/LLM_METADATA_DECISIONS.md b/LLM_METADATA_DECISIONS.md
@@ -0,0 +1,180 @@
+# LLM Call Metadata Decisions
+
+Date: 2026-05-21
+
+Companion to [`REDESIGN_DECISIONS.md`](./REDESIGN_DECISIONS.md). Records the design
+choices behind the token-usage / cost / call-metadata surface that billing and
+metering consumers depend on. Append-only — each new decision gets the next
+number; older entries stay as-is even when superseded (superseding entries
+explicitly reference the entry they replace).
+
+Tracking issue: [constructive-planning #907](https://github.com/constructive-io/constructive-planning/issues/907).
+
+## Usage shape
+
+1. **Reasoning is a subset of `output`, not a sibling.** `output` keeps the
+   `completion_tokens` value the provider reports (which already includes
+   reasoning per OpenAI's wire contract), and `reasoning` is exposed as a
+   separate read-only count. The `totalTokens` invariant remains
+   `input + output + cacheRead + cacheWrite` — adding `reasoning` to the total
+   would double-count, since the provider already folded it into
+   `completion_tokens` upstream. Billing derives pure-completion tokens as
+   `output - reasoning` when it needs a separate rate.
+
+2. **Anthropic `reasoning` stays zero.** The Anthropic Messages API does not
+   expose a reasoning-token count even when extended thinking is on; the cost
+   of thinking blocks is server-side folded into `output_tokens`. We do not
+   fabricate a value or estimate from thinking-content character counts.
+
+3. **Ollama `reasoning` stays zero.** Ollama's native API reports only
+   `prompt_eval_count` and `eval_count`; there is no reasoning breakdown.
+   Same policy as Anthropic — leave the field at zero rather than guess.
+
+4. **No OpenAI-named alias fields on `Usage`.** The canonical shape stays
+   `input` / `output` / `reasoning` / `cacheRead` / `cacheWrite` /
+   `totalTokens`. Billing and downstream consumers translate at the boundary
+   (`prompt_tokens → input`, `completion_tokens → output`,
+   `reasoning_tokens → reasoning`, `total_tokens → totalTokens`, plus the
+   cache fields). Adding aliases would either duplicate state or invite drift.
+
+5. **No separate cost rate for reasoning tokens.** Reasoning cost is folded
+   into the output rate via `model.cost.output`. Every model we currently ship
+   prices reasoning at the same rate as output. Add a `model.cost.reasoning`
+   schedule field only when we onboard a model that prices reasoning
+   separately.
+
+## Aggregation surface
+
+6. **Cumulative usage lives on `AgentState.totalUsage` and on the
+   `agent_end` and `turn_end` events.** Reset on `prompt()`, preserved across
+   `continue()` — matching `stepCount` semantics. Consumers should not have
+   to re-walk `messages[]` to derive a sum we already compute. Per-message
+   usage remains accessible at `messages[i].usage`.
+
+7. **`useChat` exposes a single `usage` field (cumulative).** The React hook
+   surfaces `usage: Usage | null`, populated from `turn_end`/`agent_end`
+   events and reset to `null` on each new `prompt()`. Advanced consumers can
+   still inspect per-message usage by walking `messages`.
+
+## Provider implementation
+
+8. **Each provider package is standalone — no runtime dependency on
+   `agentic-kit` core.** `packages/anthropic`, `packages/openai`, and
+   `packages/ollama` each inline their own copies of the shared types
+   (`Usage`, `Message`, `ModelDescriptor`, etc.) and their own
+   `calculateUsageCost` helper. This is deliberate: provider packages must
+   be drop-in usable without pulling the agentic-kit hub. Sync between the
+   canonical type in `packages/agentic-kit/src/types.ts` and the per-provider
+   copies is a maintenance cost we accept. Any change to `Usage` must land in
+   all four locations. Earlier plan drafts proposed lifting
+   `calculateUsageCost` to the shared package and importing it everywhere —
+   that proposal is rejected here. (Only `packages/agent` depends on
+   `agentic-kit`; it imports `addUsage` from the hub for cumulative-usage
+   accumulation.)
+
+9. **Ollama calls a local `calculateUsageCost` on the final payload.** Prior
+   to this change, the Ollama adapter set `usage.input`/`usage.output`/
+   `totalTokens` but never invoked any cost calculator — so `cost.total`
+   stayed at zero even when `model.cost` was populated. Fixed by adding a
+   local `calculateUsageCost` helper (mirroring the ones in
+   `packages/anthropic` and `packages/openai`) and calling it in
+   `processPayload` after token counts are assigned.
+
+10. **OpenAI no longer double-counts `reasoning_tokens` into `output`.**
+    Previously, `applyUsage` did
+    `output = completion_tokens + reasoning_tokens` — but
+    `completion_tokens` already includes reasoning per OpenAI's contract.
+    Now: `output = completion_tokens`, `reasoning = reasoning_tokens`.
+
+11. **OpenAI `totalTokens` fallback includes `cacheWrite`.** Prior fallback
+    was `total_tokens ?? (input + output + cacheRead)`, which omitted
+    `cacheWrite`. The fix is currently a no-op for stock OpenAI (which doesn't
+    emit cache writes), but the old fallback broke the invariant for
+    OpenAI-compatible endpoints (OpenRouter) that do.
+
+12. **OpenRouter `prompt_tokens_details.cache_write_tokens` ingestion is
+    deferred.** No billing consumer currently asks for it. When a consumer
+    materializes, we add the read in `applyUsage` and the cost rate in the
+    relevant model descriptor — both small. Tracking under #907 follow-up.
+
+## Streaming and abort semantics
+
+13. **Anthropic writes `usage.input` at `message_start`, and overwrites on
+    `message_delta`.** This is intentional: it ensures input-token counts
+    survive an early stream abort (caller has the input cost even if the
+    completion never finishes). OpenAI providers only emit usage at the
+    terminal chunk, so an aborted OpenAI stream yields all-zero usage; this
+    is a provider-API limit, not something we paper over.
+
+## Out of scope (deferred, not declined)
+
+14. **Service-tier cost multipliers (OpenAI Responses API
+    `flex`/`priority`).** Not on the agentic-kit roadmap until we add the
+    Responses-API adapter. Pi-mono applies these as a post-hoc multiplier
+    on `usage.cost.*`; we'll follow the same pattern when needed.
+
+15. **Audio-token counts.** No consumer; add when speech I/O lands.
+
+16. **Per-session persistence / write-through to a database.** Billing's
+    consumer pulls from the event stream; storage is downstream of this
+    package's concern.
+
+17. **`totalUsage` on event emits is a shallow snapshot, not a live reference.**
+    The `turn_end` and `agent_end` events attach
+    `{ ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } }`
+    rather than the mutable state object directly. Why: `agent_end` already
+    does `[...this._state.messages]` (a shallow array copy) for the same
+    reason — listeners receive a stable value that won't change if the agent
+    continues running. `Usage` is a two-level object (`cost` is a nested
+    object literal), so the copy must be two levels deep. A full deep clone
+    (`JSON.parse(JSON.stringify(...))`) was rejected as overkill for a small
+    numeric object; `structuredClone` was rejected as unnecessary verbosity
+    for the same reason. Downstream SSE serialisation (which JSON-serialises
+    the event anyway) would have made a live reference safe in practice, but
+    the shallow-copy convention is consistent with the `messages` precedent
+    and makes the event contract independent of the serialisation path.
+
+18. **`useChat` resets `usage` at the start of `runStream`, not at the
+    `send` / `sendMessages` / `respondWithDecision` call sites.** All three
+    entry-points flow through `runStream`, so the reset is centralised there.
+    This avoids three separate call-site edits and ensures the reset fires
+    unconditionally for every new request — including decision-resume
+    requests via `respondWithDecision`. Mirrors the agent-side rule from
+    decision #6 (reset on each new request, not on `continue()`).
+
+19. **Live provider eval suites are opt-in, `.env`-loaded, excluded from
+    default `pnpm test` via `testPathIgnorePatterns`, and never run in CI.**
+    Three suites land: `packages/openai/__tests__/openai.live.test.ts`,
+    `packages/ollama/__tests__/ollama.live.test.ts` (extended with a new
+    `Ollama live token-usage audit` block), and
+    `packages/agent/__tests__/agent.live.test.ts`. Each suite is gated by
+    `<NAMESPACE>_LIVE_SUITE=smoke|extended` (e.g. `OPENAI_LIVE_SUITE`); the
+    `pnpm test:live:<provider>{,:smoke,:extended}` runners set
+    `*_LIVE_READY=1` which both un-ignores the file in Jest config and
+    disables the `global.fetch = jest.fn()` mock in `openai/jest.setup.js`.
+    A shared `tools/test/load-env.js` walks up to find a workspace `.env`
+    and is silent if absent, so CI is unaffected. Why: empirical wire-shape
+    verification is the only way to confirm load-bearing claims like
+    "`completion_tokens` already includes `reasoning_tokens`" — but live
+    suites are expensive (real tokens) and require secrets, so they must
+    stay out of the default loop. How to apply: when changing usage
+    extraction, header construction, or any wire-shape detail, run the
+    matching `pnpm test:live:*:extended` locally before merging. The
+    `.gitignore` was updated to cover `.env` / `.env.local` to close a
+    secrets-leak gap.
+
+20. **Adapter-default `compat` must be the base layer of `createModel`'s
+    merge, not the override layer.** The original spread order was
+    `{ ...builtIn.compat, ...this.compat, ...overrides.compat }`, which
+    silently clobbered model-specific settings (notably
+    `maxTokensField: 'max_completion_tokens'` for reasoning-capable models)
+    with the adapter's generic default (`'max_tokens'`). OpenAI returned
+    400 (`Unsupported parameter: 'max_tokens'`) for `gpt-5.4-nano`. The
+    mock-mode unit tests didn't catch it because the mocked `fetch` never
+    validated the body. The live smoke test caught it on the very first
+    real call. Why: model-specific knowledge in the built-in catalog is
+    more authoritative than weak adapter defaults; user-provided overrides
+    are most authoritative of all. How to apply: spread order is now
+    `{ ...this.compat, ...builtIn.compat, ...overrides.compat }` — same
+    rule for `headers`. Same precedence rule should be applied any time a
+    new merge of compat-like fields is introduced.
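
Decision 20 depends only on object-spread precedence, where later spreads win. A minimal sketch, assuming a simplified `Compat` type with the single `maxTokensField` option named in the decision (`mergeCompat` and the variable names are hypothetical, not the adapter's real code):

```typescript
// Hypothetical, simplified compat shape; only `maxTokensField` comes from the decision text.
interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
}

// Later spreads win: adapter default < built-in catalog < caller overrides.
function mergeCompat(adapterDefault: Compat, builtIn: Compat, overrides: Compat): Compat {
  return { ...adapterDefault, ...builtIn, ...overrides };
}

const adapterDefault: Compat = { maxTokensField: 'max_tokens' };
const builtIn: Compat = { maxTokensField: 'max_completion_tokens' };

// Fixed order: the model-specific catalog entry survives the merge.
const fixed: Compat = mergeCompat(adapterDefault, builtIn, {});

// Original order: the adapter's generic default clobbers the catalog entry.
const buggy: Compat = { ...builtIn, ...adapterDefault };

console.log(fixed.maxTokensField, buggy.maxTokensField);
```

One caveat with this pattern: a key explicitly set to `undefined` in a later layer still overwrites an earlier value, so layers should omit keys rather than set them to `undefined`.
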
diff --git a/apps/tanstack-chat-demo/src/lib/use-chat.ts b/apps/tanstack-chat-demo/src/lib/use-chat.ts
@@ -76,6 +76,7 @@ export function useChat() {
                 usage: {
                   input: 0,
                   output: 0,
+                  reasoning: 0,
                   cacheRead: 0,
                   cacheWrite: 0,
                   totalTokens: 0,
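
Note on the hunk above: the new `reasoning: 0` default matches the canonical `Usage` shape from decisions 1, 4, 10 and 11 in `LLM_METADATA_DECISIONS.md`. As a minimal sketch of that mapping (not the adapter's actual `applyUsage`; `WireUsage`, `toUsage` and `pureCompletionTokens` are hypothetical names, `cost` is left out, and cache extraction is omitted so both cache counts stay at zero), an OpenAI-style usage payload could be normalised like this:

```typescript
// Canonical token counts, without the cost block, for illustration only.
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Subset of an OpenAI-style chat-completions `usage` payload.
interface WireUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  total_tokens?: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

function toUsage(wire: WireUsage): Usage {
  const input = wire.prompt_tokens ?? 0;
  // completion_tokens already includes reasoning tokens, so it maps to output unchanged.
  const output = wire.completion_tokens ?? 0;
  // reasoning is a read-only subset of output and is never added to the total.
  const reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  // Assumption: cache extraction is skipped in this sketch.
  const cacheRead = 0;
  const cacheWrite = 0;
  // Prefer the provider's total; otherwise fall back to the four-term invariant.
  const totalTokens = wire.total_tokens ?? (input + output + cacheRead + cacheWrite);
  return { input, output, reasoning, cacheRead, cacheWrite, totalTokens };
}

// Billing-side derivation: completion tokens that were not reasoning.
function pureCompletionTokens(usage: Usage): number {
  return usage.output - usage.reasoning;
}

const usage = toUsage({
  prompt_tokens: 12,
  completion_tokens: 40,
  completion_tokens_details: { reasoning_tokens: 32 },
});
console.log(usage.totalTokens, pureCompletionTokens(usage));
```
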

diff --git a/package.json b/package.json
@@ -22,6 +22,12 @@
     "typecheck": "node ./scripts/typecheck.js",
     "test:live:ollama": "pnpm --filter @agentic-kit/ollama run test:live:smoke",
     "test:live:ollama:extended": "pnpm --filter @agentic-kit/ollama run test:live:extended",
+    "test:live:openai": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:smoke": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:extended": "pnpm --filter @agentic-kit/openai run test:live:extended",
+    "test:live:agent": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:smoke": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:extended": "pnpm --filter @agentic-kit/agent run test:live:extended",
     "lint": "pnpm -r run lint",
     "internal:deps": "makage update-workspace",
     "deps": "pnpm up -r -i -L"
@@ -32,6 +38,7 @@
     "@types/node": "^20.12.7",
     "@typescript-eslint/eslint-plugin": "^8.58.2",
     "@typescript-eslint/parser": "^8.58.2",
+    "dotenv": "^16.4.5",
     "eslint": "^9.39.2",
     "eslint-config-prettier": "^10.1.8",
     "eslint-plugin-simple-import-sort": "^12.1.0",

diff --git a/packages/agent/__tests__/agent.live.test.ts b/packages/agent/__tests__/agent.live.test.ts
@@ -0,0 +1,123 @@
+import { OpenAIAdapter } from '@agentic-kit/openai';
+import { createUserMessage, type AssistantMessage } from 'agentic-kit';
+
+import { Agent } from '../src';
+
+const modelId = process.env.OPENAI_LIVE_MODEL ?? 'gpt-5.4-nano';
+const apiKey = process.env.OPENAI_API_KEY;
+
+if (!apiKey) {
+  throw new Error('Missing required env var: OPENAI_API_KEY');
+}
+
+const liveSuite = process.env.AGENT_LIVE_SUITE ?? 'smoke';
+const runSmoke = liveSuite === 'smoke' || liveSuite === 'extended';
+const runExtended = liveSuite === 'extended';
+const describeSmoke = runSmoke ? describe : describe.skip;
+const describeExtended = runExtended ? describe : describe.skip;
+
+describeSmoke('Agent live smoke', () => {
+  jest.setTimeout(60_000);
+
+  it('single turn populates state.totalUsage from the assistant message', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('Reply with the single word PONG.');
+
+    expect(agent.state.totalUsage.input).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.output).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.totalTokens).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.cost.total).toBeGreaterThan(0);
+
+    const lastAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    // Single turn: the per-message usage IS the cumulative total.
+    expect(agent.state.totalUsage.input).toBe(lastAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(lastAssistant.usage.output);
+    expect(agent.state.totalUsage.reasoning).toBe(lastAssistant.usage.reasoning);
+    expect(agent.state.totalUsage.cacheRead).toBe(lastAssistant.usage.cacheRead);
+    expect(agent.state.totalUsage.cacheWrite).toBe(lastAssistant.usage.cacheWrite);
+    expect(agent.state.totalUsage.totalTokens).toBe(lastAssistant.usage.totalTokens);
+  });
+});
+
+describeExtended('Agent live extended', () => {
+  jest.setTimeout(120_000);
+
+  it('state.totalUsage equals field-wise sum across two turns', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('What is 2 + 2? Reply with just the number.');
+
+    const t1Usage = {
+      ...agent.state.totalUsage,
+      cost: { ...agent.state.totalUsage.cost },
+    };
+
+    // continue() does not accept text; append the follow-up user message first.
+    agent.appendMessage(createUserMessage('Now what is that doubled? Reply with just the number.'));
+    await agent.continue();
+
+    const lastAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    expect(agent.state.totalUsage.input).toBe(t1Usage.input + lastAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(t1Usage.output + lastAssistant.usage.output);
+    expect(agent.state.totalUsage.reasoning).toBe(t1Usage.reasoning + lastAssistant.usage.reasoning);
+    expect(agent.state.totalUsage.cacheRead).toBe(t1Usage.cacheRead + lastAssistant.usage.cacheRead);
+    expect(agent.state.totalUsage.cacheWrite).toBe(t1Usage.cacheWrite + lastAssistant.usage.cacheWrite);
+    expect(agent.state.totalUsage.totalTokens).toBe(t1Usage.totalTokens + lastAssistant.usage.totalTokens);
+    expect(agent.state.totalUsage.cost.input).toBeCloseTo(
+      t1Usage.cost.input + lastAssistant.usage.cost.input,
+      10
+    );
+    expect(agent.state.totalUsage.cost.output).toBeCloseTo(
+      t1Usage.cost.output + lastAssistant.usage.cost.output,
+      10
+    );
+    expect(agent.state.totalUsage.cost.total).toBeCloseTo(
+      t1Usage.cost.total + lastAssistant.usage.cost.total,
+      10
+    );
+  });
+
+  it('prompt() resets totalUsage; continue() preserves it', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('Reply with the single word A.');
+    const firstTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };
+
+    agent.appendMessage(createUserMessage('Reply with the single word B.'));
+    await agent.continue();
+    const secondTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };
+
+    // continue() must not reset: cumulative totals must grow after a second turn.
+    expect(secondTotals.input).toBeGreaterThan(firstTotals.input);
+    expect(secondTotals.totalTokens).toBeGreaterThan(firstTotals.totalTokens);
+    expect(agent.state.totalUsage.input).toBeGreaterThanOrEqual(firstTotals.input);
+
+    await agent.prompt('Reply with the single word C.');
+
+    const thirdAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    // prompt() resets: the new total should be one turn's worth, not cumulative
+    // across all three. The exact-equality checks below against the third
+    // assistant message are the real proof of the reset; the loose upper bound
+    // is only a coarse sanity check, since live token counts vary per run.
+    expect(agent.state.totalUsage.input).toBeLessThan(secondTotals.input + 100);
+    expect(agent.state.totalUsage.totalTokens).toBe(thirdAssistant.usage.totalTokens);
+    expect(agent.state.totalUsage.input).toBe(thirdAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(thirdAssistant.usage.output);
+  });
+});
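
The tests above rely on two behaviours from decisions 6 and 17: usage accumulates field-wise across turns, and a snapshot must copy the nested `cost` object to stay stable. A minimal sketch of both, assuming a `Usage` shape with only the `cost` fields these tests read (`emptyUsage`, `accumulateInto` and `snapshotUsage` are hypothetical names; the real `addUsage` signature is not shown in this diff):

```typescript
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { input: number; output: number; total: number };
}

function emptyUsage(): Usage {
  return {
    input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
    cost: { input: 0, output: 0, total: 0 },
  };
}

// Adds one turn's usage into the running total, mutating `total` in place.
function accumulateInto(total: Usage, turn: Usage): void {
  total.input += turn.input;
  total.output += turn.output;
  total.reasoning += turn.reasoning;
  total.cacheRead += turn.cacheRead;
  total.cacheWrite += turn.cacheWrite;
  total.totalTokens += turn.totalTokens;
  total.cost.input += turn.cost.input;
  total.cost.output += turn.cost.output;
  total.cost.total += turn.cost.total;
}

// Two-level copy: the top-level fields plus the nested `cost` object.
function snapshotUsage(usage: Usage): Usage {
  return { ...usage, cost: { ...usage.cost } };
}

// Made-up numbers, chosen so the floating-point sums are exact.
const turn1: Usage = { ...emptyUsage(), input: 20, output: 5, totalTokens: 25, cost: { input: 0.5, output: 0.25, total: 0.75 } };
const turn2: Usage = { ...emptyUsage(), input: 40, output: 6, totalTokens: 46, cost: { input: 0.5, output: 0.25, total: 0.75 } };

const total = emptyUsage();          // what prompt() would start from
accumulateInto(total, turn1);
const afterTurn1 = snapshotUsage(total);
const oneLevelCopy = { ...total };   // still shares `cost` with `total`
accumulateInto(total, turn2);        // what continue() would add

console.log(total.input, afterTurn1.input, afterTurn1.cost.total, oneLevelCopy.cost.total);
```

The one-level copy keeps its own `input` but sees the later `cost` mutation, which is the reason the snapshot copies two levels.
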