6 changes: 5 additions & 1 deletion .gitignore
@@ -4,4 +4,8 @@
**/yarn-error.log
lerna-debug.log
**/src/*.js
**/src/*.d.ts
**/src/*.d.ts
.env
.env.local
**/.env
**/.env.local
180 changes: 180 additions & 0 deletions LLM_METADATA_DECISIONS.md
@@ -0,0 +1,180 @@
# LLM Call Metadata Decisions

Date: 2026-05-21

Companion to [`REDESIGN_DECISIONS.md`](./REDESIGN_DECISIONS.md). Records the design
choices behind the token-usage / cost / call-metadata surface that billing and
metering consumers depend on. Append-only — each new decision gets the next
number; older entries stay as-is even when superseded (superseding entries
explicitly reference the entry they replace).

Tracking issue: [constructive-planning #907](https://github.com/constructive-io/constructive-planning/issues/907).

## Usage shape

1. **Reasoning is a subset of `output`, not a sibling.** `output` keeps the
`completion_tokens` value the provider reports (which already includes
reasoning per OpenAI's wire contract), and `reasoning` is exposed as a
separate read-only count. The `totalTokens` invariant remains
`input + output + cacheRead + cacheWrite` — adding `reasoning` to the total
would double-count, since the provider already folded it into
`completion_tokens` upstream. Billing derives pure-completion tokens as
`output - reasoning` when it needs a separate rate.

2. **Anthropic `reasoning` stays zero.** The Anthropic Messages API does not
expose a reasoning-token count even when extended thinking is on; the cost
of thinking blocks is server-side folded into `output_tokens`. We do not
fabricate a value or estimate from thinking-content character counts.

3. **Ollama `reasoning` stays zero.** Ollama's native API reports only
`prompt_eval_count` and `eval_count`; there is no reasoning breakdown.
Same policy as Anthropic — leave the field at zero rather than guess.

4. **No OpenAI-named alias fields on `Usage`.** The canonical shape stays
`input` / `output` / `reasoning` / `cacheRead` / `cacheWrite` /
`totalTokens`. Billing and downstream consumers translate at the boundary
(`prompt_tokens → input`, `completion_tokens → output`,
`reasoning_tokens → reasoning`, `total_tokens → totalTokens`, plus the
cache fields). Adding aliases would either duplicate state or invite drift.

5. **No separate cost rate for reasoning tokens.** Reasoning cost is folded
into the output rate via `model.cost.output`. Every model we currently ship
prices reasoning at the same rate as output. Add a `model.cost.reasoning`
schedule field only when we onboard a model that prices reasoning
separately.
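
The usage-shape rules above can be sketched as a small boundary translator. This is a minimal illustration, not the package's actual code: `OpenAIWireUsage`, `fromOpenAIUsage`, and `pureCompletionTokens` are hypothetical names, the wire shape is an assumed subset of an OpenAI-style payload, and the cache fields are left at zero for brevity.

```typescript
// Canonical usage shape named in the decisions above (cost fields omitted here).
interface Usage {
  input: number;
  output: number; // provider-reported completion tokens, reasoning included
  reasoning: number; // read-only subset of `output`
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Assumed subset of an OpenAI-style usage payload (hypothetical type name).
interface OpenAIWireUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens?: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

// Hypothetical boundary translator: wire names in, canonical names out.
function fromOpenAIUsage(wire: OpenAIWireUsage): Usage {
  const input = wire.prompt_tokens;
  const output = wire.completion_tokens; // reasoning is already inside this number
  const reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  const cacheRead = 0; // cache ingestion left out of this sketch
  const cacheWrite = 0;
  // Invariant: reasoning is NOT added, because it is a subset of output.
  const totalTokens =
    wire.total_tokens ?? (input + output + cacheRead + cacheWrite);
  return { input, output, reasoning, cacheRead, cacheWrite, totalTokens };
}

// Billing-side derivation for a separate "pure completion" rate.
function pureCompletionTokens(usage: Usage): number {
  return usage.output - usage.reasoning;
}

const exampleUsage = fromOpenAIUsage({
  prompt_tokens: 100,
  completion_tokens: 50,
  completion_tokens_details: { reasoning_tokens: 30 },
});
console.log(exampleUsage.totalTokens, pureCompletionTokens(exampleUsage));
```

Keeping the translation in one function is what lets `Usage` stay free of OpenAI-named aliases: consumers see only the canonical fields.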

## Aggregation surface

6. **Cumulative usage lives on `AgentState.totalUsage` and on the
`agent_end` and `turn_end` events.** Reset on `prompt()`, preserved across
`continue()` — matching `stepCount` semantics. Consumers should not have
to re-walk `messages[]` to derive a sum we already compute. Per-message
usage remains accessible at `messages[i].usage`.

7. **`useChat` exposes a single `usage` field (cumulative).** The React hook
surfaces `usage: Usage | null`, populated from `turn_end`/`agent_end`
events and reset to `null` on each new `prompt()`. Advanced consumers can
still inspect per-message usage by walking `messages`.
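
The reset-on-`prompt()`, preserve-on-`continue()` rule can be sketched as follows. This is an illustration under stated assumptions: `UsageTracker`, `emptyUsage`, and `continueWith` are hypothetical names, and the `addUsage` shown here is a stand-in written for this example rather than the hub's real helper, whose signature is not shown in this document.

```typescript
interface UsageCost {
  input: number;
  output: number;
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

const emptyUsage = (): Usage => ({
  input: 0,
  output: 0,
  reasoning: 0,
  cacheRead: 0,
  cacheWrite: 0,
  totalTokens: 0,
  cost: { input: 0, output: 0, total: 0 },
});

// Stand-in for the hub's addUsage: field-wise sum, returning a fresh object.
function addUsage(a: Usage, b: Usage): Usage {
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    reasoning: a.reasoning + b.reasoning,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: {
      input: a.cost.input + b.cost.input,
      output: a.cost.output + b.cost.output,
      total: a.cost.total + b.cost.total,
    },
  };
}

// Hypothetical tracker mirroring the AgentState.totalUsage semantics.
class UsageTracker {
  totalUsage: Usage = emptyUsage();

  // A new prompt starts from zero, then records the turn.
  prompt(turn: Usage): void {
    this.totalUsage = addUsage(emptyUsage(), turn);
  }

  // A continuation accumulates on top of the existing total.
  continueWith(turn: Usage): void {
    this.totalUsage = addUsage(this.totalUsage, turn);
  }
}
```

A consumer reading `totalUsage` after each turn therefore never needs to walk `messages[]` to rebuild the sum.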

## Provider implementation

8. **Each provider package is standalone — no runtime dependency on
`agentic-kit` core.** `packages/anthropic`, `packages/openai`, and
`packages/ollama` each inline their own copies of the shared types
(`Usage`, `Message`, `ModelDescriptor`, etc.) and their own
`calculateUsageCost` helper. This is deliberate: provider packages must
be drop-in usable without pulling the agentic-kit hub. Sync between the
canonical type in `packages/agentic-kit/src/types.ts` and the per-provider
copies is a maintenance cost we accept. Any change to `Usage` must land in
all four locations. Earlier plan drafts proposed lifting
`calculateUsageCost` to the shared package and importing it everywhere —
that proposal is rejected here. (Only `packages/agent` depends on
`agentic-kit`; it imports `addUsage` from the hub for cumulative-usage
accumulation.)

9. **Ollama calls a local `calculateUsageCost` on the final payload.** Prior
to this change, the Ollama adapter set `usage.input`/`usage.output`/
`totalTokens` but never invoked any cost calculator — so `cost.total`
stayed at zero even when `model.cost` was populated. Fixed by adding a
local `calculateUsageCost` helper (mirroring the ones in
`packages/anthropic` and `packages/openai`) and calling it in
`processPayload` after token counts are assigned.

10. **OpenAI no longer double-counts `reasoning_tokens` into `output`.**
Previously, `applyUsage` did
`output = completion_tokens + reasoning_tokens` — but
`completion_tokens` already includes reasoning per OpenAI's contract.
Now: `output = completion_tokens`, `reasoning = reasoning_tokens`.

11. **OpenAI `totalTokens` fallback includes `cacheWrite`.** The prior fallback
was `total_tokens ?? (input + output + cacheRead)`, which omitted `cacheWrite`.
The omission was harmless for stock OpenAI (which doesn't emit cache writes), but
it broke the invariant for OpenAI-compatible endpoints (e.g. OpenRouter) that
do.

12. **OpenRouter `prompt_tokens_details.cache_write_tokens` ingestion is
deferred.** No billing consumer currently asks for it. When a consumer
materializes, we add the read in `applyUsage` and the cost rate in the
relevant model descriptor — both small. Tracking under #907 follow-up.
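
A per-provider cost helper of the kind described above might look like the sketch below. The real `calculateUsageCost` signature and the units of `model.cost` are not shown in this document, so both are assumptions here: rates are taken to be USD per million tokens, and the type names `ModelCost`, `TokenCounts`, and `UsageCost` are hypothetical.

```typescript
// Assumed rate schedule: USD per one million tokens (an assumption, not confirmed).
interface ModelCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface TokenCounts {
  input: number;
  output: number; // includes reasoning tokens, so no separate reasoning rate
  cacheRead: number;
  cacheWrite: number;
}

interface UsageCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

// Hypothetical local helper: call it after token counts are assigned.
function calculateUsageCost(rates: ModelCost, tokens: TokenCounts): UsageCost {
  const price = (rate: number, count: number): number =>
    (rate * count) / 1_000_000;

  const input = price(rates.input, tokens.input);
  const output = price(rates.output, tokens.output);
  const cacheRead = price(rates.cacheRead, tokens.cacheRead);
  const cacheWrite = price(rates.cacheWrite, tokens.cacheWrite);

  return {
    input,
    output,
    cacheRead,
    cacheWrite,
    total: input + output + cacheRead + cacheWrite,
  };
}
```

Because reasoning tokens are billed through the output rate, the helper never reads a `reasoning` count; adding a separate rate later would be an additive change.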

## Streaming and abort semantics

13. **Anthropic writes `usage.input` at `message_start`, and overwrites on
`message_delta`.** This is intentional: it ensures input-token counts
survive an early stream abort (caller has the input cost even if the
completion never finishes). OpenAI providers only emit usage at the
terminal chunk, so an aborted OpenAI stream yields all-zero usage; this
is a provider-API limit, not something we paper over.
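
The abort behaviour described above can be illustrated with a small reducer over a simplified event stream. The event shapes below are a trimmed assumption modelled on Anthropic's `message_start` and `message_delta` streaming events; `applyStreamEvent` and `usageAfter` are hypothetical names, not functions from the adapter.

```typescript
// Simplified, assumed event shapes (only the usage-bearing fields are kept).
type StreamEvent =
  | { type: "message_start"; message: { usage: { input_tokens: number } } }
  | {
      type: "message_delta";
      usage: { input_tokens?: number; output_tokens: number };
    };

interface StreamUsage {
  input: number;
  output: number;
}

// Write input at message_start; overwrite with message_delta values when present.
function applyStreamEvent(usage: StreamUsage, event: StreamEvent): StreamUsage {
  switch (event.type) {
    case "message_start":
      return { ...usage, input: event.message.usage.input_tokens };
    case "message_delta":
      return {
        input: event.usage.input_tokens ?? usage.input,
        output: event.usage.output_tokens,
      };
  }
}

// Simulate an abort by processing only the first `abortAfter` events.
function usageAfter(events: StreamEvent[], abortAfter: number): StreamUsage {
  return events
    .slice(0, abortAfter)
    .reduce(applyStreamEvent, { input: 0, output: 0 });
}

const demoEvents: StreamEvent[] = [
  { type: "message_start", message: { usage: { input_tokens: 25 } } },
  { type: "message_delta", usage: { output_tokens: 40 } },
];
console.log(usageAfter(demoEvents, 1), usageAfter(demoEvents, 2));
```

With this shape, a stream aborted right after `message_start` still reports a non-zero input count, which is the property the decision relies on.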

## Out of scope (deferred, not declined)

14. **Service-tier cost multipliers (OpenAI Responses API
`flex`/`priority`).** Not on the agentic-kit roadmap until we add the
Responses-API adapter. Pi-mono applies these as a post-hoc multiplier
on `usage.cost.*`; we'll follow the same pattern when needed.

15. **Audio-token counts.** No consumer; add when speech I/O lands.

16. **Per-session persistence / write-through to a database.** Billing's
consumer pulls from the event stream; storage is downstream of this
package's concern.

17. **`totalUsage` on event emits is a copied snapshot, not a live reference.**
The `turn_end` and `agent_end` events attach
`{ ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } }`
rather than the mutable state object directly. Why: listeners should receive
a stable value that won't change if the agent continues running, which is
the same reason `agent_end` already emits `[...this._state.messages]` (a
shallow array copy). `Usage` is a two-level object (`cost` is a nested
object literal), so the spread is applied at both levels; a single-level
spread would still share the `cost` object with live state. A full deep clone
(`JSON.parse(JSON.stringify(...))`) was rejected as overkill for a small,
purely numeric object; `structuredClone` was rejected as unnecessary
verbosity for the same reason. Downstream SSE serialisation (which
JSON-serialises the event anyway) would have made a live reference safe in
practice, but the copy-on-emit convention is consistent with the `messages`
precedent and makes the event contract independent of the serialisation path.

18. **`useChat` resets `usage` at the start of `runStream`, not at the
`send` / `sendMessages` / `respondWithDecision` call sites.** All three
entry-points flow through `runStream`, so the reset is centralised there.
This avoids three separate call-site edits and ensures the reset fires
unconditionally for every new request — including decision-resume
requests via `respondWithDecision`. Mirrors the agent-side rule from
decision #6 (reset on each new request, not on `continue()`).

19. **Live provider eval suites are opt-in, `.env`-loaded, excluded from
default `pnpm test` via `testPathIgnorePatterns`, and never run in CI.**
Three suites land: `packages/openai/__tests__/openai.live.test.ts`,
`packages/ollama/__tests__/ollama.live.test.ts` (extended with a new
`Ollama live token-usage audit` block), and
`packages/agent/__tests__/agent.live.test.ts`. Each suite is gated by
`<NAMESPACE>_LIVE_SUITE=smoke|extended` (e.g. `OPENAI_LIVE_SUITE`); the
`pnpm test:live:<provider>{,:smoke,:extended}` runners set
`*_LIVE_READY=1` which both un-ignores the file in Jest config and
disables the `global.fetch = jest.fn()` mock in `openai/jest.setup.js`.
A shared `tools/test/load-env.js` walks up to find a workspace `.env`
and is silent if absent, so CI is unaffected. Why: empirical wire-shape
verification is the only way to confirm load-bearing claims like
"`completion_tokens` already includes `reasoning_tokens`" — but live
suites are expensive (real tokens) and require secrets, so they must
stay out of the default loop. How to apply: when changing usage
extraction, header construction, or any wire-shape detail, run the
matching `pnpm test:live:*:extended` locally before merging. The
`.gitignore` was updated to cover `.env` / `.env.local` to close a
secrets-leak gap.

20. **Adapter-default `compat` must be the base layer of `createModel`'s
merge, not the override layer.** The original spread order was
`{ ...builtIn.compat, ...this.compat, ...overrides.compat }`, which
silently clobbered model-specific settings (notably
`maxTokensField: 'max_completion_tokens'` for reasoning-capable models)
with the adapter's generic default (`'max_tokens'`). OpenAI returned
400 (`Unsupported parameter: 'max_tokens'`) for `gpt-5.4-nano`. The
mock-mode unit tests didn't catch it because the mocked `fetch` never
validated the body. The live smoke test caught it on the very first
real call. Why: model-specific knowledge in the built-in catalog is
more authoritative than weak adapter defaults; user-provided overrides
are most authoritative of all. How to apply: spread order is now
`{ ...this.compat, ...builtIn.compat, ...overrides.compat }` — same
rule for `headers`. Same precedence rule should be applied any time a
new merge of compat-like fields is introduced.
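
The precedence rule in decision 20 comes down to object-spread order, where later spreads win. The sketch below contrasts the two orders; `Compat`, `mergeCompatOld`, and `mergeCompatNew` are hypothetical names used only for this illustration, and the `Compat` type is reduced to the one field the decision mentions.

```typescript
// Reduced, hypothetical compat shape: only the field named in the decision.
interface Compat {
  maxTokensField?: "max_tokens" | "max_completion_tokens";
}

// Original order: adapter defaults were spread AFTER the built-in catalog entry,
// so the generic default clobbered the model-specific value.
function mergeCompatOld(adapter: Compat, builtIn: Compat, overrides: Compat): Compat {
  return { ...builtIn, ...adapter, ...overrides };
}

// Fixed order: adapter defaults are the base layer, the catalog refines them,
// and user-provided overrides win last.
function mergeCompatNew(adapter: Compat, builtIn: Compat, overrides: Compat): Compat {
  return { ...adapter, ...builtIn, ...overrides };
}

const adapterDefault: Compat = { maxTokensField: "max_tokens" };
const catalogEntry: Compat = { maxTokensField: "max_completion_tokens" };

console.log(mergeCompatOld(adapterDefault, catalogEntry, {}).maxTokensField);
console.log(mergeCompatNew(adapterDefault, catalogEntry, {}).maxTokensField);
```

One caveat of plain spreads: a key explicitly set to `undefined` in a later layer still overwrites an earlier value, so override objects should omit keys they do not intend to set.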
1 change: 1 addition & 0 deletions apps/tanstack-chat-demo/src/lib/use-chat.ts
@@ -76,6 +76,7 @@ export function useChat() {
usage: {
input: 0,
output: 0,
reasoning: 0,
cacheRead: 0,
cacheWrite: 0,
totalTokens: 0,
7 changes: 7 additions & 0 deletions package.json
@@ -22,6 +22,12 @@
"typecheck": "node ./scripts/typecheck.js",
"test:live:ollama": "pnpm --filter @agentic-kit/ollama run test:live:smoke",
"test:live:ollama:extended": "pnpm --filter @agentic-kit/ollama run test:live:extended",
"test:live:openai": "pnpm --filter @agentic-kit/openai run test:live:smoke",
"test:live:openai:smoke": "pnpm --filter @agentic-kit/openai run test:live:smoke",
"test:live:openai:extended": "pnpm --filter @agentic-kit/openai run test:live:extended",
"test:live:agent": "pnpm --filter @agentic-kit/agent run test:live:smoke",
"test:live:agent:smoke": "pnpm --filter @agentic-kit/agent run test:live:smoke",
"test:live:agent:extended": "pnpm --filter @agentic-kit/agent run test:live:extended",
"lint": "pnpm -r run lint",
"internal:deps": "makage update-workspace",
"deps": "pnpm up -r -i -L"
@@ -32,6 +38,7 @@
"@types/node": "^20.12.7",
"@typescript-eslint/eslint-plugin": "^8.58.2",
"@typescript-eslint/parser": "^8.58.2",
"dotenv": "^16.4.5",
"eslint": "^9.39.2",
"eslint-config-prettier": "^10.1.8",
"eslint-plugin-simple-import-sort": "^12.1.0",
123 changes: 123 additions & 0 deletions packages/agent/__tests__/agent.live.test.ts
@@ -0,0 +1,123 @@
import { OpenAIAdapter } from '@agentic-kit/openai';
import { createUserMessage, type AssistantMessage } from 'agentic-kit';

import { Agent } from '../src';

const modelId = process.env.OPENAI_LIVE_MODEL ?? 'gpt-5.4-nano';
const apiKey = process.env.OPENAI_API_KEY;

if (!apiKey) {
  throw new Error('Missing required env var: OPENAI_API_KEY');
}

const liveSuite = process.env.AGENT_LIVE_SUITE ?? 'smoke';
const runSmoke = liveSuite === 'smoke' || liveSuite === 'extended';
const runExtended = liveSuite === 'extended';
const describeSmoke = runSmoke ? describe : describe.skip;
const describeExtended = runExtended ? describe : describe.skip;

describeSmoke('Agent live smoke', () => {
  jest.setTimeout(60_000);

  it('single turn populates state.totalUsage from the assistant message', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('Reply with the single word PONG.');

    expect(agent.state.totalUsage.input).toBeGreaterThan(0);
    expect(agent.state.totalUsage.output).toBeGreaterThan(0);
    expect(agent.state.totalUsage.totalTokens).toBeGreaterThan(0);
    expect(agent.state.totalUsage.cost.total).toBeGreaterThan(0);

    const lastAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    // Single turn: the per-message usage IS the cumulative total.
    expect(agent.state.totalUsage.input).toBe(lastAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(lastAssistant.usage.output);
    expect(agent.state.totalUsage.reasoning).toBe(lastAssistant.usage.reasoning);
    expect(agent.state.totalUsage.cacheRead).toBe(lastAssistant.usage.cacheRead);
    expect(agent.state.totalUsage.cacheWrite).toBe(lastAssistant.usage.cacheWrite);
    expect(agent.state.totalUsage.totalTokens).toBe(lastAssistant.usage.totalTokens);
  });
});

describeExtended('Agent live extended', () => {
  jest.setTimeout(120_000);

  it('state.totalUsage equals field-wise sum across two turns', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('What is 2 + 2? Reply with just the number.');

    const t1Usage = {
      ...agent.state.totalUsage,
      cost: { ...agent.state.totalUsage.cost },
    };

    // continue() does not accept text; append the follow-up user message first.
    agent.appendMessage(createUserMessage('Now what is that doubled? Reply with just the number.'));
    await agent.continue();

    const lastAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    expect(agent.state.totalUsage.input).toBe(t1Usage.input + lastAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(t1Usage.output + lastAssistant.usage.output);
    expect(agent.state.totalUsage.reasoning).toBe(t1Usage.reasoning + lastAssistant.usage.reasoning);
    expect(agent.state.totalUsage.cacheRead).toBe(t1Usage.cacheRead + lastAssistant.usage.cacheRead);
    expect(agent.state.totalUsage.cacheWrite).toBe(t1Usage.cacheWrite + lastAssistant.usage.cacheWrite);
    expect(agent.state.totalUsage.totalTokens).toBe(t1Usage.totalTokens + lastAssistant.usage.totalTokens);
    expect(agent.state.totalUsage.cost.input).toBeCloseTo(
      t1Usage.cost.input + lastAssistant.usage.cost.input,
      10
    );
    expect(agent.state.totalUsage.cost.output).toBeCloseTo(
      t1Usage.cost.output + lastAssistant.usage.cost.output,
      10
    );
    expect(agent.state.totalUsage.cost.total).toBeCloseTo(
      t1Usage.cost.total + lastAssistant.usage.cost.total,
      10
    );
  });

  it('prompt() resets totalUsage; continue() preserves it', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('Reply with the single word A.');
    const firstTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };

    agent.appendMessage(createUserMessage('Reply with the single word B.'));
    await agent.continue();
    const secondTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };

    // continue() must not reset: cumulative totals must not have shrunk.
    expect(secondTotals.input).toBeGreaterThanOrEqual(firstTotals.input);
    expect(secondTotals.totalTokens).toBeGreaterThanOrEqual(firstTotals.totalTokens);
    expect(agent.state.totalUsage.input).toBeGreaterThanOrEqual(firstTotals.input);

    await agent.prompt('Reply with the single word C.');

    const thirdAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    // prompt() resets: the new total should be one turn's worth, not cumulative
    // across all three. The first assertion is only a loose upper bound, because
    // token counts vary between runs and cannot be pinned against the earlier
    // totals. The remaining assertions pin the reset total exactly to the third
    // assistant message's own usage.
    expect(agent.state.totalUsage.input).toBeLessThan(secondTotals.input + 100);
    expect(agent.state.totalUsage.totalTokens).toBe(thirdAssistant.usage.totalTokens);
    expect(agent.state.totalUsage.input).toBe(thirdAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(thirdAssistant.usage.output);
  });
});