Scope check
Due diligence
What problem does this solve?
When building applications that repeatedly send large static context (system prompts, RAG documents, tool definitions) alongside dynamic user queries, every request re-sends the full token count. Both Anthropic and Gemini offer provider-level caching mechanisms to avoid this, but ruby_llm has no way to express cache points today.
Proposed solution
Add a `cache_point: true` keyword to `Chat#with_instructions` and `Chat#ask` that marks the static portion of a prompt as cacheable (see the usage sketch after the list below). The gem handles the provider-specific implementation transparently:
- Anthropic — injects `cache_control: { type: 'ephemeral' }` on the last content block of cache-pointed messages (up to 4 breakpoints per request)
- Gemini — uploads static messages to the Context Caching API on first call, stores the `cachedContent` name on the chat object, and references it in subsequent `generateContent` requests instead of re-sending inline.
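A sketch of how this might look from application code, assuming the `cache_point:` keyword lands as proposed (the model name and file path are illustrative):

```ruby
require 'ruby_llm'

chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Static portion: marked as a cache point so the provider can reuse it.
chat.with_instructions(File.read('system_prompt.md'), cache_point: true)

# Dynamic portion: re-sent fresh on every request as usual.
response = chat.ask('Summarize the latest support ticket')
puts response.content
```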
Anthropic input token usage:

- First call -
- Next call -
Why this belongs in RubyLLM
Prompt caching cannot be implemented in application code using existing RubyLLM primitives. Here's why:
For example, Anthropic requires `cache_control: { type: 'ephemeral' }` to be injected into specific content blocks inside the formatted message payload. The payload structure is entirely internal to `Anthropic::Chat#render_payload` — application code has no way to reach inside and modify individual content blocks after formatting. It also requires the `anthropic-beta: prompt-caching-2024-07-31` request header, which can't be conditionally added based on message content from outside the gem.
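To make the constraint concrete, here is a rough sketch of the kind of mutation the provider layer would need to perform after rendering. The helper name, the `cache_point_indexes` parameter, and the payload shape are assumptions for illustration, not the gem's actual internals:

```ruby
# Hypothetical sketch: after the Anthropic payload is rendered, tag the
# last content block of each cache-pointed message with cache_control.
def apply_cache_points!(payload, cache_point_indexes)
  cache_point_indexes.first(4).each do |i| # Anthropic allows up to 4 breakpoints
    blocks = payload[:messages][i][:content]
    next unless blocks.is_a?(Array) && blocks.any? # plain-string content has no blocks

    blocks.last[:cache_control] = { type: 'ephemeral' }
  end
end
```

The point is that this mutation has to happen after formatting, inside the provider layer; application code never sees the rendered content blocks, so it has nowhere to hook this in.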
The user-facing API is a single keyword: `cache_point: true` on `with_instructions` or `ask`. That's it. The complexity is hidden entirely inside the provider layer where it belongs, consistent with how RubyLLM already abstracts streaming, tool calls, thinking tokens, and structured output across providers. Application code stays clean and provider-agnostic.
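Under this design, the same application code could target either provider with only the model identifier changing (model names and `big_static_context` below are illustrative):

```ruby
%w[claude-3-5-sonnet gemini-1.5-pro].each do |model|
  chat = RubyLLM.chat(model: model)
  chat.with_instructions(big_static_context, cache_point: true)

  chat.ask('First question') # Anthropic: cache write; Gemini: cachedContent upload
  chat.ask('Follow-up')      # both providers reuse the cached static context
end
```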