
[FEATURE] Add prompt caching support for providers (currently Anthropic and Bedrock) #706

@arunkumarry

Description


Scope check

  • This is core LLM communication (not application logic)
  • This benefits most users (not just my use case)
  • This can't be solved in application code with current RubyLLM
  • I read the Contributing Guide

Due diligence

  • I searched existing issues
  • I checked the documentation

What problem does this solve?

When building applications that repeatedly send large static context (system prompts, RAG documents, tool definitions) alongside dynamic user queries, every request re-sends the full static payload and is billed for those tokens again. Both Anthropic and Gemini offer provider-level caching mechanisms to avoid this, but ruby_llm has no way to express cache points today.

Proposed solution

Add a cache_point: true keyword to Chat#with_instructions and Chat#ask that marks the static portion of a prompt as cacheable. The gem handles the provider-specific implementation transparently:
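A minimal sketch of the proposed surface, using a stand-in class for illustration (ChatSketch below is a stub written for this issue, not RubyLLM's actual implementation; the real wiring would live inside Chat):

```ruby
# Stub illustrating the proposed cache_point: keyword on with_instructions.
class ChatSketch
  attr_reader :instructions, :cache_points

  def initialize
    @instructions = []
    @cache_points = []
  end

  # cache_point: true marks this instruction block as a cache breakpoint;
  # the provider layer would later translate that into cache_control blocks
  # (Anthropic) or a Context Caching upload (Gemini).
  def with_instructions(text, cache_point: false)
    @instructions << text
    @cache_points << (@instructions.size - 1) if cache_point
    self
  end
end

chat = ChatSketch.new
chat.with_instructions('Large static system prompt...', cache_point: true)
chat.with_instructions('Short dynamic preamble')
chat.cache_points # => [0]
```

Application code never touches provider payloads; it only flags which parts of the prompt are static.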

Anthropic — injects cache_control: { type: 'ephemeral' } on the last content block of cache-pointed messages (up to 4 breakpoints per request).
Gemini — uploads static messages to the Context Caching API on first call, stores the cachedContent name on the chat object, and references it in subsequent generateContent requests instead of re-sending inline.
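The Anthropic side amounts to a small payload transformation. A hedged sketch in plain Ruby (apply_cache_point is a hypothetical helper name, not the gem's actual method; the message shape mirrors Anthropic's content-block format):

```ruby
# Hypothetical helper: mark the last content block of a message as cacheable,
# mirroring what the provider layer would do while rendering the payload.
def apply_cache_point(message)
  message[:content].last[:cache_control] = { type: 'ephemeral' }
  message
end

message = {
  role: 'user',
  content: [
    { type: 'text', text: 'Large static RAG context...' },
    { type: 'text', text: 'Tool definitions...' }
  ]
}

apply_cache_point(message)
# Only the last block carries the cache breakpoint:
# message[:content][1][:cache_control] => { type: 'ephemeral' }
```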

Anthropic input token usage:

First call — [screenshot: input token usage]

Next call — [screenshot: input token usage]

Why this belongs in RubyLLM

Prompt caching cannot be implemented in application code using existing RubyLLM primitives. Here's why:

For example, Anthropic requires cache_control: { type: 'ephemeral' } to be injected into specific content blocks inside the formatted message payload. That payload structure is entirely internal to Anthropic::Chat#render_payload — application code has no way to reach inside and modify individual content blocks after formatting. Caching also requires the anthropic-beta: prompt-caching-2024-07-31 request header, which can't be conditionally added based on message content from outside the gem.
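For illustration, the conditional header logic might look like this (a sketch assuming a simple array-of-hashes message format; anthropic_headers is a hypothetical name, while the header values themselves come from Anthropic's API):

```ruby
# Hypothetical sketch: add the prompt-caching beta header only when some
# content block actually carries a cache_control marker.
def anthropic_headers(messages)
  headers = { 'anthropic-version' => '2023-06-01' }
  cached = messages.any? do |m|
    Array(m[:content]).any? { |b| b.is_a?(Hash) && b.key?(:cache_control) }
  end
  headers['anthropic-beta'] = 'prompt-caching-2024-07-31' if cached
  headers
end

plain  = [{ role: 'user', content: [{ type: 'text', text: 'hi' }] }]
cached = [{ role: 'user',
            content: [{ type: 'text', text: 'ctx',
                        cache_control: { type: 'ephemeral' } }] }]

anthropic_headers(plain).key?('anthropic-beta')  # => false
anthropic_headers(cached).key?('anthropic-beta') # => true
```

This decision depends on the fully formatted payload, which is exactly the information application code cannot see from outside the gem.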

The user-facing API is a single keyword: cache_point: true on with_instructions or ask. That's it. The complexity is entirely hidden inside the provider layer where it belongs, consistent with how RubyLLM already abstracts streaming, tool calls, thinking tokens, and structured output across providers. Application code stays clean and provider-agnostic.

Metadata

Labels: enhancement (New feature or request)
