Principle
When you send a request with prompt caching enabled, the system checks whether the prompt prefix has been cached by a recent query. If so, it reuses the cache, reducing processing time and cost; otherwise, it processes the full prompt and caches the prefix once the response begins. This is particularly useful in the following scenarios:
- Prompts containing numerous examples
- Extensive context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
Core Mechanism
Different model providers have varying levels of support for caching:
Automatic Caching
Automatic caching requires no additional configuration; the system automatically identifies and caches reusable content. It applies to providers such as OpenAI and DeepSeek.
OpenAI
- Minimum prompt length: 1024 tokens
- Cost: Writing to cache is free; reading from cache costs 0.25x to 0.5x the original price
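As a rough illustration of the pricing rule above, the sketch below estimates the input cost of a request whose prefix is served from cache. The base price and the 0.5x cached-read multiplier are hypothetical placeholders; actual rates vary by model.

```python
# Hypothetical illustration of OpenAI-style cache pricing:
# cache writes are free, cached reads cost a fraction of the base input price.
def input_cost(total_tokens, cached_tokens, base_price_per_mtok, cached_multiplier=0.5):
    """Estimate input cost in dollars for one request.

    total_tokens:  all input tokens in the prompt
    cached_tokens: prefix tokens served from cache (0 on a cache miss)
    """
    uncached = total_tokens - cached_tokens
    return (uncached * base_price_per_mtok
            + cached_tokens * base_price_per_mtok * cached_multiplier) / 1_000_000

# A 10,000-token prompt with an 8,000-token cached prefix, at a
# hypothetical $2.50 per million input tokens:
full = input_cost(10_000, 0, 2.50)      # no cache hit -> 0.025
hit = input_cost(10_000, 8_000, 2.50)   # 8k tokens read at 0.5x -> 0.015
```

With 80% of the prompt cached at the 0.5x rate, the input cost drops by 40% in this example.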
Gemini
- Implicit context caching is enabled by default, and caching is automatically effective without manual configuration.
- Caching is only effective when the content, model, and parameters are identical; any differences will be treated as a new request and will not hit the cache.
- The cache validity period (TTL) can be set by the developer or left unset; if unspecified, it defaults to 1 hour. There are no minimum or maximum duration limits, and cost depends on the number of cached tokens and the cache duration.
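For the developer-set TTL described above, a cache-creation request can be sketched as a payload along the lines of the Gemini cachedContents resource. The shape below is an approximation for illustration; the model name, placeholder text, and TTL value are all assumptions, and `ttl` is a duration string in seconds.

```python
# Sketch of an explicit Gemini cache-creation payload (key structure only).
# Model name and content are placeholders; "ttl" is a duration string in
# seconds. If omitted, the cache lifetime defaults to 1 hour.
cached_content = {
    "model": "models/gemini-2.0-flash-001",  # placeholder model name
    "contents": [
        {
            "role": "user",
            "parts": [{"text": "<large reusable context goes here>"}],
        }
    ],
    "ttl": "3600s",  # 1 hour; omit to use the default
}
```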
DeepSeek / Grok / Moonshot / Groq
- Cost: Writing to the cache is free or charged at the standard rate; reading from the cache is cheaper than the standard rate
Claude Explicit Caching
- Requires manually marking cache locations via cache_control
- Allows fine-grained control over caching granularity
- Applicable to Anthropic Claude models
OpenAI Compatible Interface
You can set cache breakpoints in system, user (including images), and tools using the cache_control field. The following examples show only the key structure:
System Message Caching (default 5 minutes TTL):
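A minimal sketch of such a request body, assuming an OpenAI-compatible endpoint that accepts the cache_control extension inside content parts; the model name and message text are placeholders:

```python
# System message with a cache breakpoint (OpenAI-compatible interface).
# cache_control marks the prefix up to this point as cacheable;
# the default TTL is 5 minutes.
request_body = {
    "model": "claude-sonnet-4",  # placeholder model name
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<long, stable system instructions>",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Answer the question below."},
    ],
}
```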
Tool Caching: place cache_control at the top level of the tool object (at the same level as type and function):
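A sketch of that structure, with a hypothetical get_weather tool as the placeholder definition:

```python
# Tool definition with cache_control at the top level of the tool object,
# alongside "type" and "function" (OpenAI-compatible interface).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # placeholder tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
        "cache_control": {"type": "ephemeral"},
    }
]
```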
Anthropic Compatible Interface
Caching Duration
- Default: 5 minutes
- Optional: 1 hour ("ttl": "1h")
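A sketch of an Anthropic-style request that opts into the 1-hour TTL; the model name and text are placeholders:

```python
# Anthropic-style request with an explicit cache breakpoint.
# "ttl": "1h" extends the cache lifetime from the 5-minute default.
request_body = {
    "model": "claude-sonnet-4",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large reference document or instructions>",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the document."}
    ],
}
```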
For more information, please refer to: Claude Prompt Caching
Usage Recommendations
- Maintain Stable Prefixes
- Cache Large Texts
  - RAG data
  - Long texts
  - CSV / JSON data
  - Role settings
- Control TTL
  - Short sessions → 5 minutes
  - Long sessions → 1 hour (more cost-effective)
- Reduce Cache Writes
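The recommendations above can be sketched as a message builder that keeps the large, stable content in a fixed prefix and appends only the variable part, so repeated requests share the same cacheable prefix (the function and constant names here are illustrative):

```python
# Keep stable content (role setting, RAG data, long documents) in a fixed
# prefix so every request shares the same cacheable start; append only the
# per-request question at the end. Changing anything inside the prefix
# forces a new cache write.
STABLE_PREFIX = [
    {"role": "system", "content": "<role setting and instructions>"},
    {"role": "user", "content": "<RAG data / long text / CSV or JSON data>"},
]

def build_messages(question):
    """Return a message list whose prefix is identical across requests."""
    return STABLE_PREFIX + [{"role": "user", "content": question}]

m1 = build_messages("First question")
m2 = build_messages("Second question")
# The first two messages are byte-identical across requests, so the
# prefix is written to the cache once and then read on every later call.
```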