Prompt Caching is an important mechanism for reducing model inference costs. Previously processed prompt content is cached and reused in subsequent requests, cutting redundant computation, lowering costs, and improving response latency.

Principle

When you send a request with prompt caching enabled, the system checks if the prompt prefix has been cached from recent queries. If found, it uses the cache, reducing processing time and costs; otherwise, it processes the full prompt and caches the prefix after the response begins. This is particularly useful in the following scenarios:
  • Prompts containing numerous examples
  • Extensive context or background information
  • Repetitive tasks with consistent instructions
  • Long multi-turn conversations

Core Mechanism

Different model providers have varying support for caching:

Automatic Caching

Automatic caching requires no additional configuration: the system automatically identifies and caches reusable content. It is available from providers such as OpenAI and DeepSeek.

OpenAI

  • Minimum prompt length: 1024 tokens
  • Cost: Writing to cache is free; reading from cache costs 0.25x to 0.5x the original price
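Cache hits can be verified from the response itself: OpenAI reports them under `usage.prompt_tokens_details.cached_tokens`. A minimal sketch of reading that field (the helper name is illustrative; the field layout is from the Chat Completions usage object):

```python
def cached_token_count(usage: dict) -> int:
    """Return the number of prompt tokens served from cache.

    OpenAI reports cache hits under usage.prompt_tokens_details.cached_tokens;
    0 means the request missed the cache (or the prompt was shorter than the
    1024-token minimum)."""
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0)

# Example usage payload as returned on a cache hit:
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1792},
}
print(cached_token_count(usage))  # 1792 of the 2048 prompt tokens were cached
```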

Gemini

  • Implicit context caching is enabled by default, and caching is automatically effective without manual configuration.
  • Caching is only effective when the content, model, and parameters are identical; any differences will be treated as a new request and will not hit the cache.
  • The cache validity period can be set by the developer or left unset; if unspecified, it defaults to 1 hour. There are no minimum or maximum duration limits, and costs depend on the number of cached tokens and how long they are cached.

DeepSeek / Grok / Moonshot / Groq

  • Cost: Writing to cache is free or charged at the standard input price; reading from cache is cheaper than the original input price

Claude Model Explicit Caching

  • Requires manual specification of the cache location via cache_control
  • Allows fine-grained control over caching granularity
  • Applicable to Anthropic Claude models

OpenAI Compatible Interface

You can set caching breakpoints in system and user messages (including images) and in tool definitions using the cache_control field. The following examples show only the key structure.

System Message Caching (default 5-minute TTL):
{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello"}]
    }
  ]
}
User Message Caching (1 hour TTL):
{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are an AI assistant"}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Hello"}
      ]
    }
  ]
}
Image Message Caching:
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {"detail": "auto", "url": "data:image/jpeg;base64,/9j/4AAQ..."},
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": "What's this?"}
  ]
}
Tool Definition Caching: Place the cache_control at the top level of the tool object (at the same level as type and function):
{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    },
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
  }]
}

Anthropic Compatible Interface

curl https://aihubmix.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $AIHUBMIX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
      },
      {
        "type": "text",
        "text": "<the entire contents of Pride and Prejudice>",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Analyze the major themes in Pride and Prejudice."
      }
    ]
  }'

# Call the model again with the same input up to the caching checkpoint
curl https://aihubmix.com/v1/messages # rest of input
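Whether the second call actually hit the cache can be checked from the Messages API response: Anthropic reports `cache_creation_input_tokens` (tokens written to cache) and `cache_read_input_tokens` (tokens served from cache) in the `usage` object. A sketch of interpreting those fields (the classifier function is illustrative):

```python
def cache_status(usage: dict) -> str:
    """Classify a Messages API response by its cache usage fields."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "cache write"
    return "no caching"

# The first call writes the prefix; the repeat call reads it back:
print(cache_status({"cache_creation_input_tokens": 18700,
                    "cache_read_input_tokens": 0}))      # cache write
print(cache_status({"cache_creation_input_tokens": 0,
                    "cache_read_input_tokens": 18700}))  # cache hit
```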

Caching Duration

  • Default: 5 minutes
  • Optional: 1 hour ("ttl": "1h")
For more information, please refer to: Claude Prompt Caching

Usage Recommendations

  1. Maintain Stable Prefixes
Place fixed content at the beginning of the prompt. Recommended structure:
[System Settings / Long Text / RAG Data] 
[User Question (Variable Part)]
  2. Cache Large Texts
Prioritize caching the following content:
  • RAG data
  • Long texts
  • CSV / JSON data
  • Role settings
  3. Control TTL
  • Short sessions → 5 minutes
  • Long sessions → 1 hour (more cost-effective)
  4. Reduce Cache Writes
Keep frequently changing content out of the cached prefix: do not cache timestamps, user input variables, high-frequency data, etc.
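The recommendations above can be sketched as a small request builder that keeps the stable, cacheable context first and the variable question last (the message shape follows the OpenAI-compatible examples earlier; the helper itself is illustrative):

```python
def build_messages(static_context: str, question: str) -> list[dict]:
    """Build a messages array with a stable, cacheable prefix.

    The fixed content (system settings / long text / RAG data) goes first
    and carries the cache breakpoint; the variable user question goes last,
    so repeated calls with new questions still hit the cached prefix."""
    return [
        {"role": "system", "content": [
            {"type": "text", "text": static_context,
             "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": [{"type": "text", "text": question}]},
    ]

msgs = build_messages("<long RAG data>", "Summarize section 2.")
```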