When using text generation models, the inputs of different inference requests may overlap, for example in multi-round conversations or when asking multiple questions about the same subject. The context cache feature caches the overlapping prefix content of such requests, reducing redundant computation during inference. It improves response speed and reduces cost without affecting response quality.
Supported models
Currently, qwen-max, qwen-plus, and qwen-turbo support context cache.
Snapshot and latest models are not supported.
Feature overview
How to use
When you send a request to a supported model, context cache is automatically activated. The system checks whether the prefix of the request is already stored in the cache and, on a hit, reuses the cached result for inference.
Cache information that remains unused for a certain period will be periodically cleared.
The cache hit ratio is not 100%. Even with identical contexts, cache misses may occur.
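No extra parameter is needed to turn the feature on; a plain request to a supported model is enough. The following is a minimal sketch using the OpenAI SDK for Python, assuming the DashScope OpenAI-compatible endpoint and an API key stored in the DASHSCOPE_API_KEY environment variable; the base_url may differ for your region and account.
import os
from openai import OpenAI

# Context cache requires no extra parameter; a normal request to a
# supported model (qwen-max, qwen-plus, qwen-turbo) is enough.
# The base_url is an assumption; adjust it to your region if needed.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)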
Billing details
Enabling context cache does not require additional payment. If the system determines that your request hits the cache, the hit tokens will be charged as cached_token. The tokens that are not hit will be charged as input_token. The unit price of cached_token is 40% of the unit price of input_token.
Suppose you send a request of 10,000 tokens and the system determines that 5,000 tokens hit the cache. Then 5,000 tokens are billed at the cached_token price (40% of the input_token price) and the remaining 5,000 tokens are billed at the input_token price, which is equivalent to billing 7,000 tokens at the input_token price.
output_token is charged at the original price.
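For illustration, the example above works out as follows. This is only a sketch of the arithmetic; the unit price used here is a hypothetical placeholder, and only the 40% ratio comes from this page.
# Hypothetical unit price for illustration only (price per 1,000 input tokens).
input_price_per_1k = 0.0008                      # assumed placeholder value
cached_price_per_1k = input_price_per_1k * 0.4   # cached_token costs 40% of input_token

cached_tokens = 5_000                # tokens that hit the cache
uncached_tokens = 10_000 - cached_tokens

input_cost = (uncached_tokens / 1000) * input_price_per_1k \
           + (cached_tokens / 1000) * cached_price_per_1k
# Equivalent to billing 5,000 + 5,000 * 0.4 = 7,000 tokens at the input_token price.
print(input_cost)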
The cached_tokens property in the response indicates the number of tokens that hit the cache.
If you use the OpenAI-compatible Batch mode, the context cache discount is not available.
How to increase the probability of a cache hit
The system determines a cache hit by checking whether duplicate content exists in the prefix of a request. Therefore, to increase the probability of a cache hit, put the common content at the start of the prompt and the unique content at the end.
For example, if the system has cached "ABCD", a request starting with "ABE" may hit the cache (it shares the prefix "AB"), but a request starting with "BCD" will not.
Content shorter than 256 tokens will not be cached.
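A minimal sketch of this ordering is shown below; the document variable and the questions are hypothetical placeholders. Only the trailing question differs between the two requests, so they share the longest possible prefix.
# Shared, unchanging content (system prompt and document) goes first,
# so that requests share the longest possible cacheable prefix.
long_document = "<article content>"   # hypothetical placeholder

def build_messages(question: str) -> list:
    return [
        {"role": "system", "content": "You are a reading-comprehension assistant."},
        # Common prefix: identical across requests, eligible for caching.
        {"role": "user", "content": f"{long_document}\n\n{question}"},
    ]

# Only the trailing question differs between these two requests,
# so the second one is more likely to hit the cache.
messages_a = build_messages("Summarize the first paragraph.")
messages_b = build_messages("What is the author's main argument?")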
How it works
Search
When receiving a request, the system searches for the prompt's prefix in the cache.
Determine
The system then determines whether the cache is hit:
Hit
If the cache is hit, the system uses the cached result for inference.
Miss
If the cache is missed, the system processes the request normally, and the prompt's prefix is cached for future use.
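The toy sketch below restates this flow. It is an illustration only, not the service's actual implementation: it matches string prefixes character by character, whereas the real service caches model computation over token prefixes.
# Toy illustration of the search/hit/miss flow above; the real service
# caches model computation states and counts tokens, not characters.
MIN_PREFIX_CHARS = 256  # stands in for the 256-token minimum

cache: set[str] = set()

def handle_request(prompt: str) -> bool:
    """Return True on a cache hit, False on a miss (toy sketch)."""
    # Search: check whether a stored prefix matches the start of this prompt.
    hit = any(prompt.startswith(prefix) for prefix in cache)
    if not hit and len(prompt) >= MIN_PREFIX_CHARS:
        # Miss: process normally and store the prompt's prefix for later reuse.
        cache.add(prompt[:MIN_PREFIX_CHARS])
    return hit

first = "<long shared document>" * 50 + " question 1"
second = "<long shared document>" * 50 + " question 2"
print(handle_request(first))   # False: nothing cached yet
print(handle_request(second))  # True: shared prefix was cached by the first call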
Cache hit cases
When you use the OpenAI SDK to call a model and hit the cache, the following sample response is returned. The usage.prompt_tokens_details.cached_tokens property shows the number of tokens that hit the cache. The usage.prompt_tokens property shows the total number of input tokens, which includes cached_tokens.
{
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "I am a large language model developed by Alibaba Cloud, called Qwen."
            },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null
        }
    ],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 3019,
        "completion_tokens": 104,
        "total_tokens": 3123,
        "prompt_tokens_details": {
            "cached_tokens": 2048
        }
    },
    "created": 1735120033,
    "system_fingerprint": null,
    "model": "qwen-plus",
    "id": "chatcmpl-6ada9ed2-7f33-9de2-8bb0-78bd4035025a"
}
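To check the hit programmatically, read these fields from the response object. The following is a brief sketch with the OpenAI SDK for Python; it assumes the client configured in the earlier sketch, and the field names follow the sample response above (prompt_tokens_details may be absent on a miss).
# Assumes `client` is the OpenAI-compatible client from the earlier sketch.
response = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<article content> Summarize this text."},
    ],
)

usage = response.usage
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # 0 means no cache hit
print(f"prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")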
When you use the DashScope SDK for Python or the HTTP method to call a model and hit the cache, the following sample response is returned. The usage.prompt_tokens_details.cached_tokens property shows the number of tokens that hit the cache. The usage.input_tokens property shows the total number of input tokens, which includes cached_tokens.
The DashScope SDK for Java also supports context cache, but currently does not display cached_tokens.
{
    "status_code": 200,
    "request_id": "f3acaa33-e248-97bb-96d5-cbeed34699e1",
    "code": "",
    "message": "",
    "output": {
        "text": null,
        "finish_reason": null,
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": "I am a large language model from Alibaba Cloud, called Qwen. I can generate various types of text, such as articles, stories, poems, and more, and can transform and extend according to different scenarios and needs. Additionally, I can answer various questions, provide assistance, and offer solutions. If you have any questions or need help, please feel free to let me know, and I will do my best to provide support. Please note that continuously repeating the same content may not result in more detailed answers. It is recommended to provide more specific information or change the way of asking questions to better understand your needs."
                }
            }
        ]
    },
    "usage": {
        "input_tokens": 3019,
        "output_tokens": 101,
        "prompt_tokens_details": {
            "cached_tokens": 2048
        },
        "total_tokens": 3120
    }
}
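The following is a brief sketch of reading the same fields with the DashScope SDK for Python. The field names follow the sample response above; the exact access style (attribute or dict) may vary with the SDK version, and the API key is assumed to be set in the DASHSCOPE_API_KEY environment variable.
import dashscope

# Context cache is automatic; no extra parameter is required.
response = dashscope.Generation.call(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<article content> Summarize this text."},
    ],
    result_format="message",
)

usage = response.usage
# Field names follow the sample response above; prompt_tokens_details
# may be absent when there is no cache hit.
print("input_tokens:", usage.get("input_tokens"))
print("cached_tokens:", usage.get("prompt_tokens_details", {}).get("cached_tokens", 0))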
Typical scenarios
Context cache can significantly improve inference speed, reduce costs, and decrease latency for requests with identical prefixes. Here are some common use cases:
Q&A based on long text
Scenarios involving repeated requests for lengthy texts that do not change, such as stories, textbooks, and legal documents.
Messages array for the first request
messages = [
    {"role": "system", "content": "You are an English teacher who can help students with reading comprehension."},
    {"role": "user", "content": "<article content> What thoughts and feelings does the author express in this text?"}
]
Messages array for subsequent requests
messages = [
    {"role": "system", "content": "You are an English teacher who can help students with reading comprehension."},
    {"role": "user", "content": "<article content> Please analyze the third paragraph of this text."}
]
In this case, different questions are asked about the same article. The shared prefix (system prompt and article content) makes subsequent requests more likely to hit the cache.
Code auto-completion
In this case, the model automatically completes code based on the context. While you are coding, the preceding part of the code usually remains unchanged, so context cache can cache that preceding code and speed up completion.
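A brief sketch of keeping the unchanged preceding code as the shared prefix across completion requests; the prompt wording, variable names, and helper function here are hypothetical.
# The code already written stays at the front of the prompt, so consecutive
# completion requests share a long, cacheable prefix.
preceding_code = "<existing file content>"  # illustrative placeholder

def completion_messages(cursor_context: str) -> list:
    return [
        {"role": "system", "content": "You are a code completion assistant."},
        # Shared prefix: the unchanged preceding code.
        {"role": "user", "content": f"{preceding_code}\n# Complete the code at the cursor:\n{cursor_context}"},
    ]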
Multi-round conversation
Multi-round conversations incorporate each round's dialogue into the messages array, creating consistent prefixes. This enhances the probability of cache hits.
Messages array for the first round
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
]
Messages array for the second round
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Qwen, developed by Alibaba Cloud."},
    {"role": "user", "content": "What can you do?"}
]
The speed and cost benefits of context cache become more evident as conversation rounds increase.
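A brief sketch of maintaining such a growing messages array across rounds; it assumes the `client` configured in the earlier sketch.
# Each round appends to the same messages list, so every new request
# starts with the full, unchanged history as its prefix.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for question in ["Who are you?", "What can you do?"]:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="qwen-plus", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(reply)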
Role-playing or few-shot learning
These scenarios often require extensive prompts to guide the model's output, leading to repeated prefix information across requests.
For example, to role-play a marketing expert, the system prompt may contain extensive text.
Messages array for two requests
system_prompt = """You are an experienced marketing expert. Please provide detailed marketing suggestions for different products in the following format:
1. Target audience: xxx
2. Key selling points: xxx
3. Marketing channels: xxx
...
12. Long-term development strategy: xxx
Please ensure your suggestions are specific, actionable, and highly relevant to the product features."""

# User message for the first request, asking about a smartwatch
messages_1 = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please provide marketing suggestions for a newly launched smartwatch."}
]

# User message for the second request, asking about a laptop; it has a high
# probability of hitting the cache because it shares the same system_prompt
messages_2 = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please provide marketing suggestions for a newly launched laptop."}
]
Context cache allows for swift responses even if the user changes the product type frequently, in this case from a smartwatch to a laptop.