Supported models
Prompt caching is supported on the step-1, step-1v, step-1.5v, and step-2 series. Other models do not currently support it.

How prompt caching works
Caching activates automatically when a request exceeds 256 tokens. When you call the API:
- Cache lookup: the system checks whether a prefix of your prompt is already cached.
- Cache hit: if found, the cache is reused and only the uncached portion is processed.
- Cache miss: if not found, the request is processed normally and the prompt prefix is cached for future use.
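The lookup can be pictured as prefix matching in fixed-size token blocks. Below is a minimal local sketch of that behavior; the 256-token block size comes from this document, while the in-memory `cache` set and token lists are hypothetical stand-ins for the opaque server-side store:

```python
BLOCK = 256  # minimum caching unit, per the docs


def cache_lookup(prompt_tokens, cache):
    """Return how many leading prompt tokens are served from cache.

    `cache` is a hypothetical set of previously seen token-block
    prefixes; the real store is server-side and opaque to callers.
    """
    hit = 0
    # Walk the prompt in 256-token blocks; stop at the first uncached block.
    for i in range(0, len(prompt_tokens) - len(prompt_tokens) % BLOCK, BLOCK):
        prefix = tuple(prompt_tokens[: i + BLOCK])
        if prefix in cache:
            hit = i + BLOCK
        else:
            break
    return hit


def cache_store(prompt_tokens, cache):
    """On a miss, cache every complete 256-token block prefix."""
    for i in range(BLOCK, len(prompt_tokens) + 1, BLOCK):
        cache.add(tuple(prompt_tokens[:i]))
```

Note that only complete 256-token blocks count: a 600-token prompt can yield at most 512 cached tokens, and any change early in the prompt invalidates every block after it.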
How to tell if caching was used
If response.usage contains cached_tokens, the request hit the cache; cached_tokens reports the number of tokens served from the cache.
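Checking this can be wrapped in a small helper. This sketch assumes an OpenAI-compatible chat-completion response parsed into a dict; the `usage`/`cached_tokens` field names are from this document, the overall shape is an assumption:

```python
def cached_token_count(response: dict) -> int:
    """Return cached_tokens from a chat-completion response, or 0 if absent.

    Assumes a top-level `usage` object in the parsed response; the
    field names follow the docs, the response shape is an assumption.
    """
    usage = response.get("usage") or {}
    return usage.get("cached_tokens", 0)


def hit_cache(response: dict) -> bool:
    """True if any part of the prompt was served from cache."""
    return cached_token_count(response) > 0
```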
What is cached
Caching uses 256 tokens as the minimum unit, so prompts shorter than 256 tokens will not hit the cache. The following content can be cached:
- Conversation messages: the full message array, including the system prompt, user messages, and assistant messages.
- Images: images within user messages, as long as the same images are reused.
- Video: videos within user messages, with the same constraint.
- Tool calls and results: tool calls inside the conversation and their outputs.
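Because the whole message array participates in caching, a stable prefix can be assembled once and reused across requests. A sketch along those lines (the system text and few-shot examples are placeholders):

```python
# Static prefix: identical across requests, so it is eligible for caching.
STATIC_PREFIX = [
    {"role": "system", "content": "You are a customer-support assistant."},
    {"role": "user", "content": "Example question: how do I reset my password?"},
    {"role": "assistant", "content": "Example answer: use the reset link on the login page."},
]


def build_messages(question: str) -> list:
    """Static, cacheable content first; the dynamic question last."""
    return STATIC_PREFIX + [{"role": "user", "content": question}]
```

Every call built this way shares the same byte-identical prefix, which is what the cache keys on.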
Prompt caching tips
- Put static or repeated content at the beginning of the prompt, and place dynamic parts at the end.
- Monitor the ratio of cached_tokens to prompt_tokens, along with latency and hit rate, to optimize prompt structure.
- Cache entries that have not been used recently may be evicted. Keep the prompt prefix stable to improve the hit rate.
- For long prompts, send them during off-peak hours to reduce evictions and improve hit rate.
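The monitoring tip above can be implemented by aggregating the `usage` objects from past responses. A minimal sketch; the field names are from this document, the aggregation itself is an assumption about how you might track it:

```python
def cache_stats(usages: list) -> dict:
    """Aggregate cached_tokens vs prompt_tokens over a batch of responses.

    `usages` is a list of `usage` dicts collected from prior responses.
    """
    prompt = sum(u.get("prompt_tokens", 0) for u in usages)
    cached = sum(u.get("cached_tokens", 0) for u in usages)
    hits = sum(1 for u in usages if u.get("cached_tokens", 0) > 0)
    return {
        # Token-level ratio: what share of prompt tokens came from cache.
        "cached_ratio": cached / prompt if prompt else 0.0,
        # Request-level rate: what share of requests hit the cache at all.
        "hit_rate": hits / len(usages) if usages else 0.0,
    }
```

A low cached_ratio with a stable prefix usually means the dynamic portion dominates; a low hit_rate suggests the prefix is changing between calls or entries are being evicted.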
Troubleshooting cache misses
If a request is not caching as expected, check the following:
- Ensure the cached portion stays identical across calls.
- Verify calls are not spaced so far apart that the cache expires.
- Provide at least 256 tokens of input.
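For the first check, comparing two serialized prompts locally can show where they first diverge. This is a local diagnostic sketch, not part of the API; a timestamp or request ID embedded in the system prompt is a common culprit:

```python
import json


def first_divergence(messages_a: list, messages_b: list) -> int:
    """Return the character offset where two serialized prompts diverge,
    or -1 if one is a prefix of the other (identical prefixes cache well)."""
    a = json.dumps(messages_a, ensure_ascii=False)
    b = json.dumps(messages_b, ensure_ascii=False)
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return -1
```

Run it on the message arrays from two consecutive calls; an early divergence offset points at the part of the prefix that is changing between requests.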