When building multi-turn conversations or fixed-persona Q&A, you pass the conversation history on every turn, so repeated content accumulates as turns grow. Stepfun provides context caching: each request’s input is cached, and subsequent requests that repeat it automatically hit the cache. This speeds up inference (up to 90% faster time-to-first-token in some cases) and reduces cost: cached tokens are billed at 20% of the corresponding model’s token price.

Supported models

Prompt caching is supported on the step-1, step-1v, step-1.5v, and step-2 series. Other models do not currently support it.

How prompt caching works

Caching automatically activates when a request exceeds 256 tokens. When you call the API:
  1. Cache lookup: the system checks for a cached prefix of your prompt.
  2. Cache hit: if found, the cache is reused and only the uncached portion is processed.
  3. Cache miss: if not found, the request is processed normally and the prompt prefix is cached for future use.
Cache hit rates are higher for prompts with multiple examples, multi-turn history, or long context/background sections. Eviction uses LRU: during peak traffic, unused cache entries are evicted more aggressively; off-peak they live longer.
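The mechanics above suggest a simple pattern: keep the history byte-identical across turns so every request shares the previous request’s prompt as a cached prefix. A minimal sketch (the build_request helper is illustrative, not part of any Stepfun SDK, and the model name is an assumption):

```python
# Sketch: keep conversation history identical across turns so each
# request shares the previous request's prompt as a cached prefix.

SYSTEM_PROMPT = "You are a helpful assistant."  # static prefix, cached once

def build_request(history, new_user_message, model="step-1-8k"):
    """Append the new turn after the unchanged history."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)  # identical prefix on every call -> cache hit
    messages.append({"role": "user", "content": new_user_message})
    return {"model": model, "messages": messages}

# Turn 1
history = []
req1 = build_request(history, "What is context caching?")
# ... send req1, receive the assistant's reply ...
history.append({"role": "user", "content": "What is context caching?"})
history.append({"role": "assistant", "content": "..."})

# Turn 2 repeats turn 1's prompt verbatim as its prefix
req2 = build_request(history, "How much does it cost?")
```

Because turn 2’s first two messages are byte-identical to turn 1’s prompt, the system can serve that prefix from cache and process only the appended turn.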

How to tell if caching was used

If response.usage contains cached_tokens, the request hit the cache. cached_tokens shows the token count served from cache.
{
    "id": "22ebae159c8f10c8657253671c8f7f17.6de8d4f65f5ba3dddeb693a1aa83de1e",
    "object": "chat.completion",
    "created": 1730963954,
    "model": "step-1o-turbo-vision",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "xxx"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "cached_tokens": 512,
        "prompt_tokens": 591,
        "completion_tokens": 120,
        "total_tokens": 711
    }
}
In this example, 512 of the 591 prompt tokens were served from the cache, so only 591 - 512 = 79 tokens were processed as fresh input—a reduction of about 87%. Because cached tokens are still billed at 20% of the normal rate, the billed input is equivalent to 79 + 512 × 0.2 ≈ 181 tokens, roughly a 69% cost reduction.
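The effective input cost can be computed from the usage object directly. A small sketch, assuming the 20% cached-token rate described in this document (the helper name is illustrative):

```python
# Sketch: compute the effective billed input from a usage object,
# assuming cached tokens are billed at 20% of the normal token price.

CACHED_RATE = 0.2  # cached-token price as a fraction of the normal price

def effective_input_tokens(usage):
    cached = usage.get("cached_tokens", 0)
    uncached = usage["prompt_tokens"] - cached
    return uncached + cached * CACHED_RATE

usage = {"cached_tokens": 512, "prompt_tokens": 591,
         "completion_tokens": 120, "total_tokens": 711}

billed = effective_input_tokens(usage)        # 79 + 512 * 0.2 = 181.4
saving = 1 - billed / usage["prompt_tokens"]  # ~0.69 cost reduction
```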

What is cached

Caching uses 256 tokens as the minimum unit. Prompts shorter than 256 tokens will not hit the cache. The following content can be cached:
  • Conversation messages: the full array, including system prompt, user messages, and assistant messages.
  • Images: images within user messages, as long as the same images are reused.
  • Video: videos within user messages, with the same constraint.
  • Tool calls and results: tool calls inside the conversation and their outputs.
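The cacheable content types above can all appear in a single messages array. A hypothetical sketch (the image URL, tool name, and message contents are placeholders, not real endpoints):

```python
# Sketch: a messages array mixing the cacheable content types listed
# above. The image URL and tool name are hypothetical placeholders.

messages = [
    {"role": "system", "content": "You are a product support assistant."},
    {"role": "user", "content": [
        {"type": "text", "text": "What is shown in this picture?"},
        # Reusing the exact same image across requests lets it be cached.
        {"type": "image_url",
         "image_url": {"url": "https://example.com/product.jpg"}},
    ]},
    # Assistant tool calls and their results in the history are cacheable.
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "lookup_manual",
                      "arguments": '{"section": "setup"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Setup instructions ..."},
    {"role": "assistant", "content": "Here is how to set it up ..."},
]
```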

Prompt caching tips

  • Put static or repeated content at the beginning of the prompt, and place dynamic parts at the end.
  • Monitor ratios of cached_tokens vs prompt_tokens, latency, and hit rate to optimize prompt structure.
  • Cache entries that have not been used recently may be evicted. Keep the prompt prefix stable to improve the hit rate.
  • For long prompts, send them during off-peak hours to reduce evictions and improve hit rate.
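The monitoring suggested above can be a simple aggregation over the usage objects you already receive. A sketch (the helper and the sample numbers are illustrative):

```python
# Sketch: aggregate cache statistics across requests to monitor the
# cached_tokens / prompt_tokens ratio and the overall hit rate.

def cache_stats(usages):
    total_prompt = sum(u["prompt_tokens"] for u in usages)
    total_cached = sum(u.get("cached_tokens", 0) for u in usages)
    hits = sum(1 for u in usages if u.get("cached_tokens", 0) > 0)
    return {
        "cached_ratio": total_cached / total_prompt if total_prompt else 0.0,
        "hit_rate": hits / len(usages) if usages else 0.0,
    }

usages = [
    {"prompt_tokens": 591, "cached_tokens": 512},
    {"prompt_tokens": 600, "cached_tokens": 0},    # cache miss
    {"prompt_tokens": 700, "cached_tokens": 512},
]
stats = cache_stats(usages)  # hit_rate 2/3, cached_ratio ~0.54
```

A falling cached_ratio or hit_rate is a signal that the prompt prefix is drifting between calls, or that entries are being evicted between requests.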

Troubleshooting cache misses

If a request is not caching as expected, check the following:
  1. Ensure the cached portion stays identical across calls.
  2. Verify calls are not spaced so far apart that the cache expires.
  3. Provide at least 256 tokens of input.
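For the first check, a quick way to debug unexpected misses is to diff the current messages array against the previous one and find where they diverge; any difference before the expected prefix boundary breaks the cache. A sketch (the helper is illustrative):

```python
import json

# Sketch: locate the first point where two message arrays diverge,
# which is where the shared cacheable prefix ends.

def first_divergence(messages_a, messages_b):
    """Return the index of the first differing message, or None if one
    array is a prefix of the other (or they are identical)."""
    for i, (a, b) in enumerate(zip(messages_a, messages_b)):
        if json.dumps(a, sort_keys=True) != json.dumps(b, sort_keys=True):
            return i
    return None

prev = [{"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"}]
curr = [{"role": "system", "content": "You are helpful. "},  # trailing space!
        {"role": "user", "content": "Hi"}]

idx = first_divergence(prev, curr)  # 0: the system prompts differ subtly
```

Even an invisible difference such as a trailing space changes the tokenized prefix, so the cache lookup misses from that message onward.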

FAQ

Does caching affect model quality?

Caching does not affect model quality. Each generation still uses the full prompt.

Do I pay extra for prompt caching?

Cached tokens are billed at 20% of the normal token price for that model.

Can I clear the cache manually?

Manual cache clearing is not supported. Modify the prompt if you need to avoid cache hits.

Can I force every request to hit the cache?

Guaranteed cache hits are not supported. If you need this capability, contact us to discuss your scenario.