
Reasoning models

Reasoning models like step-3.5-flash are designed for tasks requiring deep logical analysis, multi-step problem solving, and long-context reasoning.
Reasoning models excel at:
  • Complex Logic: Breaking down intricate problems into manageable steps.
  • Mathematics & Coding: Solving advanced equations and debugging software.
  • Long-context Agents: Staying stable and coherent while reasoning over very long inputs and multi-step workflows.
See Reasoning Model for model details.

Audio models

Audio models such as step-tts-2 convert text into natural speech and support voice cloning.
Audio models can be used for tasks including, but not limited to:
  • Voice assistants: customer service and smart speakers.
  • Audiobooks and podcasts: text narration with consistent voice.
  • Games and NPCs: character voices at scale.
  • Media production: quick voiceover drafts.
See Audio Models for model details and Generate audio for the API.

Context length

Context length is the amount of input text a model considers when generating or predicting. It limits how much information the model processes in a single request.
Why it matters
  • Quality: Context length governs how much the model can remember and use, affecting understanding and generation.
  • Performance: Larger contexts can improve accuracy but increase compute and latency.
  • Cost: Longer inputs and outputs consume more tokens, so balance quality against spend.
Where it applies
  • Chat systems: affects coherence and context retention across turns.
  • Creative writing: longer contexts can produce more coherent, logical narratives.
  • Research papers: helps the model digest background, data, and detailed discussion.
  • Novels and literature: captures plot progression and character relationships.
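The budgeting described above reduces to a simple check: the prompt plus the room reserved for the reply must fit inside the context window. A minimal sketch, where the 32K window is an arbitrary illustrative value rather than any specific model's limit:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 32_000) -> bool:
    """Return True if the prompt plus reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

# Reserve room for the reply when sizing the prompt.
fits_context(30_000, 1_000)   # True: 31,000 <= 32,000
fits_context(31_500, 1_000)   # False: the request would be rejected or truncated
```

Note that the output reservation matters: a prompt that fills the entire window leaves no tokens for the model to respond with.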

Token

A token is the basic unit of text a model processes. It can be a character, word, phrase, or sentence depending on the tokenizer and training data. In Chinese, tokenization is especially important because words are not separated by spaces.
Token length
  • Chinese characters vs. tokens: As a rule of thumb, 1 token corresponds to about 1.5–2 Chinese characters; actual counts vary by content and tokenizer.
Context limits
  • Maximum context: The combined input (prompt) and output must stay within the model’s context window.
  • Why the limit matters: It keeps processing efficient and avoids errors from overly long text.
Practical considerations
  • Plan text length: Fit your text within the maximum context so the model can process everything.
  • Optimize tokens: Remove unnecessary tokens or reorganize text to stay within the limit.
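For planning purposes, the 1.5–2 characters-per-token rule of thumb above can be turned into a rough estimator. This is only a budgeting heuristic; real counts come from the model's tokenizer, and the function name here is illustrative:

```python
import math

def estimate_tokens_zh(text: str, chars_per_token: float = 1.5) -> int:
    """Rough token estimate for Chinese text using the 1.5-2 chars/token
    rule of thumb. The conservative end (1.5) overestimates slightly,
    which is the safer direction when budgeting against a context limit."""
    return math.ceil(len(text) / chars_per_token)

estimate_tokens_zh("你好" * 150)  # 300 characters -> ~200 tokens at 1.5 chars/token
```

Using the conservative end of the range means you may trim a little more than strictly necessary, but you avoid surprise overruns at request time.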

Rate limits

Rate limits protect service stability and fairness by capping how many requests a user can make within a given time. Three main measures:
  • RPM (requests per minute): Number of requests allowed per minute. If RPM is 20, you can make at most 20 requests in any rolling one-minute window.
  • TPM (tokens per minute): Number of tokens you can send and receive per minute across all requests and responses. Many short requests may hit RPM before TPM.
  • Concurrency (simultaneous requests): Number of in-flight requests. If the limit is 20, only 20 requests may be in flight at once; new ones are rejected until earlier ones finish.
When rate limits trigger: whichever limit is reached first applies. For example, with RPM = 20 and TPM = 200K, sending 20 ChatCompletions requests of 100 tokens each consumes only about 2,000 tokens, far under the TPM cap, yet reaching 20 requests already triggers the RPM limit.