Reasoning models
Reasoning models like step-3.5-flash are designed for tasks requiring deep logical analysis, multi-step problem solving, and long-context reasoning.
Reasoning models excel at:
- Complex Logic: Breaking down intricate problems into manageable steps.
- Mathematics & Coding: Solving advanced equations and debugging software.
- Long-context Agents: Maintaining stability and reasoning over massive datasets.
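A minimal sketch of how a request to a reasoning model might be assembled, assuming an OpenAI-compatible chat-completions payload; the field names (`messages`, `max_tokens`) and the system-prompt wording are assumptions, not a documented API:

```python
# Sketch of a chat-completions request body for a reasoning model.
# Assumes an OpenAI-compatible payload shape; exact field names may differ.

def build_reasoning_request(question: str, model: str = "step-3.5-flash") -> dict:
    """Build a request payload that asks the model to reason step by step."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Think through the problem step by step before answering."},
            {"role": "user", "content": question},
        ],
        # Leave generous output room: multi-step reasoning produces long answers.
        "max_tokens": 2048,
    }

payload = build_reasoning_request("If 3x + 7 = 22, what is x?")
```

The payload would then be sent to the provider's chat endpoint with any HTTP client.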
Audio models
Audio models such as step-tts-2 convert text into natural speech and support voice cloning.
Audio models support a wide range of use cases, including:
- Voice assistants: customer service and smart speakers.
- Audiobooks and podcasts: text narration with consistent voice.
- Games and NPCs: character voices at scale.
- Media production: quick voiceover drafts.
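A text-to-speech request for the use cases above might be sketched as follows, assuming an OpenAI-style speech payload; the `input`, `voice`, and `response_format` fields are assumptions about the API shape:

```python
# Sketch of a text-to-speech request body.
# Assumes an OpenAI-style /audio/speech payload; field names are assumptions.

def build_tts_request(text: str, voice: str = "default",
                      model: str = "step-tts-2") -> dict:
    """Build a TTS request payload for narration with a consistent voice."""
    return {
        "model": model,
        "input": text,              # the text to narrate
        "voice": voice,             # a preset or cloned voice id (assumed field)
        "response_format": "mp3",   # assumed output-format field
    }

request = build_tts_request("Chapter one. It was a dark and stormy night.")
```

Keeping the `voice` id fixed across chapters is what gives audiobooks their consistent narrator.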
Context length
Context length is the amount of input text a model considers when generating or predicting. It limits how much information the model can process in a single request.
Why it matters
- Quality: Context length governs how much the model can remember and use, affecting understanding and generation.
- Performance: Larger contexts can improve accuracy but also increase compute cost.
- Cost: Longer contexts may help in certain scenarios but raise usage costs, so balance quality and spend.
Typical scenarios
- Chat systems: affects coherence and context retention across turns.
- Creative writing: longer contexts can produce more coherent, logical narratives.
- Research papers: helps the model digest background, data, and detailed discussion.
- Novels and literature: captures plot progression and character relationships.
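For the chat-systems scenario above, a common pattern is to drop the oldest turns until the conversation fits the context window, leaving room for the reply. A minimal sketch, assuming the caller supplies a token-counting function (the ~4-characters-per-token stand-in below is a rough heuristic, not a real tokenizer):

```python
from typing import Callable, Dict, List

def trim_history(messages: List[Dict[str, str]],
                 context_window: int,
                 reserved_output: int,
                 count_tokens: Callable[[str], int]) -> List[Dict[str, str]]:
    """Drop the oldest turns until the prompt fits the context window,
    leaving `reserved_output` tokens of room for the model's reply."""
    budget = context_window - reserved_output
    kept = list(messages)
    while kept and sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)  # the oldest turn goes first
    return kept

# Stand-in counter: roughly 4 characters per token for English text.
def rough_count(text: str) -> int:
    return max(1, len(text) // 4)
```

In production you would replace `rough_count` with the provider's actual tokenizer.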
Token
A token is the basic unit of text a model processes. It can be a character, word, phrase, or sentence depending on the tokenizer and training data. In Chinese, tokenization is especially important because words are not separated by spaces.
Token length
- Chinese characters vs. tokens: Roughly 1 token equals about 1.5–2 Chinese characters, though actual counts vary by content.
- Maximum context: The combined input (prompt) and output must stay within the model’s context window.
- Why the limit matters: It keeps processing efficient and avoids errors from overly long text.
- Plan text length: Fit your text within the maximum context so the model can process everything.
- Optimize tokens: Remove unnecessary tokens or reorganize text to stay within the limit.
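The planning steps above can be sketched as a rough pre-flight check, using the 1.5–2 characters-per-token ratio from the point above (the 1.75 midpoint is an assumption for illustration; actual counts depend on the tokenizer):

```python
import math

def estimate_tokens_zh(text: str, chars_per_token: float = 1.75) -> int:
    """Rough token estimate for Chinese text, using ~1.5-2 characters
    per token (midpoint 1.75). Actual counts vary by tokenizer and content."""
    return math.ceil(len(text) / chars_per_token)

def fits_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """Check that the prompt plus the requested output stays within the window."""
    return estimate_tokens_zh(prompt) + max_output_tokens <= context_window
```

If `fits_context` returns False, trim or reorganize the prompt, or request fewer output tokens.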
Rate limits
Rate limits protect service stability and fairness by capping how many requests a user can make within a given time. There are three main measures:
- RPM (requests per minute): the number of requests allowed per minute. If RPM is 20, you can make at most 20 requests in any rolling one-minute window.
- TPM (tokens per minute): the number of tokens you can send and receive per minute across all requests and responses. Many short requests may hit the RPM limit before the TPM limit.
- Concurrency (simultaneous requests): the number of in-flight requests. If the limit is 20, only 20 concurrent requests are allowed; new ones are rejected until earlier ones finish.
When rate limits trigger: whichever limit is reached first applies. For example:
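The RPM measure described above can be enforced client-side to avoid rejected requests. A minimal sketch of a sliding-window limiter (the class and its interface are illustrative, not part of any provider SDK):

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window` seconds,
    mirroring an RPM cap enforced server-side."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.timestamps: deque = deque()  # times of requests still in the window

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and record the request if it fits under the cap."""
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Before each API call, check `limiter.allow()`; if it returns False, wait and retry rather than sending a request that would be rejected.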