Documentation Index
Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Model overview
Vision models add image and video input to our text models for more complete and accurate understanding and reasoning. We currently offer the step series of vision models:Models
step-1o-turbo-vision
Recommended. Strong image and video understanding. Currently supports text, image, and video input with text-only output. Better math/code performance than step-1o-vision-32k, smaller size, faster output, 32k context window.step-1o-vision-32k
Powerful image understanding with text and image input; text-only output. 32k context. Stronger visual performance than the step-1v series.step-1v
Strong image understanding with text and image input; text-only output. Context windows: 8k and 32k.Key terms
- Image resolution: Pixel width/height. Higher resolution conveys more detail but increases cost, latency, and transfer time. Keep longest side under 4096px.
- Image token count: Depends on resolution; images are adaptively scaled to an optimal size.
- Supported formats: JPG/JPEG, PNG, static GIF, WebP.
- URL formats:
- http/https: Must be accessible from mainland China; load time affects first-token latency.
- base64: Follow RFC2394, e.g.
data:image/jpeg;base64,<base64_data_string>. - References: RFC2397, Data URL Format
Usage limits
- Images per request: Beyond context limits, step-1v models cap requests at 50 images. For long conversations, summarize images first via multimodal models, then pass summaries as text.
- Total image size: Keep combined uploads within 20MB.
- Image metadata: Metadata (path, filename, size, original resolution, author, camera model, location, etc.) is stripped before inference to protect privacy; images are also resized to optimal dimensions.
- Small text: Tiny fonts may reduce recognition quality.
- Rotation/cropping: Incomplete or misaligned images can hurt recognition.
- Counting: Numeric outputs are estimates, not exact counts.
- Accuracy: Descriptions or captions may be imperfect; avoid relying on outputs where errors have serious consequences.
Quickstart
Migrate from OpenAI
Reuse OpenAI-compatible SDK patterns to adopt Stepfun vision models quickly.
Image understanding
Send images in the conversation and build grounded multimodal interactions.
Video understanding
Pass video links to the model so it can read and reason over video content.
Multi-turn conversations
Maintain context over multiple turns for continuous multimodal conversations.
JSON Mode
Return structured JSON outputs for downstream application workflows.
Streaming responses
Stream output progressively to improve perceived latency in the UI.
Tool Call
Combine model reasoning with tools, functions, and external data sources.
Prompt cache
Cache repeated context to optimize longer or repeated multimodal requests.