Vision Models - StepFun Documentation

Models

Step 3.7 Flash

Recommended. Our flagship multimodal reasoning model with native image and video understanding — no extra vision MCP or auxiliary model required. It handles image / video Q&A and cross-modal analysis directly, with three reasoning effort levels (low / medium / high) and a 256K context window. See Step 3.7 Flash.

Key terms

Image resolution: Pixel width/height. Higher resolution conveys more detail but increases cost, latency, and transfer time. Keep longest side under 4096px.

Image token count: Depends on resolution; images are adaptively scaled to an optimal size.

Supported formats: JPG/JPEG, PNG, static GIF, WebP.

URL formats:

http/https: Must be accessible from mainland China; load time affects first-token latency.
base64: Follow RFC2394, e.g. data:image/jpeg;base64,<base64_data_string>.
References: RFC2397, Data URL Format

Usage limits

Images per request: The number of images per request is bounded by the model’s context length. For long conversations, summarize images first via a multimodal model, then pass the summaries as text.

Total image size: Keep combined uploads within 20MB.

Image metadata: Metadata (path, filename, size, original resolution, author, camera model, location, etc.) is stripped before inference to protect privacy; images are also resized to optimal dimensions.

Small text: Tiny fonts may reduce recognition quality.

Rotation/cropping: Incomplete or misaligned images can hurt recognition.

Counting: Numeric outputs are estimates, not exact counts.

Accuracy: Descriptions or captions may be imperfect; avoid relying on outputs where errors have serious consequences.

Quickstart

Migrate from OpenAI

Reuse OpenAI-compatible SDK patterns to adopt Stepfun vision models quickly.

Image understanding

Send images in the conversation and build grounded multimodal interactions.

Video understanding

Pass video links to the model so it can read and reason over video content.

Multi-turn conversations

Maintain context over multiple turns for continuous multimodal conversations.

JSON Mode

Return structured JSON outputs for downstream application workflows.

Streaming responses

Stream output progressively to improve perceived latency in the UI.

Tool Call

Combine model reasoning with tools, functions, and external data sources.

Prompt cache

Cache repeated context to optimize longer or repeated multimodal requests.

​Model overview

​Models

​Step 3.7 Flash

​Key terms

​Usage limits

​Quickstart