Skip to main content

Documentation Index

Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

Model overview

Vision models add image and video input to our text models for more complete and accurate understanding and reasoning. We currently offer the step series of vision models:

Models

step-1o-turbo-vision

Recommended. Strong image and video understanding. Currently supports text, image, and video input with text-only output. Better math/code performance than step-1o-vision-32k, smaller size, faster output, 32k context window.

step-1o-vision-32k

Powerful image understanding with text and image input; text-only output. 32k context. Stronger visual performance than the step-1v series.

step-1v

Strong image understanding with text and image input; text-only output. Context windows: 8k and 32k.

Key terms

  1. Image resolution: Pixel width/height. Higher resolution conveys more detail but increases cost, latency, and transfer time. Keep longest side under 4096px.
  2. Image token count: Depends on resolution; images are adaptively scaled to an optimal size.
  3. Supported formats: JPG/JPEG, PNG, static GIF, WebP.
  4. URL formats:
    • http/https: Must be accessible from mainland China; load time affects first-token latency.
    • base64: Follow RFC2394, e.g. data:image/jpeg;base64,<base64_data_string>.
    • References: RFC2397, Data URL Format

Usage limits

  1. Images per request: Beyond context limits, step-1v models cap requests at 50 images. For long conversations, summarize images first via multimodal models, then pass summaries as text.
  2. Total image size: Keep combined uploads within 20MB.
  3. Image metadata: Metadata (path, filename, size, original resolution, author, camera model, location, etc.) is stripped before inference to protect privacy; images are also resized to optimal dimensions.
  4. Small text: Tiny fonts may reduce recognition quality.
  5. Rotation/cropping: Incomplete or misaligned images can hurt recognition.
  6. Counting: Numeric outputs are estimates, not exact counts.
  7. Accuracy: Descriptions or captions may be imperfect; avoid relying on outputs where errors have serious consequences.

Quickstart

Migrate from OpenAI

Reuse OpenAI-compatible SDK patterns to adopt Stepfun vision models quickly.

Image understanding

Send images in the conversation and build grounded multimodal interactions.

Video understanding

Pass video links to the model so it can read and reason over video content.

Multi-turn conversations

Maintain context over multiple turns for continuous multimodal conversations.

JSON Mode

Return structured JSON outputs for downstream application workflows.

Streaming responses

Stream output progressively to improve perceived latency in the UI.

Tool Call

Combine model reasoning with tools, functions, and external data sources.

Prompt cache

Cache repeated context to optimize longer or repeated multimodal requests.