This guide walks you through the core capability of step-3.7-flash: native multimodal input. You’ll learn how to have the model understand images together with text, and video together with text.
All examples use the Chat Completions API. The model has native multimodal support, so no separate vision model is required.

Prerequisites

1. Get an API key

Visit the console to get your API key.

2. Install dependencies

pip install --upgrade 'openai>=1.0'

Image understanding

step-3.7-flash understands images directly — no additional vision model required.

Minimal example

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image? Describe it in detail."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://postimg.aliavv.com/step/daesog.png"
                }
            }
        ]
    }],
)

print(response.choices[0].message.content)

Use a Base64-encoded image

If your image is a local file, convert it to Base64:
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("your-image.jpg")

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }],
)
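The example above hardcodes image/jpeg in the data URL, which would mislabel a PNG or WebP file. A minimal sketch of one way to pick the MIME type from the file extension is shown below. The helper name image_to_data_url is hypothetical and not part of any StepFun or OpenAI SDK, and whether the API rejects a mismatched MIME type is not stated in this guide.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Return a data: URL for a local image (hypothetical helper).

    The MIME type is guessed from the file extension using the standard
    library, so the file contents themselves are not inspected.
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the "url" value of an image_url item in place of the f-string shown above. Older Python versions may not recognize every extension (for example .webp), in which case the helper raises ValueError instead of guessing.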
For images you’ll reuse, uploading to StepFun file storage speeds up access:
# 1. Upload the image (the with-block closes the file handle after the upload)
with open("sample.jpg", "rb") as image_file:
    file = client.files.create(
        file=image_file,
        purpose="storage"
    )

# 2. Use the file ID
response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"stepfile://{file.id}"
                }
            }
        ]
    }],
)
For prompt templates on whiteboard-to-plan, receipt-to-table, screenshot-to-code, and more, see the Cookbook.

Video understanding

step-3.7-flash supports native video understanding — no separate model required.
Video guidance: up to 128 MB and up to 5 minutes. The examples here use MP4; the FAQ also lists QuickTime (.mov) and Matroska (.mkv) as accepted formats.

Minimal example

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Summarize the main content of this video and pull out key information."
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://example.com/demo.mp4"
                }
            }
        ]
    }],
)

print(response.choices[0].message.content)
For prompt templates on screen-recording diagnostics, action-timeline reconstruction, and more, see the Cookbook.

Control reasoning effort

step-3.7-flash supports three reasoning effort levels — pick one based on task complexity. The Chat Completions API uses reasoning_effort; the Messages API uses output_config.effort.
Effort    Best for
low       Simple Q&A, summarization, rewriting, information extraction
medium    Default. Suitable for general reasoning and multi-step tasks
high      Complex reasoning, math, planning, code analysis
curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [
      {
        "role": "user",
        "content": "Explain reinforcement learning in three sentences."
      }
    ],
    "reasoning_effort": "medium",
    "max_tokens": 1024
  }'
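For readers using the Python client from the earlier examples, the same request can be sketched as follows. The helper build_chat_request is hypothetical; it only assembles and validates the keyword arguments, mirroring the curl payload above. Passing reasoning_effort as a direct keyword assumes a recent version of the openai package; on older versions it could be sent through the SDK's extra_body argument instead.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_chat_request(prompt, effort="medium", max_tokens=1024):
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper: rejects effort values outside the three
    levels listed in the table above before any request is sent.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return {
        "model": "step-3.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
        "max_tokens": max_tokens,
    }


request = build_chat_request(
    "Explain reinforcement learning in three sentences.",
    effort="medium",
)
# With the client created in the earlier examples:
# response = client.chat.completions.create(**request)
# print(response.choices[0].message.content)
```

Validating the effort value locally keeps a typo such as "hihg" from turning into a failed API call.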

Field reference

image_url

Field    Type     Required  Notes
type     string   yes       Fixed to "image_url"
url      string   yes       Image source. Supports URL, Base64, and stepfile://
detail   string   no        Image detail level: low (default) or high
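The earlier examples never set detail, so here is a minimal sketch of a content item that does. It assumes detail sits inside the image_url object next to url, as in the OpenAI-compatible message format; this guide does not show the nesting explicitly, and image_part is a hypothetical helper name.

```python
def image_part(url, detail=None):
    """Build one image_url content item (hypothetical helper).

    Assumption: "detail" is nested inside the "image_url" object,
    alongside "url". When detail is None the field is omitted and the
    documented default applies.
    """
    image_url = {"url": url}
    if detail is not None:
        if detail not in ("low", "high"):
            raise ValueError(f"detail must be 'low' or 'high', got {detail!r}")
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


content = [
    {"type": "text", "text": "Read the small print in this receipt"},
    image_part("https://example.com/receipt.png", detail="high"),
]
```

The content list can be passed as the "content" value of a user message, exactly like the hand-written lists in the image examples.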

video_url

Field    Type     Required  Notes
type     string   yes       Fixed to "video_url"
url      string   yes       Video source. Supports URL, Base64, and stepfile://

FAQ

Q: Video upload fails. What do I do?

A: Make sure the video meets these conditions:
  • Format: MP4, QuickTime (.mov), or Matroska (.mkv)
  • Size: under 128 MB
  • Duration: under 5 minutes
If the video exceeds the limits, you can split it with ffmpeg:
# Split into 2-minute segments
ffmpeg -i input.mp4 -c copy -f segment -segment_time 120 -reset_timestamps 1 output_%d.mp4
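Before uploading, the format and size conditions above can be checked locally. The sketch below is illustrative only: check_video_file is a hypothetical helper, it reads "128 MB" as 128 × 1024 × 1024 bytes (an assumption, since the guide does not say whether MB is decimal or binary), and it does not check duration, which would need a tool such as ffprobe.

```python
import os

# Assumption: "128 MB" is interpreted as 128 MiB.
MAX_VIDEO_BYTES = 128 * 1024 * 1024
# Extensions for the formats listed above: MP4, QuickTime, Matroska.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def check_video_file(path):
    """Return a list of problems; an empty list means both checks passed.

    Only the file extension and size are checked. Duration is not.
    """
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension {extension!r}")
    size = os.path.getsize(path)
    if size >= MAX_VIDEO_BYTES:
        problems.append(f"file is {size} bytes, limit is under {MAX_VIDEO_BYTES}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, for example before deciding whether to split the file with ffmpeg.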

Q: Image / video responses are slow.

A: Upload your files to StepFun storage via the Files API and reference them with stepfile:// for faster access. For images, you can also set detail to low. For video, keep file size and duration small.

Q: How do I process multiple images in one request?

A: Pass multiple image_url items in the content array. Each url must be a full URL, a Base64 data URL, or a stepfile:// reference:
"content": [
    {"type": "text", "text": "Compare the differences between these two images"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
]

Q: Which image formats are supported?

A: JPG / JPEG, PNG, static GIF, and WebP.

Q: Which video formats are supported?

A: MP4, QuickTime (.mov), and Matroska (.mkv).

Next steps

Cookbook

Reusable task templates for whiteboard-to-plan, chart-to-data, receipt-to-table, and more.

Image understanding best practices

A deeper look at image understanding parameters, detail mode, and performance tips.

Video understanding best practices

A deeper look at video understanding limits, pricing estimates, and ffmpeg usage.

Reasoning model guide

Recommended usage of reasoning models for complex tasks, tool calling, and long contexts.