This guide walks you through the core capability of step-3.7-flash: native multimodal input. You'll learn how to send images together with text, and video together with text, in a single request.
The image and video examples use the Chat Completions API; the reasoning-effort section also shows the Messages API. The model has native multimodal support, so no separate vision model is required.
Prerequisites
1. Get an API key
Visit the console to get your API key.
2. Install dependencies
pip install --upgrade 'openai>=1.0'
Image understanding
step-3.7-flash understands images directly — no additional vision model required.
Minimal example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image? Describe it in detail."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://postimg.aliavv.com/step/daesog.png"
                }
            }
        ]
    }],
)
print(response.choices[0].message.content)

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image? Describe it in detail."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://postimg.aliavv.com/step/daesog.png"
          }
        }
      ]
    }]
  }'
Use a Base64-encoded image
If your image is a local file, convert it to Base64:
import base64

def encode_image(image_path):
    # Read the file as bytes and return its Base64 text form
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("your-image.jpg")

# `client` is the OpenAI client created in the minimal example
response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }],
)
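The example above hard-codes `image/jpeg` in the data URL. For PNG, GIF, or WebP files the prefix would presumably need to match the actual format. The following is a minimal sketch, not part of the StepFun SDK: `to_data_url` is a hypothetical helper that guesses the MIME type from the file name with Python's standard `mimetypes` module and falls back to `image/jpeg` when the extension is unknown.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Return a data URL for a local image (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png"
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back when the extension is unknown
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `url` value of an `image_url` item in place of the f-string used above.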
Use the Files API (recommended)
For images you’ll reuse, uploading to StepFun file storage speeds up access:
# 1. Upload the image (the with-block closes the file handle after the upload)
with open("sample.jpg", "rb") as image_file:
    file = client.files.create(
        file=image_file,
        purpose="storage",
    )

# 2. Use the file ID
response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"stepfile://{file.id}"
                }
            }
        ]
    }],
)
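Since the benefit comes from reuse, it can help to avoid uploading the same image twice. The sketch below is one possible approach and not something the StepFun documentation prescribes: `StepfileCache` is a hypothetical class that keys uploads by the SHA-256 digest of the file content and takes the upload function as an argument, so it makes no assumptions about the client beyond "returns a file ID".

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


class StepfileCache:
    """Upload each distinct file once and reuse its stepfile:// reference.

    Hypothetical helper; `upload` is supplied by the caller and must
    return the uploaded file's ID as a string.
    """

    def __init__(self, upload: Callable[[Path], str]) -> None:
        self._upload = upload
        self._ids: Dict[str, str] = {}  # content digest -> file ID

    def url_for(self, path: str) -> str:
        p = Path(path)
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest not in self._ids:
            # First time this content is seen: upload and remember the ID
            self._ids[digest] = self._upload(p)
        return f"stepfile://{self._ids[digest]}"
```

With the client from the earlier examples, the uploader could be a small function that opens the path in binary mode, calls `client.files.create(file=..., purpose="storage")`, and returns the `.id` of the result. The cache lives in memory only, so it does not survive a process restart.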
For prompt templates on whiteboard-to-plan, receipt-to-table, screenshot-to-code, and more, see the Cookbook.
Video understanding
step-3.7-flash supports native video understanding — no separate model required.
Video guidance: up to 128 MB and up to 5 minutes, in MP4, QuickTime (.mov), or Matroska (.mkv) format.
Minimal example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Summarize the main content of this video and pull out key information."
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://example.com/demo.mp4"
                }
            }
        ]
    }],
)
print(response.choices[0].message.content)

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarize the main content of this video and pull out key information."
        },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/demo.mp4"
          }
        }
      ]
    }]
  }'
For prompt templates on screen-recording diagnostics, action-timeline reconstruction, and more, see the Cookbook.
Control reasoning effort
step-3.7-flash supports three reasoning effort levels — pick one based on task complexity. The Chat Completions API uses reasoning_effort; the Messages API uses output_config.effort.
| Effort | Best for |
|---|---|
| low | Simple Q&A, summarization, rewriting, information extraction |
| medium | Default. Suitable for general reasoning and multi-step tasks |
| high | Complex reasoning, math, planning, code analysis |
Chat Completions API

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [
      {
        "role": "user",
        "content": "Explain reinforcement learning in three sentences."
      }
    ],
    "reasoning_effort": "medium",
    "max_tokens": 1024
  }'

Messages API

curl https://api.stepfun.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": "Explain reinforcement learning in three sentences."
      }
    ],
    "output_config": {
      "effort": "medium"
    }
  }'

Python

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": "Analyze the trends and outliers in this data chart"
    }],
    reasoning_effort="high",  # Use high reasoning effort
    max_tokens=2048,
)
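Because the two APIs spell the effort setting differently, a small helper can keep the mapping in one place. This is a sketch under the assumption that the parameter names are exactly those shown above (`reasoning_effort` for Chat Completions, `output_config.effort` for Messages); `effort_params` and the `api` labels are hypothetical names introduced for illustration.

```python
VALID_EFFORTS = ("low", "medium", "high")


def effort_params(api: str, effort: str = "medium") -> dict:
    """Return the request fields that set reasoning effort for the given API.

    Hypothetical helper: `api` is "chat_completions" or "messages".
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    if api == "chat_completions":
        return {"reasoning_effort": effort}
    if api == "messages":
        return {"output_config": {"effort": effort}}
    raise ValueError(f"unknown api: {api!r}")
```

For a Chat Completions call, the result can be unpacked into the request, for example `client.chat.completions.create(model=..., messages=..., **effort_params("chat_completions", "high"))`.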
Field reference
image_url
| Field | Type | Required | Notes |
|---|---|---|---|
| type | string | yes | Fixed to "image_url" |
| url | string | yes | Image source. Supports URL, Base64, and stepfile:// |
| detail | string | no | Image detail level: low (default) or high |
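The table can be turned into a small builder for `image_url` content items. This is a minimal sketch: `image_part` is a hypothetical helper, and it assumes that `detail` sits next to `url` inside the nested `image_url` object, as in other OpenAI-compatible APIs. The table itself does not state where the field goes.

```python
from typing import Optional


def image_part(url: str, detail: Optional[str] = None) -> dict:
    """Build one image_url content item (hypothetical helper)."""
    if detail not in (None, "low", "high"):
        raise ValueError("detail must be 'low' or 'high'")
    image_url = {"url": url}
    if detail is not None:
        # Assumption: detail is nested alongside url, not at the top level
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}
```

Omitting `detail` leaves the documented default (low) in effect; passing `detail="high"` requests the higher detail level.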
video_url
| Field | Type | Required | Notes |
|---|---|---|---|
| type | string | yes | Fixed to "video_url" |
| url | string | yes | Video source. Supports URL, Base64, and stepfile:// |
FAQ
Q: Video upload fails. What do I do?
A: Make sure the video meets these conditions:
- Format: MP4, QuickTime (.mov), or Matroska (.mkv)
- Size: under 128 MB
- Duration: under 5 minutes
If the video exceeds the limits, you can split it with ffmpeg:
# Split into 2-minute segments
ffmpeg -i input.mp4 -c copy -f segment -segment_time 120 -reset_timestamps 1 output_%d.mp4
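The format and size conditions can be checked locally before uploading. The sketch below is illustrative only: `video_precheck` is a hypothetical helper, it treats 128 MB as 128 × 1024 × 1024 bytes (an assumption, since the limit's exact byte count is not stated), and it does not check duration, which would need a tool such as ffprobe.

```python
import os

# Assumption: "128 MB" is interpreted as 128 * 1024 * 1024 bytes
MAX_VIDEO_BYTES = 128 * 1024 * 1024
ALLOWED_VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def video_precheck(path: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_VIDEO_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_VIDEO_BYTES:
        problems.append(f"file too large: {size} bytes")
    return problems
```

The extension check is only a heuristic, since a file's container format is not guaranteed to match its name.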
Q: Image / video responses are slow.
A: Upload your files to StepFun storage via the Files API and reference them with stepfile:// for faster access. For images, you can also set detail to low. For video, keep file size and duration small.
Q: How do I process multiple images in one request?
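A: Pass multiple image_url items in the content array. Each url must be one of the supported source forms (URL, Base64, or stepfile://); the addresses below are placeholders:
"content": [
    {"type": "text", "text": "Compare the differences between these two images"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
]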
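When the number of images varies, the content array can be built programmatically. This is a small sketch with a hypothetical helper name, `build_image_content`; it produces the same structure as the snippet above, with one text item followed by one `image_url` item per source.

```python
from typing import Iterable, List


def build_image_content(prompt: str, image_urls: Iterable[str]) -> List[dict]:
    """Build a content array: the text prompt first, then one item per image."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content
```

The result can be used directly as the `content` value of a user message.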
Q: Which image formats are supported?
A: JPG / JPEG, PNG, static GIF, and WebP.
Q: Which video formats are supported?
A: MP4, QuickTime (.mov), and Matroska (.mkv).
Next steps
- Cookbook: Reusable task templates for whiteboard-to-plan, chart-to-data, receipt-to-table, and more.
- Image understanding best practices: A deeper look at image understanding parameters, detail mode, and performance tips.
- Video understanding best practices: A deeper look at video understanding limits, pricing estimates, and ffmpeg usage.
- Reasoning model guide: Recommended usage of reasoning models for complex tasks, tool calling, and long contexts.