Quickstart - StepFun Documentation

This guide walks you through the core capability of step-3.7-flash: native multimodal input. You’ll learn how to send images with text, and video with text, in a single request, and how to control the model’s reasoning effort.

All examples use the Chat Completions API. The model has native multimodal support — no separate vision model required.

Prerequisites

1. Get an API key

Visit the console to get your API key.
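
The curl examples in this guide read the key from the STEP_API_KEY environment variable, while the Python examples use a placeholder string. A minimal sketch of reading the key from the environment in Python follows; the helper name `load_api_key` is hypothetical and not part of any SDK.

```python
import os


def load_api_key(env=os.environ, name="STEP_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key
```

You could then pass `api_key=load_api_key()` when constructing the client instead of hardcoding the key.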

2. Install dependencies

pip install --upgrade 'openai>=1.0'

Image understanding

step-3.7-flash understands images directly. Pass the image as an image_url item alongside the text item in the message content.

Minimal example

Python

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What's in this image? Describe it in detail."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://postimg.aliavv.com/step/daesog.png"
                }
            }
        ]
    }],
)

print(response.choices[0].message.content)

curl

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image? Describe it in detail."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://postimg.aliavv.com/step/daesog.png"
          }
        }
      ]
    }]
  }'

Use a Base64-encoded image

If your image is a local file, convert it to Base64:

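import base64

# Reuses the `client` created in the minimal example above.

def encode_image(image_path):
    """Read a local image file and return its contents as a Base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# The data URL below declares image/jpeg; change it to match your file type.
base64_image = encode_image("your-image.jpg")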

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }],
)
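
The example above hardcodes image/jpeg in the data URL. If your local files come in different formats, one option is to guess the MIME type from the file extension. This is a sketch using only the Python standard library; `image_to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Encode a local image as a data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an image_url item.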

Use the Files API (recommended)

For images you’ll reuse, upload them once to StepFun file storage and reference them by file ID, which speeds up access:

# 1. Upload the image (reuses the `client` from the minimal example)
with open("sample.jpg", "rb") as image_file:
    file = client.files.create(
        file=image_file,
        purpose="storage",
    )

# 2. Use the file ID
response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Analyze this image"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"stepfile://{file.id}"
                }
            }
        ]
    }],
)

For prompt templates on whiteboard-to-plan, receipt-to-table, screenshot-to-code, and more, see the Cookbook.

Video understanding

step-3.7-flash supports native video understanding — no separate model required.

Video limits: up to 128 MB and up to 5 minutes. Supported formats are MP4, QuickTime (.mov), and Matroska (.mkv).
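
A local pre-check can catch oversized or unsupported files before you send a request. The sketch below checks only file size and extension. The extension set follows the FAQ later in this guide, and the byte limit assumes 1 MB means 1024 * 1024 bytes, which this guide does not specify. Duration is not checked here because reading it needs a media tool such as ffprobe. `check_video_file` is a hypothetical helper name.

```python
import os

MAX_VIDEO_BYTES = 128 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv"}  # formats listed in this guide's FAQ


def check_video_file(path):
    """Return a list of problems found with a local video file (empty if none)."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in VIDEO_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if os.path.getsize(path) > MAX_VIDEO_BYTES:
        problems.append("file is larger than 128 MB")
    return problems
```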

Minimal example

Python

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Summarize the main content of this video and pull out key information."
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://example.com/demo.mp4"
                }
            }
        ]
    }],
)

print(response.choices[0].message.content)

curl

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarize the main content of this video and pull out key information."
        },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://example.com/demo.mp4"
          }
        }
      ]
    }]
  }'

For prompt templates on screen-recording diagnostics, action-timeline reconstruction, and more, see the Cookbook.

Control reasoning effort

step-3.7-flash supports three reasoning effort levels — pick one based on task complexity. The Chat Completions API uses reasoning_effort; the Messages API uses output_config.effort.

Effort	Best for
`low`	Simple Q&A, summarization, rewriting, information extraction
`medium`	Default. Suitable for general reasoning and multi-step tasks
`high`	Complex reasoning, math, planning, code analysis
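
Because the two APIs name the effort field differently, a small helper can keep request-building code uniform. This is a sketch over plain dictionaries. The function name `with_effort` and the `api` labels "chat_completions" and "messages" are hypothetical; only the field names reasoning_effort and output_config.effort come from this guide.

```python
import copy

EFFORT_LEVELS = ("low", "medium", "high")


def with_effort(payload, effort, api="chat_completions"):
    """Return a copy of `payload` with the effort field set for the chosen API."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    result = copy.deepcopy(payload)  # leave the caller's dict untouched
    if api == "chat_completions":
        result["reasoning_effort"] = effort
    elif api == "messages":
        result.setdefault("output_config", {})["effort"] = effort
    else:
        raise ValueError(f"unknown api: {api!r}")
    return result
```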

Chat Completions API (curl)

curl https://api.stepfun.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "messages": [
      {
        "role": "user",
        "content": "Explain reinforcement learning in three sentences."
      }
    ],
    "reasoning_effort": "medium",
    "max_tokens": 1024
  }'

Messages API (curl)

curl https://api.stepfun.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -d '{
    "model": "step-3.7-flash",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": "Explain reinforcement learning in three sentences."
      }
    ],
    "output_config": {
      "effort": "medium"
    }
  }'

Python (Chat Completions API)

response = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[{
        "role": "user",
        "content": "Analyze the trends and outliers in this data chart"
    }],
    reasoning_effort="high",  # Use high reasoning effort
    max_tokens=2048,
)

Field reference

`image_url`

Field	Type	Required	Notes
`type`	string	yes	Fixed to `"image_url"`
`url`	string	yes	Image source. Supports URL, Base64, and `stepfile://`
`detail`	string	no	Image detail level: `low` (default) or `high`
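
The examples earlier in this guide do not set `detail`. The sketch below builds an image item with an optional detail level. It assumes `detail` sits next to `url` inside the image_url object, as in other OpenAI-compatible APIs; this guide does not show the placement, so verify it against the API reference. `image_part` is a hypothetical helper name.

```python
def image_part(url, detail=None):
    """Build an image_url content item, optionally with a detail level."""
    image_url = {"url": url}
    if detail is not None:
        if detail not in ("low", "high"):
            raise ValueError(f"detail must be 'low' or 'high', got {detail!r}")
        # Assumption: `detail` is nested beside `url`, not at the top level.
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}
```

Omitting `detail` leaves the documented default (`low`) in effect.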

`video_url`

Field	Type	Required	Notes
`type`	string	yes	Fixed to `"video_url"`
`url`	string	yes	Video source. Supports URL, Base64, and `stepfile://`
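
Both tables accept three kinds of source in the `url` field. A quick way to check which kind a string is, for logging or validation, is sketched below. It assumes Base64 input is always passed as a data: URL, as in the image example earlier; the guide does not show the Base64 form for video. `classify_source` is a hypothetical helper name.

```python
def classify_source(url):
    """Return 'stepfile', 'base64', or 'url' depending on the source string's prefix."""
    if url.startswith("stepfile://"):
        return "stepfile"
    if url.startswith("data:"):
        return "base64"  # assumption: Base64 content is wrapped in a data: URL
    if url.startswith(("http://", "https://")):
        return "url"
    raise ValueError(f"unrecognized source: {url!r}")
```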

FAQ

Q: Video upload fails. What do I do?

A: Make sure the video meets these conditions:

Format: MP4, QuickTime (.mov), or Matroska (.mkv)
Size: under 128 MB
Duration: under 5 minutes

If the video exceeds the limits, you can split it with ffmpeg:

# Split into 2-minute segments
ffmpeg -i input.mp4 -c copy -f segment -segment_time 120 -reset_timestamps 1 output_%d.mp4
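
If you split videos from a script, the same command can be built as an argument list and run with `subprocess`. This sketch mirrors the flags above and assumes ffmpeg is on your PATH; the function names are hypothetical. Because `-c copy` does not re-encode, ffmpeg cuts at keyframes, so segment lengths are approximate.

```python
import subprocess


def build_split_command(input_path, segment_seconds=120, output_pattern="output_%d.mp4"):
    """Return the ffmpeg argument list that splits a video into fixed-length segments."""
    return [
        "ffmpeg", "-i", input_path,
        "-c", "copy",                      # copy streams without re-encoding
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        "-reset_timestamps", "1",
        output_pattern,
    ]


def split_video(input_path, segment_seconds=120):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_split_command(input_path, segment_seconds), check=True)
```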

Q: Image / video responses are slow.

A: Upload your files to StepFun storage via the Files API and reference them with stepfile:// for faster access. For images, you can also set detail to low. For video, keep file size and duration small.

Q: How do I process multiple images in one request?

A: Pass multiple image_url items in the content array:

"content": [
    {"type": "text", "text": "Compare the differences between these two images"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
]
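
When the number of images varies, the content array can be built programmatically. This is a minimal sketch that produces the same structure as the fragment above; `build_content` is a hypothetical helper name, and the guide does not state a maximum image count, so none is enforced.

```python
def build_content(prompt, image_urls):
    """Build a content array with one text item followed by one image_url item per URL."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return content
```

The result can be passed as the `content` value of a user message.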

Q: Which image formats are supported?

A: JPG / JPEG, PNG, static GIF, and WebP.

Q: Which video formats are supported?

A: MP4, QuickTime (.mov), and Matroska (.mkv).

Next steps

Cookbook

Reusable task templates for whiteboard-to-plan, chart-to-data, receipt-to-table, and more.

Image understanding best practices

A deeper look at image understanding parameters, detail mode, and performance tips.

Video understanding best practices

A deeper look at video understanding limits, pricing estimates, and ffmpeg usage.

Reasoning model guide

Recommended usage of reasoning models for complex tasks, tool calling, and long contexts.
