Documentation Index
Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
step-3.7-flash lets you mix images, video, and text in a single conversation. It’s well-suited to turning visual information from the real world into plans, tables, code drafts, or diagnostic findings. This page collects prompt templates for common scenarios to help you quickly decide how to shape your inputs and outputs for each task.
These examples focus on task design and output format. For how to call the API, see the Quickstart. For details on image and video parameters, see Image understanding best practices and Video understanding best practices.
Usage tips
- Be explicit about the output format: ask for a Markdown table, JSON, CSV rows, or a task list.
- Require evidence for key fields: for amounts, dates, chart values, or task owners, have the model cite the source or mark uncertainty.
- Don’t let the model guess missing info: fields that are unclear should be
null, empty strings, or “cannot confirm”. - For high-risk data such as finance, expenses, contracts, or medical info, always have a human review.
Whiteboard to plan
Good for meeting whiteboards, sticky-note walls, hand-drawn flowcharts, project-discussion photos. The goal is to convert loose information into an actionable plan, not to transcribe word for word.Chart to data
Good for report screenshots, dashboard screenshots, bar charts, line charts, pie charts. The goal is to convert chart content into structured data while keeping track of uncertainty.Chart screenshots can be affected by resolution, compression, and axis scaling. When you need exact numbers, prefer the original data source. The model’s extraction is best treated as a first draft or as assistance for manual entry.
Receipt to table
Good for receipts, invoices, expense reports, shopping slips. The goal is to convert the document into structured row data that can be pasted directly into a spreadsheet or piped into a system.Screenshot to code
Good for web pages, mobile UIs, component screenshots, and design mockups. The goal is to produce an initial HTML / React / Tailwind draft you can iterate on.Screen-recording diagnostics
Good for software-operation recordings, bug-repro videos, app usage paths, customer-support recordings. The goal is to reconstruct the user’s actions, locate the anomaly, and provide debug guidance.Multi-image comparison
Good for comparing design revisions, product photos, UI-state differences, scanned-page differences.Structured output tips
When you need to feed the result into a program or spreadsheet, ask for JSON or CSV explicitly and state your null policy:response_format to enable JSON Mode. See JSON Mode usage tips and Chat Completions API.
For tasks that need human review, have the model emit confidence alongside values:
Next steps
Multimodal quickstart
Learn the basics of calling images, video, Base64, and the Files API.
Chat Completions API
See
messages, image_url, video_url, reasoning_effort, and other parameters.Mobile Agent
Connect to a real Android device via GELab-Zero and have the model plan mobile operations.