Run the Seedance 2.0 Video Generation API on fal
ByteDance's most advanced video generation model. Generate cinematic video with native audio, real-world physics, and director-level camera control, all in a single pass.
`bytedance/seedance-2.0`
Overview
Seedance 2.0 is built on a unified multimodal architecture that accepts text, images, video clips, and audio as inputs and produces coherent, audio-synchronized video output. Audio and video are generated together natively, with no post-production layering.
API Endpoints
| Endpoint | Model ID |
|---|---|
| Text to Video | `bytedance/seedance-2.0/text-to-video` |
| Text to Video (Fast) | `bytedance/seedance-2.0/fast/text-to-video` |
| Image to Video | `bytedance/seedance-2.0/image-to-video` |
| Image to Video (Fast) | `bytedance/seedance-2.0/fast/image-to-video` |
| Reference to Video | `bytedance/seedance-2.0/reference-to-video` |
| Reference to Video (Fast) | `bytedance/seedance-2.0/fast/reference-to-video` |
Standard endpoints prioritize maximum quality. Fast endpoints offer lower latency and cost for production workloads.
Pricing
| Endpoint | Price |
|---|---|
| 720p with audio | $0.3034 / second |
| 720p fast with audio | $0.2419 / second |
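As a worked example of the per-second rates above (assuming billing is strictly per second of generated output; `estimate_cost` is an illustrative helper, not part of any SDK):

```python
# Per-second prices from the pricing table above.
RATES = {
    "720p": 0.3034,       # standard endpoint, with audio
    "720p-fast": 0.2419,  # fast endpoint, with audio
}

def estimate_cost(duration_s: float, tier: str = "720p") -> float:
    """Price of a clip: seconds of output times the per-second rate."""
    return duration_s * RATES[tier]

# A 5-second standard 720p clip costs about $1.52; the fast tier about $1.21.
standard = estimate_cost(5)
fast = estimate_cost(5, "720p-fast")
```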
What's new in 2.0
Multimodal reference inputs. Combine up to 9 images, 3 video clips, and 3 audio files in a single generation via the reference-to-video endpoint. Reference them in your prompt using `[Image1]`, `[Video1]`, `[Audio1]`, etc.
Better motion and physics. More realistic rendering of complex interactions — sports, dancing, fighting, object collisions, and more.
Video editing and extension. Provide a reference video and describe what to change, or describe what should happen next to extend it.
Intelligent duration. Set `duration` to `"auto"` and the model picks the optimal length for the content.
Adaptive aspect ratio. Set `aspect_ratio` to `"auto"` and the model chooses the best fit based on your inputs.
Usage
Install the client:
```bash
npm install --save @fal-ai/client
```
Note: `@fal-ai/serverless-client` is deprecated. Use `@fal-ai/client` instead.
Python
```python
import fal_client

result = fal_client.subscribe(
    "bytedance/seedance-2.0/text-to-video",
    arguments={
        "prompt": "A spear-wielding warrior clashes with a dual-blade fighter in a maple leaf forest. Autumn leaves scatter on each impact.",
        "duration": "5",
        "resolution": "720p",
        "aspect_ratio": "16:9",
    },
)

print(result["video"]["url"])
```
JavaScript
```javascript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("bytedance/seedance-2.0/text-to-video", {
  input: {
    prompt: "A spear-wielding warrior clashes with a dual-blade fighter in a maple leaf forest. Autumn leaves scatter on each impact.",
    duration: "5",
    resolution: "720p",
    aspect_ratio: "16:9",
  },
  logs: true,
  onQueueUpdate: (update) => {
    if (update.status === "IN_PROGRESS") {
      update.logs.map((log) => log.message).forEach(console.log);
    }
  },
});

console.log(result.data);
```
REST
```bash
curl -X POST https://fal.run/bytedance/seedance-2.0/text-to-video \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A spear-wielding warrior clashes with a dual-blade fighter in a maple leaf forest.",
    "duration": "5",
    "resolution": "720p",
    "aspect_ratio": "16:9"
  }'
```
Input schema
Text to Video (`bytedance/seedance-2.0/text-to-video`)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | — | Required. Scene description. Put spoken dialogue in double quotes for lip-synced audio. |
| `resolution` | string | `"720p"` | `"480p"` or `"720p"` |
| `duration` | string | `"auto"` | `"auto"` or `"4"` through `"15"` |
| `aspect_ratio` | string | `"auto"` | `"auto"`, `"21:9"`, `"16:9"`, `"4:3"`, `"1:1"`, `"3:4"`, `"9:16"` |
| `generate_audio` | boolean | `true` | Generate synchronized audio alongside video. |
| `seed` | integer | — | Optional seed for reproducibility. |
| `end_user_id` | string | — | Required for B2B access. Unique identifier for your end customer. |
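The allowed values in the table above can be checked client-side before submitting a request. This is an illustrative validator mirroring the schema, not an official SDK check (`validate_t2v_args` is a hypothetical helper):

```python
# Allowed values taken from the text-to-video parameter table.
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_DURATIONS = {"auto"} | {str(n) for n in range(4, 16)}  # "4".."15"
ALLOWED_RATIOS = {"auto", "21:9", "16:9", "4:3", "1:1", "3:4", "9:16"}

def validate_t2v_args(args: dict) -> dict:
    """Raise ValueError if the arguments violate the input schema."""
    if not args.get("prompt"):
        raise ValueError("prompt is required")
    if args.get("resolution", "720p") not in ALLOWED_RESOLUTIONS:
        raise ValueError("resolution must be 480p or 720p")
    if args.get("duration", "auto") not in ALLOWED_DURATIONS:
        raise ValueError('duration must be "auto" or "4" through "15"')
    if args.get("aspect_ratio", "auto") not in ALLOWED_RATIOS:
        raise ValueError("unsupported aspect_ratio")
    return args

# Dialogue goes in double quotes so the model lip-syncs it.
args = validate_t2v_args({
    "prompt": 'The guide smiles and says: "Welcome to the valley."',
    "duration": "5",
    "resolution": "720p",
    "aspect_ratio": "16:9",
})
```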
Image to Video (`bytedance/seedance-2.0/image-to-video`)
All text-to-video parameters, plus:
| Parameter | Type | Description |
|---|---|---|
| `image_url` | string | Required. Start frame image URL. Accepted: jpg, jpeg, png, webp, gif, avif. |
| `end_image_url` | string | Optional end frame image to control where the video concludes. |
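A minimal sketch of assembling image-to-video arguments from the table above. `build_i2v_args` is a hypothetical helper, and the extension check is naive (it would reject otherwise-valid URLs with query strings):

```python
# Accepted formats from the image_url row of the schema table.
ACCEPTED_EXTS = (".jpg", ".jpeg", ".png", ".webp", ".gif", ".avif")

def build_i2v_args(prompt: str, image_url: str, end_image_url: str = None, **extra) -> dict:
    """Build an image-to-video argument dict; end_image_url pins the final frame."""
    if not image_url.lower().endswith(ACCEPTED_EXTS):
        raise ValueError(f"image_url must end in one of {ACCEPTED_EXTS}")
    args = {"prompt": prompt, "image_url": image_url, **extra}
    if end_image_url is not None:
        args["end_image_url"] = end_image_url
    return args

args = build_i2v_args(
    "The portrait slowly turns toward the camera and smiles.",
    "https://example.com/portrait.png",
    duration="5",
)
```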
Reference to Video (`bytedance/seedance-2.0/reference-to-video`)
Accepts text prompts combined with up to 9 images, 3 video clips, and 3 audio files. Reference inputs in your prompt as `[Image1]`, `[Video1]`, `[Audio1]`, etc.
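The reference limits above (9 images, 3 clips, 3 audio files) can be enforced when assembling a request. This is a sketch only: the field names `image_urls`, `video_urls`, and `audio_urls` are assumptions, not confirmed parameter names.

```python
def build_reference_args(prompt, image_urls=(), video_urls=(), audio_urls=()):
    """Build a reference-to-video argument dict, enforcing the documented limits.

    Field names here are assumed, not taken from an official schema.
    """
    if len(image_urls) > 9:
        raise ValueError("at most 9 reference images")
    if len(video_urls) > 3:
        raise ValueError("at most 3 reference video clips")
    if len(audio_urls) > 3:
        raise ValueError("at most 3 reference audio files")
    return {
        "prompt": prompt,
        "image_urls": list(image_urls),
        "video_urls": list(video_urls),
        "audio_urls": list(audio_urls),
    }

# Inputs are referenced in the prompt as [Image1], [Video1], [Audio1], ...
args = build_reference_args(
    "The character from [Image1] performs the dance from [Video1] to [Audio1].",
    image_urls=["https://example.com/character.png"],
    video_urls=["https://example.com/dance.mp4"],
    audio_urls=["https://example.com/track.mp3"],
)
```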
Output schema
```json
{
  "video": {
    "url": "https://v3b.fal.media/files/...",
    "content_type": "video/mp4",
    "file_name": "video.mp4",
    "file_size": 4352150
  },
  "seed": 1094575694
}
```
Access the video URL at `result["video"]["url"]` (Python) or `result.data.video.url` (JavaScript).
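To persist the output, a minimal stdlib sketch for fetching the MP4 from the returned URL (assuming the URL is publicly fetchable; `download_video` is an illustrative helper):

```python
import urllib.request

def download_video(url: str, path: str = "video.mp4") -> str:
    """Fetch the generated MP4 from the result URL and save it to disk."""
    urllib.request.urlretrieve(url, path)
    return path

# Usage: download_video(result["video"]["url"])
```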
Supported resolutions
| Resolution | 21:9 | 16:9 | 4:3 | 1:1 | 3:4 | 9:16 |
|---|---|---|---|---|---|---|
| 480p | 992×432 | 864×496 | 752×560 | 640×640 | 560×752 | 496×864 |
| 720p | 1470×630 | 1280×720 | 1112×834 | 960×960 | 834×1112 | 720×1280 |
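When you need the exact output dimensions programmatically (for layout, cost estimates, or post-processing), the table above can be mirrored as a lookup; `output_size` is an illustrative helper:

```python
# (width, height) per tier and aspect ratio, mirroring the resolution table.
RESOLUTIONS = {
    "480p": {"21:9": (992, 432), "16:9": (864, 496), "4:3": (752, 560),
             "1:1": (640, 640), "3:4": (560, 752), "9:16": (496, 864)},
    "720p": {"21:9": (1470, 630), "16:9": (1280, 720), "4:3": (1112, 834),
             "1:1": (960, 960), "3:4": (834, 1112), "9:16": (720, 1280)},
}

def output_size(resolution: str, aspect_ratio: str) -> tuple:
    """Return the (width, height) the model produces for a tier/ratio pair."""
    return RESOLUTIONS[resolution][aspect_ratio]
```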
Capabilities
Text to video. Describe a scene and get video with matching audio. The model handles multi-subject interactions, camera movements, and emotional tone. For dialogue, put speech in double quotes — the model generates matching lip movements and voice.
Image to video. Animate a still image using it as the first frame. Optionally provide an end frame to control where the video concludes. The model preserves the look and style of your input image while adding natural motion.
Multimodal reference. Combine images, videos, and audio as references in a single generation via the reference-to-video endpoint. Provide a reference video for motion style, reference images for character appearance, and reference audio for rhythm — then describe how to combine them. Powerful for outfit-change videos, product showcases, and music-synced content.
Video editing. Provide a reference video and describe changes — replace an object, change a background, alter the style. The model preserves original motion and camera work while applying your edits.
Video extension. Provide a reference video and describe what should happen next. The model continues the scene with consistent characters, environment, and style.
Tips
- Be specific. Describe camera movements, lighting, mood, and specific actions for best results.
- Dialogue. Wrap spoken lines in double quotes: `The man stopped and said: "Remember this moment."`
- Referencing inputs. Label them explicitly in your prompt: `"The character from [Image1] performs the dance from [Video1]."`
- Video editing. Describe what to change and what to preserve: `"Replace the perfume in [Video1] with the face cream from [Image1], keeping all original motion."`
- Iterate fast. Start with 5-second generations to nail the style, then increase duration.