Run Seedance 2.0 AI Reference To Video API on fal

ByteDance's most advanced reference-to-video model, available on fal as `bytedance/seedance-2.0/reference-to-video`.


Overview

The most flexible Seedance 2.0 endpoint: provide a text prompt alongside up to 12 reference files spanning images, videos, and audio clips, and the model weaves them into a single cinematic output. Reference assets in your prompt using `@Image1`, `@Video1`, `@Audio1`, etc.
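Labels are positional: the first URL in `image_urls` is `@Image1`, the second is `@Image2`, and so on for videos and audio. A minimal sketch of a hypothetical client-side helper (not part of the API) that maps prompt labels to the URLs they refer to:

```python
def reference_labels(image_urls=(), video_urls=(), audio_urls=()):
    """Map prompt labels (@Image1, @Video1, ...) to their URLs.

    Hypothetical helper for sanity-checking prompts; labels are
    assigned purely by position in each list.
    """
    labels = {}
    for kind, urls in (("Image", image_urls), ("Video", video_urls), ("Audio", audio_urls)):
        for i, url in enumerate(urls, start=1):
            labels[f"@{kind}{i}"] = url
    return labels
```

This makes it easy to verify that every `@…` token in a prompt actually has a corresponding reference file before submitting a request.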

Key capabilities:

  • Native audio generation: music, SFX, and lip-synced dialogue, all in a single pass at no extra cost
  • Director-level camera control via prompt: dolly zooms, tracking shots, POV switches, handheld movement
  • Realistic physics: fluid dynamics, cloth, character motion
  • Multi-shot editing: natural cuts within a single generation, up to 15 seconds
  • Output up to 1080p

Inputs

| Modality | Limit | Formats | Constraints |
| --- | --- | --- | --- |
| Images | Up to 9 | JPEG, PNG, WebP | Max 30 MB each |
| Videos | Up to 3 | MP4, MOV | Combined duration 2–15 s, total under 50 MB, 480p–720p resolution per video |
| Audio | Up to 3 | MP3, WAV | Combined duration ≤ 15 s, max 15 MB each; requires at least one image or video |

Total files across all modalities must not exceed 12.
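The count-based limits can be checked client-side before uploading anything. A sketch of a hypothetical validator (duration and file-size limits require inspecting the actual media, so only counts are checked here):

```python
def validate_references(image_urls=(), video_urls=(), audio_urls=()):
    """Check the documented per-modality and total file limits (sketch)."""
    if len(image_urls) > 9:
        raise ValueError("at most 9 reference images")
    if len(video_urls) > 3:
        raise ValueError("at most 3 reference videos")
    if len(audio_urls) > 3:
        raise ValueError("at most 3 reference audio clips")
    if audio_urls and not (image_urls or video_urls):
        raise ValueError("audio references require at least one image or video")
    if len(image_urls) + len(video_urls) + len(audio_urls) > 12:
        raise ValueError("at most 12 reference files in total")
    return True
```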


Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `prompt` | string | | Text description; reference assets as `@Image1`, `@Video1`, `@Audio1`, etc. |
| `image_urls` | list<string> | | Reference image URLs |
| `video_urls` | list<string> | | Reference video URLs |
| `audio_urls` | list<string> | | Reference audio URLs |
| `resolution` | enum | `720p` | `480p`, `720p`, `1080p` |
| `duration` | enum | `auto` | `auto` or any integer from `4` to `15` seconds |
| `aspect_ratio` | enum | `auto` | `auto`, `21:9`, `16:9`, `4:3`, `1:1`, `3:4`, `9:16` |
| `generate_audio` | boolean | `true` | Synchronized audio: SFX, ambient sound, lip-synced speech. Same price either way. |
| `seed` | integer | | Fix for reproducibility (minor variation may still occur) |
| `end_user_id` | string | | Optional identifier for the end user |

Pricing

Billed per second of generated 720p output:

| Condition | Rate | 10-sec clip |
| --- | --- | --- |
| Standard, no video input | $0.3024 / sec | ~$3.02 |
| Standard, with video input | $0.1814 / sec (0.6× discount) | ~$1.81 |
| Fast tier, no video input | $0.2419 / sec | ~$2.42 |
| Fast tier, with video input | $0.1452 / sec (0.6× discount) | ~$1.45 |
| Token-based billing | $0.014 / 1,000 tokens | |

Token formula (note: includes both input and output duration):

tokens = (height × width × (input_duration + output_duration) × 24) / 1024

Audio generation is included at no extra cost regardless of the `generate_audio` setting.
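The formula and rates above can be combined into a quick cost estimator. A sketch, assuming 720p output is 1280 × 720 pixels and the standard-tier token rate of $0.014 per 1,000 tokens:

```python
def estimate_cost(width, height, output_duration, input_duration=0.0,
                  has_video_input=False, price_per_1k_tokens=0.014):
    """Estimate token count and USD cost from the documented formula (sketch)."""
    tokens = (height * width * (input_duration + output_duration) * 24) / 1024
    cost = tokens / 1000 * price_per_1k_tokens
    if has_video_input:
        cost *= 0.6  # video-input discount
    return tokens, cost

# 10 s of 720p output, no video input: 216,000 tokens, ~$3.02
# (consistent with the $0.3024/sec rate above)
tokens, cost = estimate_cost(1280, 720, 10)
```

With `has_video_input=True` the same clip comes out to roughly $1.81, matching the discounted $0.1814/sec rate.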


Quick Start

Python

```bash
pip install fal-client
export FAL_KEY="YOUR_API_KEY"
```

```python
import fal_client

def on_queue_update(update):
    # Print streaming log messages while the request is in progress
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])

result = fal_client.subscribe(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={
        "prompt": "@Image1 shows a product on a marble surface. Slow dolly in with dramatic side lighting, dust particles floating in the air.",
        "image_urls": ["https://your-host.com/product.jpg"],
        "resolution": "1080p",
        "duration": "8",
        "aspect_ratio": "16:9",
        "generate_audio": True,
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)

print(result["video"]["url"])
```

Multi-modal example with images, video, and audio:

```python
result = fal_client.subscribe(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={
        "prompt": "Recreate the scene from @Video1 but replace the background with the environment from @Image1. Use @Audio1 as the soundtrack.",
        "image_urls": ["https://your-host.com/background.jpg"],
        "video_urls": ["https://your-host.com/scene.mp4"],
        "audio_urls": ["https://your-host.com/soundtrack.mp3"],
        "resolution": "720p",
        "duration": "auto",
        "aspect_ratio": "auto",
    },
)
```
JavaScript / Node.js

```bash
npm install @fal-ai/client
export FAL_KEY="YOUR_API_KEY"
```

```js
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("bytedance/seedance-2.0/reference-to-video", {
  input: {
    prompt: "@Image1 shows a product on a marble surface. Slow dolly in with dramatic side lighting, dust particles floating in the air.",
    image_urls: ["https://your-host.com/product.jpg"],
    resolution: "1080p",
    duration: "8",
    aspect_ratio: "16:9",
    generate_audio: true,
  },
  logs: true,
  onQueueUpdate: (update) => {
    if (update.status === "IN_PROGRESS") {
      update.logs.map((log) => log.message).forEach(console.log);
    }
  },
});

console.log(result.data.video.url);
```

Output

```json
{
  "video": {
    "url": "https://...",
    "content_type": "video/mp4",
    "file_name": "output.mp4",
    "file_size": 4823041
  },
  "seed": 42
}
```

Async / Queue Usage

```python
handler = fal_client.submit(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={...},
    webhook_url="https://your-server.com/webhook",
)

request_id = handler.request_id
status = fal_client.status("bytedance/seedance-2.0/reference-to-video", request_id, with_logs=True)
result = fal_client.result("bytedance/seedance-2.0/reference-to-video", request_id)
```
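If you poll status manually instead of using a webhook or `subscribe`, the loop itself is generic. A sketch of a hypothetical poller (not part of `fal_client`) that you would wrap around a closure calling `fal_client.status(...)` and returning the result once complete:

```python
import time

def poll_until(check, interval=5.0, timeout=600.0):
    """Call `check()` until it returns a non-None value or the deadline passes.

    Generic polling sketch: `check` should return None while the request
    is queued or in progress, and the final result once it completes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("request did not complete within the timeout")
```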

Standard vs. Fast Tier

The two tiers share an identical schema; they differ only in maximum resolution, cost, and latency. Use the fast tier unless you need 1080p.

| | Standard | Fast |
| --- | --- | --- |
| Endpoint | `bytedance/seedance-2.0/reference-to-video` | `bytedance/seedance-2.0/fast/reference-to-video` |
| Max resolution | 1080p | 720p |
| Cost (10 sec, 720p, no video input) | ~$3.02 | ~$2.42 |
| Cost (10 sec, 720p, with video input) | ~$1.81 | ~$1.45 |
| Latency | Higher | Lower |
| Schema | Identical | Identical |
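Because the schemas are identical, switching tiers is just a matter of choosing the endpoint string. A trivial sketch:

```python
STANDARD = "bytedance/seedance-2.0/reference-to-video"
FAST = "bytedance/seedance-2.0/fast/reference-to-video"

def choose_endpoint(resolution):
    # The fast tier tops out at 720p, so 1080p requires the standard tier.
    return STANDARD if resolution == "1080p" else FAST
```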

Enterprise Variant

A separate enterprise endpoint adds face-input support for consistent character identity across generations:

fal-ai/seedance-2/enterprise/reference-to-video

Availability

  • April 2, 2026: Launched with enterprise-only, geo-restricted access
  • April 9, 2026: All restrictions lifted, fully open with no geographic or use-case limitations

Resources