Run Seedance 2.0 AI Reference To Video API on fal

ByteDance's most advanced reference-to-video model, available on fal as `bytedance/seedance-2.0/reference-to-video`.


Overview

The most flexible Seedance 2.0 endpoint: provide a text prompt alongside up to 12 reference files spanning images, videos, and audio clips, and the model weaves them into a single cinematic output. Reference assets in your prompt using `@Image1`, `@Video1`, `@Audio1`, etc.
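Labels are positional: the first URL in `image_urls` is `@Image1`, the second is `@Image2`, and so on for videos and audio. A minimal sketch of a hypothetical client-side helper (not part of the API) that maps prompt labels to the URLs they refer to:

```python
def reference_labels(image_urls=(), video_urls=(), audio_urls=()):
    """Map prompt labels (@Image1, @Video1, ...) to their URLs.

    Hypothetical helper for sanity-checking prompts; labels are
    assigned purely by position in each list.
    """
    labels = {}
    for kind, urls in (("Image", image_urls), ("Video", video_urls), ("Audio", audio_urls)):
        for i, url in enumerate(urls, start=1):
            labels[f"@{kind}{i}"] = url
    return labels
```

This makes it easy to verify that every `@…` token in a prompt actually has a corresponding reference file before submitting a request.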

Key capabilities:

  • Native audio generation: music, SFX, and lip-synced dialogue, all in a single pass at no extra cost
  • Director-level camera control via prompt: dolly zooms, tracking shots, POV switches, handheld movement
  • Realistic physics: fluid dynamics, cloth, character motion
  • Multi-shot editing: natural cuts within a single generation, up to 15 seconds
  • Output up to 1080p

Inputs

| Modality | Limit | Formats | Constraints |
| --- | --- | --- | --- |
| Images | Up to 9 | JPEG, PNG, WebP | Max 30 MB each |
| Videos | Up to 3 | MP4, MOV | Combined duration 2–15 s, total under 50 MB, 480p–720p resolution per video |
| Audio | Up to 3 | MP3, WAV | Combined duration ≤ 15 s, max 15 MB each; requires at least one image or video |

Total files across all modalities must not exceed 12.
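The count-based limits can be checked client-side before uploading anything. A sketch of a hypothetical validator (duration and file-size limits require inspecting the actual media, so only counts are checked here):

```python
def validate_references(image_urls=(), video_urls=(), audio_urls=()):
    """Check the documented per-modality and total file limits (sketch)."""
    if len(image_urls) > 9:
        raise ValueError("at most 9 reference images")
    if len(video_urls) > 3:
        raise ValueError("at most 3 reference videos")
    if len(audio_urls) > 3:
        raise ValueError("at most 3 reference audio clips")
    if audio_urls and not (image_urls or video_urls):
        raise ValueError("audio references require at least one image or video")
    if len(image_urls) + len(video_urls) + len(audio_urls) > 12:
        raise ValueError("at most 12 reference files in total")
    return True
```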


Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `prompt` | string | | Text description; reference assets as `@Image1`, `@Video1`, `@Audio1`, etc. |
| `image_urls` | list<string> | | Reference image URLs |
| `video_urls` | list<string> | | Reference video URLs |
| `audio_urls` | list<string> | | Reference audio URLs |
| `resolution` | enum | `720p` | `480p`, `720p`, `1080p` |
| `duration` | enum | `auto` | `auto` or any integer from `4` to `15` seconds |
| `aspect_ratio` | enum | `auto` | `auto`, `21:9`, `16:9`, `4:3`, `1:1`, `3:4`, `9:16` |
| `generate_audio` | boolean | `true` | Synchronized audio: SFX, ambient sound, lip-synced speech. Same price either way. |
| `seed` | integer | | Fix for reproducibility (minor variation may still occur) |
| `end_user_id` | string | | Optional identifier for the end user |

Pricing

Billed per second of generated 720p output:

| Condition | Rate | 10-sec clip |
| --- | --- | --- |
| Standard, no video input | $0.3024 / sec | ~$3.02 |
| Standard, with video input | $0.1814 / sec (0.6× discount) | ~$1.81 |
| Fast tier, no video input | $0.2419 / sec | ~$2.42 |
| Fast tier, with video input | $0.1452 / sec (0.6× discount) | ~$1.45 |
| Token-based billing | $0.014 / 1,000 tokens | |

Token formula (note: includes both input and output duration):

tokens = (height × width × (input_duration + output_duration) × 24) / 1024

Audio generation is included at no extra cost regardless of the `generate_audio` setting.
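The formula and rates above can be combined into a quick cost estimator. A sketch, assuming 720p output is 1280 × 720 pixels and the standard-tier token rate of $0.014 per 1,000 tokens:

```python
def estimate_cost(width, height, output_duration, input_duration=0.0,
                  has_video_input=False, price_per_1k_tokens=0.014):
    """Estimate token count and USD cost from the documented formula (sketch)."""
    tokens = (height * width * (input_duration + output_duration) * 24) / 1024
    cost = tokens / 1000 * price_per_1k_tokens
    if has_video_input:
        cost *= 0.6  # video-input discount
    return tokens, cost

# 10 s of 720p output, no video input: 216,000 tokens, ~$3.02
# (consistent with the $0.3024/sec rate above)
tokens, cost = estimate_cost(1280, 720, 10)
```

With `has_video_input=True` the same clip comes out to roughly $1.81, matching the discounted $0.1814/sec rate.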


Quick Start

Python

```bash
pip install fal-client
export FAL_KEY="YOUR_API_KEY"
```

```python
import fal_client

def on_queue_update(update):
    # Print streaming log messages while the request is in progress
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])

result = fal_client.subscribe(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={
        "prompt": "@Image1 shows a product on a marble surface. Slow dolly in with dramatic side lighting, dust particles floating in the air.",
        "image_urls": ["https://your-host.com/product.jpg"],
        "resolution": "1080p",
        "duration": "8",
        "aspect_ratio": "16:9",
        "generate_audio": True,
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)

print(result["video"]["url"])
```

Multi-modal example with images, video, and audio:

```python
result = fal_client.subscribe(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={
        "prompt": "Recreate the scene from @Video1 but replace the background with the environment from @Image1. Use @Audio1 as the soundtrack.",
        "image_urls": ["https://your-host.com/background.jpg"],
        "video_urls": ["https://your-host.com/scene.mp4"],
        "audio_urls": ["https://your-host.com/soundtrack.mp3"],
        "resolution": "720p",
        "duration": "auto",
        "aspect_ratio": "auto",
    },
)
```
JavaScript / Node.js

```bash
npm install @fal-ai/client
export FAL_KEY="YOUR_API_KEY"
```

```js
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("bytedance/seedance-2.0/reference-to-video", {
  input: {
    prompt: "@Image1 shows a product on a marble surface. Slow dolly in with dramatic side lighting, dust particles floating in the air.",
    image_urls: ["https://your-host.com/product.jpg"],
    resolution: "1080p",
    duration: "8",
    aspect_ratio: "16:9",
    generate_audio: true,
  },
  logs: true,
  onQueueUpdate: (update) => {
    if (update.status === "IN_PROGRESS") {
      update.logs.map((log) => log.message).forEach(console.log);
    }
  },
});

console.log(result.data.video.url);
```

Output

```json
{
  "video": {
    "url": "https://...",
    "content_type": "video/mp4",
    "file_name": "output.mp4",
    "file_size": 4823041
  },
  "seed": 42
}
```

Async / Queue Usage

```python
handler = fal_client.submit(
    "bytedance/seedance-2.0/reference-to-video",
    arguments={...},
    webhook_url="https://your-server.com/webhook",
)

request_id = handler.request_id
status = fal_client.status("bytedance/seedance-2.0/reference-to-video", request_id, with_logs=True)
result = fal_client.result("bytedance/seedance-2.0/reference-to-video", request_id)
```
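If you poll status manually instead of using a webhook or `subscribe`, the loop itself is generic. A sketch of a hypothetical poller (not part of `fal_client`) that you would wrap around a closure calling `fal_client.status(...)` and returning the result once complete:

```python
import time

def poll_until(check, interval=5.0, timeout=600.0):
    """Call `check()` until it returns a non-None value or the deadline passes.

    Generic polling sketch: `check` should return None while the request
    is queued or in progress, and the final result once it completes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("request did not complete within the timeout")
```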

Standard vs. Fast Tier

The two tiers share an identical schema; they differ only in maximum resolution, cost, and latency. Use the fast tier unless you need 1080p.

| | Standard | Fast |
| --- | --- | --- |
| Endpoint | `bytedance/seedance-2.0/reference-to-video` | `bytedance/seedance-2.0/fast/reference-to-video` |
| Max resolution | 1080p | 720p |
| Cost (10 sec, 720p, no video input) | ~$3.02 | ~$2.42 |
| Cost (10 sec, 720p, with video input) | ~$1.81 | ~$1.45 |
| Latency | Higher | Lower |
| Schema | Identical | Identical |
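Because the schemas are identical, switching tiers is just a matter of choosing the endpoint string. A trivial sketch:

```python
STANDARD = "bytedance/seedance-2.0/reference-to-video"
FAST = "bytedance/seedance-2.0/fast/reference-to-video"

def choose_endpoint(resolution):
    # The fast tier tops out at 720p, so 1080p requires the standard tier.
    return STANDARD if resolution == "1080p" else FAST
```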

Enterprise Variant

A separate enterprise endpoint adds face-input support for consistent character identity across generations:

fal-ai/seedance-2/enterprise/reference-to-video

Availability

  • April 2, 2026: Launched with enterprise-only, geo-restricted access
  • April 9, 2026: All restrictions lifted, fully open with no geographic or use-case limitations

Resources