Seedance 1.5 generates synchronized audio alongside video at a maximum resolution of 720p, with variable duration control (4-12 seconds). Seedance v1 outputs 1080p video without audio. Choose 1.5 when integrated audio saves production time or enables new creative directions.
Choosing Between ByteDance's Video Models
ByteDance's Seedance models represent two distinct approaches to AI video generation. Seedance 1.5 Pro introduces joint audio-video generation, producing synchronized sound alongside visuals from a single text prompt. The original Seedance v1 focuses exclusively on visual output, but supports higher 1080p resolution. The decision between them hinges on whether your workflow prioritizes integrated audio or maximum visual fidelity.
The architectural differences between these models extend beyond feature sets. Seedance 1.5 employs a dual-branch diffusion transformer that renders video and audio in the same latent space, enabling tight lip-sync and natural foley without post-production work. This multimodal approach builds on research demonstrating that joint audio-video training improves both semantic alignment and temporal synchronization compared to cascaded generation pipelines [1].
Seedance v1, by contrast, channels all computational resources toward visual generation, achieving superior resolution at the cost of audio capability.
Core Capabilities
Seedance v1 established ByteDance's presence in multi-shot video generation with support for both text-to-video and image-to-video workflows. The architecture employs decoupled spatial and temporal layers with an interleaved multimodal positional encoding scheme, enabling native multi-shot generation and consistent subject representation across temporal-spatial transformations. At 1080p output resolution, v1 remains the higher-fidelity option for purely visual applications.
Seedance 1.5 Pro represents a fundamental architectural shift rather than an incremental update. This is ByteDance's first joint audio-video model, processing complex prompts that describe both visual elements and audio cues simultaneously. The model interprets dialogue, environmental sounds, and musical elements alongside visual descriptions. According to fal's documentation, it uses a dual-branch diffusion transformer to render video and audio in the same latent space, producing tight lip-sync and natural foley without additional post-production steps.
Technical Specifications
Seedance v1 on fal supports six aspect ratios (21:9 through 9:16) with 1080p output as its primary advantage. Both models share identical aspect ratio options.
Seedance 1.5 Pro expands the parameter space considerably:
- Resolution options: 480p for faster generation, 720p for balanced quality
- Aspect ratios: Six options spanning 21:9 ultra-wide to 9:16 vertical, covering cinematic formats through mobile-optimized outputs
- Duration control: Variable length from 4 to 12 seconds, enabling precise cost management
- Camera controls: Optional fixed camera position for static shot compositions
- Audio toggle: Enable or disable audio generation based on workflow requirements
- Safety checker: Configurable content moderation
The prompt structure for Seedance 1.5 accommodates audio descriptions directly. The key difference from v1 appears in the prompt itself:
// Seedance v1 prompt (visual only)
"Courtroom scene, defense attorney giving closing argument, jury watching intently"
// Seedance 1.5 prompt (visual + audio)
"Defense attorney declaring 'Ladies and gentlemen, reasonable doubt is the foundation of justice itself', footsteps on marble, jury shifting, courtroom drama"
The model interprets both the visual scene and acoustic landscape from this single input.
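The parameter list above can be sketched as a small request builder. This is a minimal illustration, not fal's schema: only generate_audio is named in this article, so the other field names (resolution, aspect_ratio, duration, camera_fixed, enable_safety_checker) are assumptions, as are the four aspect ratios between 21:9 and 9:16.

```python
# Values taken from the parameter list in this article.
RESOLUTIONS = {"480p", "720p"}
MIN_DURATION_S, MAX_DURATION_S = 4, 12

# Assumption: the article only names the two ends of the range (21:9 and 9:16);
# the four ratios in between are illustrative guesses.
ASPECT_RATIOS = {"21:9", "16:9", "4:3", "1:1", "3:4", "9:16"}


def build_seedance15_arguments(
    prompt,
    resolution="720p",
    aspect_ratio="16:9",
    duration=5,
    camera_fixed=False,          # hypothetical field name
    generate_audio=True,         # named in the article; defaults to true
    enable_safety_checker=True,  # hypothetical field name
):
    """Validate the settings and return them as a plain dict."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        raise ValueError(
            f"duration must be between {MIN_DURATION_S} and {MAX_DURATION_S} seconds"
        )
    return {
        "prompt": prompt,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "camera_fixed": camera_fixed,
        "generate_audio": generate_audio,
        "enable_safety_checker": enable_safety_checker,
    }


# A single prompt carries both the visual scene and the audio cues.
courtroom = build_seedance15_arguments(
    prompt=(
        "Defense attorney declaring 'Ladies and gentlemen, reasonable doubt is "
        "the foundation of justice itself', footsteps on marble, jury shifting, "
        "courtroom drama"
    ),
    duration=8,
)
```

Validating locally keeps out-of-range settings, such as a 2-second duration or a 1080p resolution, from reaching the API.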
Performance and Speed Comparison
Generation speed differs meaningfully between these models, affecting both development workflows and production economics.
| Specification | Seedance v1 | Seedance 1.5 Pro |
|---|---|---|
| Maximum Resolution | 1080p | 720p (480p also selectable) |
| Audio Generation | No | Yes (synchronized) |
| Duration Range | 2-12 seconds | 4-12 seconds |
| Aspect Ratio Options | Six (21:9 to 9:16) | Six (21:9 to 9:16) |
| Pricing (5s video) | ~$0.62 (1080p) | ~$0.26 (720p with audio) |
| Video Extension | No | Yes |
| End-Frame Conditioning | No | Yes |
Seedance v1 delivers predictable generation times without audio processing overhead. A 5-second 1080p video generates in approximately 41 seconds on an NVIDIA L20 GPU. For projects requiring 1080p output, this remains the only option between the two models.
Seedance 1.5 offers dual resolution modes for different use cases. The 480p mode prioritizes speed, suitable for rapid prototyping and preview generation. The 720p mode balances quality and generation time for production use. Because ByteDance architected this as a joint model rather than separate video and audio pipelines, the audio does not simply double generation time. Both modalities are processed simultaneously, yielding efficient combined output.
Pricing Structure
Cost efficiency varies based on parameter selection and workflow structure.
Seedance v1 pricing follows a straightforward model based on resolution and duration. Each 1080p 5-second video costs approximately $0.62. Other resolutions are priced at $2.5 per million video tokens, where the token count is (height x width x FPS x duration) / 1024.
Seedance 1.5 Pro pricing reflects the integrated audio capability. Each 720p 5-second video with audio costs approximately $0.26. For other resolutions, pricing is $2.4 per million video tokens with audio enabled and $1.2 per million tokens without audio. Developers who want 1.5's advanced features (video extension, end-frame conditioning) without the audio cost can disable audio generation and pay the lower rate. When factoring in the cost of separate audio generation, synchronization, and post-processing that would otherwise require additional API calls and processing time, the combined output becomes economically attractive.
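The token pricing above can be worked through with a short calculation. This sketch makes two assumptions that the article does not state: a frame rate of 24 FPS, and that Seedance 1.5 counts tokens with the same formula quoted for v1. The function and constant names are illustrative.

```python
# Rates quoted in this article, in USD per million video tokens.
V1_RATE = 2.5
V15_RATE_WITH_AUDIO = 2.4
V15_RATE_WITHOUT_AUDIO = 1.2

ASSUMED_FPS = 24  # assumption: the article does not state the frame rate


def video_tokens(width, height, duration_s, fps=ASSUMED_FPS):
    """Token count as (height x width x FPS x duration) / 1024."""
    return (height * width * fps * duration_s) / 1024


def estimate_cost_usd(width, height, duration_s, usd_per_million_tokens,
                      fps=ASSUMED_FPS):
    """Estimated price of one clip at the given per-million-token rate."""
    tokens = video_tokens(width, height, duration_s, fps)
    return tokens / 1_000_000 * usd_per_million_tokens


v1_1080p = estimate_cost_usd(1920, 1080, 5, V1_RATE)
v15_720p_audio = estimate_cost_usd(1280, 720, 5, V15_RATE_WITH_AUDIO)
v15_720p_silent = estimate_cost_usd(1280, 720, 5, V15_RATE_WITHOUT_AUDIO)

if __name__ == "__main__":
    print(f"v1   1080p 5s:          ${v1_1080p:.4f}")
    print(f"v1.5 720p  5s, audio:   ${v15_720p_audio:.4f}")
    print(f"v1.5 720p  5s, silent:  ${v15_720p_silent:.4f}")
```

Under the 24 FPS assumption, the 720p case with audio comes to about $0.26 and the 1080p case to about $0.61, in line with the approximate figures quoted above. Disabling audio halves the 720p estimate to about $0.13.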
The 480p option in Seedance 1.5 provides a budget-friendly entry point for:
- Rapid concept testing and creative exploration
- Social media content optimized for mobile viewing
- High-volume generation scenarios with flexible resolution requirements
- Development and testing phases before final production
Output Quality Characteristics
Seedance v1 produces videos with smooth motion, rich detail, and naturalistic color grading. The model maintains temporal coherence across frames, avoiding jittery or morphing artifacts. For image-to-video workflows, source image consistency remains high, with generated motion extending naturally from the starting frame.
Seedance 1.5 Pro maintains these visual quality standards while adding contextually appropriate audio. The synchronized audio generation produces spatially consistent sound that matches visual timing and scene characteristics across four categories: dialogue and speech with appropriate emotional tone, sound effects synchronized with visual elements, ambient environmental audio, and musical accompaniment when prompted.
Advanced Capabilities
Seedance 1.5 Pro extends beyond standard text-to-video generation with capabilities unavailable in v1.
Image-to-Video with Audio allows you to upload a start frame and optionally an end frame. Seedance 1.5 Pro generates the motion, camera movement, dialogue, and sound design in between.
Video Extension enables extending existing video clips while preserving motion continuity, subject identity, and scene coherence. Your prompt guides subsequent action with optional audio generation for extended segments. For additional video extension options, consider LTX Video-0.9.7 13B or Pixverse.
Use Case Recommendations
Choose Seedance v1 when:
- maximum 1080p resolution is required
- you are animating existing assets and source consistency matters
- audio will be added separately in post-production
For alternative image-to-video approaches, Pixverse Image to Video offers comparable capabilities.
Choose Seedance 1.5 Pro when:
- integrated audio matters for your application (social media, advertising, educational videos)
- your prompts describe specific dialogue and sounds alongside the visuals
- you need video extension or end-frame conditioning for creative control unavailable in v1
Migration Considerations
Migrating from Seedance v1 to 1.5 requires updating the endpoint to fal-ai/bytedance/seedance/v1.5/pro/text-to-video. Existing authentication and client library code remain compatible. Review the fal documentation for implementation details.
Your existing v1 prompts will work with 1.5, but adding audio descriptions lets them take advantage of the model's full capabilities. The generate_audio parameter defaults to true; set it to false explicitly for video-only output matching v1 behavior. For resolution, 720p is often perceptually close to 1080p at typical viewing sizes, though projects with a hard 1080p requirement should stay on v1.
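The migration steps can be sketched as a small helper that adapts a v1-style request for 1.5. The endpoint string and the generate_audio parameter come from this article; the resolution and duration field names are assumptions, and the call through the fal_client package (fal_client.subscribe, authenticated via the FAL_KEY environment variable) is shown only as one plausible way to submit the request.

```python
V15_ENDPOINT = "fal-ai/bytedance/seedance/v1.5/pro/text-to-video"  # from the article

V15_MAX_RESOLUTION = "720p"
V15_MIN_DURATION_S, V15_MAX_DURATION_S = 4, 12


def migrate_v1_arguments(v1_arguments, keep_silent=True):
    """Return a copy of a v1 request adjusted to Seedance 1.5's limits.

    Field names "resolution" and "duration" are assumed, not confirmed.
    """
    args = dict(v1_arguments)  # leave the caller's dict untouched

    # 1.5 tops out at 720p, so a 1080p request has to be stepped down.
    if args.get("resolution") == "1080p":
        args["resolution"] = V15_MAX_RESOLUTION

    # v1 accepts clips as short as 2 seconds; 1.5 starts at 4.
    if "duration" in args:
        args["duration"] = max(
            V15_MIN_DURATION_S, min(V15_MAX_DURATION_S, int(args["duration"]))
        )

    # generate_audio defaults to true, so video-only output must be explicit.
    args["generate_audio"] = not keep_silent
    return args


def run_on_v15(arguments):
    """Submit the request. Needs the fal_client package and a FAL_KEY."""
    import fal_client  # imported lazily so the helper above works without it

    return fal_client.subscribe(V15_ENDPOINT, arguments=arguments)


old_request = {"prompt": "Courtroom scene, jury watching intently",
               "resolution": "1080p", "duration": 2}
new_request = migrate_v1_arguments(old_request)
```

Passing keep_silent=False leaves audio on, which is the point at which adding dialogue and sound cues to the prompt starts to matter.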
Decision Framework
1. Does your output require 1080p resolution? If yes, Seedance v1 is your only option. If 720p suffices, proceed to question two.
2. Does synchronized audio add value? If your workflow includes professional audio production or the content is purely visual, v1 makes sense. If integrated audio saves time or enables new creative directions, 1.5 delivers clear advantages. For audio enhancement, DeepFilterNet 3 can clean up generated audio in post-processing.
3. Do you need video extension or end-frame conditioning? Seedance 1.5 Pro offers these capabilities while v1 does not.
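The three questions can be condensed into a few lines of code. This is one reading of the framework rather than an official rule set: it treats a 1080p requirement as decisive, as the first question does, and falls back to v1 for purely visual work. The function and argument names are illustrative.

```python
def choose_model(needs_1080p=False, wants_integrated_audio=False,
                 needs_extension_or_end_frame=False):
    """Apply the decision questions in order and return a model name."""
    # Question 1: 1080p output is only available from v1.
    if needs_1080p:
        return "Seedance v1"
    # Questions 2 and 3: audio, video extension, and end-frame
    # conditioning are only available from 1.5 Pro.
    if wants_integrated_audio or needs_extension_or_end_frame:
        return "Seedance 1.5 Pro"
    # Purely visual work with audio handled elsewhere: v1 makes sense.
    return "Seedance v1"


social_clip = choose_model(wants_integrated_audio=True)
print_ad_loop = choose_model(needs_1080p=True)
```

A project that needs both 1080p and video extension has no single match between the two models; this sketch resolves the conflict in favor of resolution.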
References
[1] Ruan, Ludan, et al. "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation." arXiv:2212.09478 (2022). https://arxiv.org/abs/2212.09478
