Sync Lipsync 2.0 Video to Video

fal-ai/sync-lipsync/v2
Generate realistic, tightly synchronized lipsync animations from audio with the Sync Lipsync 2.0 model.
Inference
Commercial use
Partner

Sync Lipsync 2.0 | [video-to-video]

Sync Labs' Lipsync 2.0 generates frame-accurate lip synchronization from any audio source at $3 per minute of video. With automated audio-visual alignment, the model processes existing video footage and audio files to produce natural-looking speech animations. Built for content creators who need realistic dubbing, localization, or voice replacement without reshooting footage.

Use Cases: Video Dubbing & Localization | Content Creator Voice Replacement | Marketing Video Personalization


Performance

Sync Lipsync 2.0 applies audio-driven facial animation to existing video at $3 per minute, making it a production-ready option for dubbing workflows that previously required manual rotoscoping or expensive studio sessions.

Metric | Result | Context
Model Variants | lipsync-2, lipsync-2-pro | Pro variant costs $5/minute (1.67x standard) for enhanced quality
Cost per Minute | $3.00 | Standard lipsync-2 model pricing
Input Formats | MP4, MOV, WebM, M4V video / MP3, OGG, WAV, M4A, AAC audio | Accepts web URLs or direct file uploads
Sync Modes | 5 duration handling options | cut_off, loop, bounce, silence, remap for audio/video length mismatches
Related Endpoints | Lipsync 1.9.0-beta, Lipsync 2.0 Pro | Previous generation and quality-optimized variants
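
As a rough budgeting aid, the per-minute rates above can be turned into a cost estimate. The sketch below assumes billing is prorated linearly by clip duration; the page only states the per-minute rates, so the actual rounding rules may differ. The function name estimate_cost is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-minute rates quoted on this page (USD).
RATE_PER_MINUTE = {
    "lipsync-2": Decimal("3.00"),
    "lipsync-2-pro": Decimal("5.00"),
}


def estimate_cost(duration_seconds: float, model: str = "lipsync-2") -> Decimal:
    """Estimate the charge for one clip.

    Assumption: cost scales linearly with duration (no minimum charge,
    no rounding up to whole minutes). Check your invoice for the real rule.
    """
    if model not in RATE_PER_MINUTE:
        raise ValueError(f"unknown model variant: {model!r}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    cost = minutes * RATE_PER_MINUTE[model]
    # Round to whole cents.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # Compare both variants for a 90-second clip.
    for variant in RATE_PER_MINUTE:
        print(variant, estimate_cost(90, variant))
```

Under the linear-proration assumption, a 90-second clip comes to $4.50 on lipsync-2 and $7.50 on lipsync-2-pro.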

Audio-Driven Facial Animation Without Reshooting

Sync Lipsync 2.0 uses audio waveform analysis to generate mouth movements that match speech phonemes, eliminating the need for motion capture rigs or manual keyframe animation. Unlike traditional dubbing that requires actors to physically re-perform scenes, this approach applies new audio to existing footage while preserving the original performance's timing and emotion.

What this means for you:

  • Multi-language content without studio time: Dub marketing videos, tutorials, or social content into different languages by swapping audio tracks. No need to reshoot with native speakers or hire voice actors for on-camera work.

  • 5 sync mode options for duration mismatches: When audio runs longer or shorter than video, choose cut_off (trim excess), loop (repeat video), bounce (reverse playback), silence (pad with stillness), or remap (time-stretch) to maintain synchronization without manual editing.

  • Two-tier quality system: Standard lipsync-2 handles most conversational content at $3/minute, while lipsync-2-pro ($5/minute) delivers enhanced facial detail for close-up shots or high-stakes commercial work where subtle mouth movements matter.

  • URL-based workflow integration: Submit video and audio as direct URLs instead of uploading files, so automated pipelines for batch dubbing or content localization can pull media straight from cloud storage.
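
The options listed above map naturally onto a small request payload. The following is a minimal sketch, not the official client code: the argument names (model, video_url, audio_url, sync_mode) and the cut_off default are assumptions inferred from this page, and the submit helper assumes the fal-client Python package is installed and configured with credentials. Consult the endpoint's API documentation for the authoritative schema.

```python
from typing import Dict
from urllib.parse import urlparse

ENDPOINT = "fal-ai/sync-lipsync/v2"
MODELS = ("lipsync-2", "lipsync-2-pro")
SYNC_MODES = ("cut_off", "loop", "bounce", "silence", "remap")


def _require_http_url(name: str, value: str) -> None:
    """Reject anything that is not an absolute http(s) URL."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{name} must be an http(s) URL, got {value!r}")


def build_arguments(
    video_url: str,
    audio_url: str,
    model: str = "lipsync-2",
    sync_mode: str = "cut_off",  # assumed default, not confirmed by this page
) -> Dict[str, str]:
    """Validate inputs and return the request payload (field names assumed)."""
    _require_http_url("video_url", video_url)
    _require_http_url("audio_url", audio_url)
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if sync_mode not in SYNC_MODES:
        raise ValueError(f"sync_mode must be one of {SYNC_MODES}")
    return {
        "model": model,
        "video_url": video_url,
        "audio_url": audio_url,
        "sync_mode": sync_mode,
    }


def submit(arguments: Dict[str, str]):
    """Send the request and wait for the result.

    Imported lazily so build_arguments works without the package installed.
    """
    import fal_client

    return fal_client.subscribe(ENDPOINT, arguments=arguments)
```

Validating the model variant and sync mode locally catches typos before a billable request is sent. A batch pipeline could call build_arguments once per video/audio pair and pass each result to submit.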


Technical Specifications

Spec | Details
Architecture | Sync Lipsync 2.0
Input Formats | Video: MP4, MOV, WebM, M4V, GIF / Audio: MP3, OGG, WAV, M4A, AAC
Output Formats | MP4 video with synchronized audio
Sync Modes | cut_off, loop, bounce, silence, remap
License | Commercial use permitted with Partner designation
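
Before submitting a job, it can help to check that each media URL carries one of the extensions listed in the specifications. This is a hedged sketch: it assumes the file extension reflects the real container format, which the service may or may not rely on, and the helper name media_kind is hypothetical. The format sets are copied from the Technical Specifications list.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions taken from the Technical Specifications list on this page.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".m4v", ".gif"}
AUDIO_EXTENSIONS = {".mp3", ".ogg", ".wav", ".m4a", ".aac"}


def media_kind(url: str) -> Optional[str]:
    """Return "video", "audio", or None based on the URL path's extension.

    Query strings and fragments are ignored; matching is case-insensitive.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return None


if __name__ == "__main__":
    for candidate in (
        "https://example.com/clips/intro.mp4",
        "https://example.com/voice/take1.wav",
        "https://example.com/notes.txt",
    ):
        print(candidate, "->", media_kind(candidate))
```

A None result is only a hint that the file may be rejected; signed URLs without an extension would need a content-type check instead.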


How It Stacks Up

Lipsync 2.0 Pro ($5/minute) – Sync Lipsync 2.0 prioritizes cost efficiency for high-volume dubbing at $3/minute. Lipsync 2.0 Pro costs about 1.67x as much in exchange for enhanced facial animation quality, which suits close-up commercial content or projects where subtle mouth movement accuracy justifies the premium.

MiniMax Video 01 Live – Sync Lipsync 2.0 focuses specifically on audio-driven lip synchronization for existing footage. MiniMax Video 01 Live generates complete video sequences from text prompts, serving text-to-video creation workflows rather than audio-based editing of existing content.