Endpoint:
POST https://fal.run/fal-ai/kling-video/lipsync/audio-to-video
Endpoint ID: fal-ai/kling-video/lipsync/audio-to-video
Quick Start
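As a rough illustration of calling the endpoint above, the following sketch builds the POST request with the Python standard library only. The input field names `video_url` and `audio_url`, the `Key <api key>` authorization header format, and the `FAL_KEY` environment variable are assumptions not stated on this page, and the example URLs are placeholders.

```python
import json
import os
import urllib.request

ENDPOINT = "https://fal.run/fal-ai/kling-video/lipsync/audio-to-video"


def build_request(video_url: str, audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the lip-sync endpoint."""
    # Assumption: the two inputs are named "video_url" and "audio_url".
    payload = {"video_url": video_url, "audio_url": audio_url}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: key-based auth of the form "Key <api key>".
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_request(
        "https://example.com/clip.mp4",    # placeholder video URL
        "https://example.com/speech.mp3",  # placeholder audio URL
        os.environ.get("FAL_KEY", "YOUR_API_KEY"),
    )
    # Only inspect the request here; sending it would be
    # urllib.request.urlopen(req, timeout=...) followed by JSON parsing.
    print(req.get_method(), req.full_url)
```

Because processing is reported to take around 12 minutes, a synchronous call like this would likely need a generous timeout.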
Input Schema
The URL of the video to generate the lip sync for. Supports .mp4/.mov, ≤100MB, 2–10s, 720p/1080p only, width/height 720–1920px.
The URL of the audio to generate the lip sync for. Minimum duration is 2s and maximum duration is 60s. Maximum file size is 5MB.
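The constraints above can be checked locally before uploading, which may save a failed request. This is a minimal sketch: the `VideoInfo` and `AudioInfo` types and the `check_inputs` helper are hypothetical, the metadata is assumed to come from a probing tool of your choice, and "MB" is assumed to mean 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import List

VIDEO_EXTENSIONS = (".mp4", ".mov")
MAX_VIDEO_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as MiB
MAX_AUDIO_BYTES = 5 * 1024 * 1024    # assumption: "5MB" read as MiB


@dataclass
class VideoInfo:  # hypothetical container for probed video metadata
    filename: str
    size_bytes: int
    duration_s: float
    width: int
    height: int


@dataclass
class AudioInfo:  # hypothetical container for probed audio metadata
    size_bytes: int
    duration_s: float


def check_inputs(video: VideoInfo, audio: AudioInfo) -> List[str]:
    """Return a list of constraint violations (empty if none were found)."""
    problems = []
    if not video.filename.lower().endswith(VIDEO_EXTENSIONS):
        problems.append("video must be .mp4 or .mov")
    if video.size_bytes > MAX_VIDEO_BYTES:
        problems.append("video must be at most 100MB")
    if not 2 <= video.duration_s <= 10:
        problems.append("video duration must be 2-10s")
    if not (720 <= video.width <= 1920 and 720 <= video.height <= 1920):
        problems.append("video width/height must be 720-1920px")
    if not 2 <= audio.duration_s <= 60:
        problems.append("audio duration must be 2-60s")
    if audio.size_bytes > MAX_AUDIO_BYTES:
        problems.append("audio must be at most 5MB")
    return problems
```

For example, `check_inputs(VideoInfo("clip.mp4", 10_000_000, 5.0, 1280, 720), AudioInfo(1_000_000, 12.0))` returns an empty list.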
Output Schema
The generated video
Input Example
Output Example
Performance
At $0.014 per 5-second video increment (rounded up), Kling LipSync is a specialized audio-to-video tool that trades inference speed for lip-sync precision. Processing takes approximately 12 minutes regardless of video duration within the 2-10 second input range.

| Metric | Result | Context |
|---|---|---|
| Inference Speed | ~12 minutes | Fixed processing time for 2-10s input videos |
| Cost per Video | $0.014 per 5s increment | Billed in 5-second increments (3s video = 5s charge) |
| Input Video Duration | 2-10 seconds | 720p/1080p, ≤100MB, .mp4/.mov only |
| Audio Duration | 2-60 seconds | ≤5MB file size, .mp3/.wav/.ogg/.m4a/.aac formats |
| Resolution Support | 720p-1080p | Width/height constrained to 720-1920px |
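
The increment-based billing in the table can be worked through with a small calculation. This sketch assumes the charge depends only on video duration, rounded up to the next 5-second increment; the `estimate_cost` helper is hypothetical and not part of any API.

```python
import math
from decimal import Decimal

PRICE_PER_INCREMENT = Decimal("0.014")  # USD per increment, from the table above
INCREMENT_SECONDS = 5


def estimate_cost(video_seconds: float) -> Decimal:
    """Estimate the charge in USD, rounding duration up to 5-second increments."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    increments = math.ceil(video_seconds / INCREMENT_SECONDS)
    return PRICE_PER_INCREMENT * increments


if __name__ == "__main__":
    print(estimate_cost(3))   # 0.014 (one increment)
    print(estimate_cost(10))  # 0.028 (two increments)
```

`Decimal` is used so the results stay exact rather than picking up floating-point rounding noise.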
Precision Lip-Sync Without Training Data
Kling LipSync uses an audio-driven facial animation architecture that generates mouth movements directly from audio waveforms, without requiring speaker-specific training data. Traditional lip-sync approaches, by contrast, need extensive footage of the target speaker. What this means for you:

- Zero-shot speaker adaptation: Sync any audio to any face without pre-training on that specific person, enabling rapid dubbing workflows across multiple speakers and languages
- Extended audio support: Process up to 60 seconds of audio against 2-10 second video clips, useful for looping background characters or extending dialogue beyond source footage length
- Format flexibility: Accepts 5 audio formats (.mp3, .wav, .ogg, .m4a, .aac) and standard video containers (.mp4, .mov), integrating into existing video generation workflows without format conversion overhead
- Increment-based pricing transparency: 5-second billing increments mean predictable costs (a 3-second video is billed as 5 seconds, at $0.014)
Technical Specifications
| Spec | Details |
|---|---|
| Architecture | Kling LipSync |
| Input Formats | Video: .mp4, .mov (2-10s, ≤100MB) / Audio: .mp3, .wav, .ogg, .m4a, .aac (2-60s, ≤5MB) |
| Output Formats | .mp4 video with synchronized lip movements |
| Resolution Requirements | 720p or 1080p input, width/height 720-1920px |
| License | Commercial use via fal partnership |
Related
- Kling LipSync Text-to-Video — Video Generation
- Kling LipSync Audio-to-Video — Video Generation
Limitations
- voice_language restricted to: zh, en
- voice_speed range: 0.8 to 2