Wan-2.2 Speech-to-Video 14B Audio to Video
fal-ai/wan/v2.2-14b/speech-to-video
Wan-S2V is a video model that generates high-quality video from a static image and an audio track, with realistic facial expressions, body movements, and professional camera work for film and television applications.
Inference
Commercial use
Input
Image: upload a file or provide a URL. Accepted file types: jpg, jpeg, png, webp, gif, avif

Audio: upload a file or provide a URL. Accepted file types: mp3, ogg, wav, m4a, aac
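The accepted file types listed above can be checked locally before uploading. The following is a minimal sketch, not part of the service: the helper names (`extension_of`, `is_accepted_image`, `is_accepted_audio`) are hypothetical, and it assumes the extension in the file name or URL path reflects the actual format. The service may validate the file contents instead.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Accepted extensions, copied from the input hints on this page.
IMAGE_TYPES = {"jpg", "jpeg", "png", "webp", "gif", "avif"}
AUDIO_TYPES = {"mp3", "ogg", "wav", "m4a", "aac"}


def extension_of(path_or_url: str) -> str:
    """Return the lowercase extension (without the dot) of a file name or URL."""
    # urlparse separates the path from any query string or fragment;
    # for a plain file name the path is the whole string.
    path = urlparse(path_or_url).path
    return PurePosixPath(path).suffix.lower().lstrip(".")


def is_accepted_image(path_or_url: str) -> bool:
    """Hypothetical helper: True if the extension is in the image list."""
    return extension_of(path_or_url) in IMAGE_TYPES


def is_accepted_audio(path_or_url: str) -> bool:
    """Hypothetical helper: True if the extension is in the audio list."""
    return extension_of(path_or_url) in AUDIO_TYPES
```

A check like this only catches obvious mismatches, such as passing an audio file as the image input, before a paid request is made.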
Result
A generation takes approximately 5 minutes.
Your request is billed per second of generated video: $0.20 at 720p, $0.15 at 580p, and $0.10 at 480p.
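To make the pricing concrete, here is a small sketch that estimates the cost of a request from the rates stated above. The function name `estimate_cost` is hypothetical, and the sketch assumes the price scales linearly with video duration; how the service rounds fractional seconds is not stated on this page.

```python
from decimal import Decimal

# Per-second rates in USD, taken from the pricing line on this page.
RATES_USD_PER_SECOND = {
    "720p": Decimal("0.20"),
    "580p": Decimal("0.15"),
    "480p": Decimal("0.10"),
}


def estimate_cost(resolution: str, seconds) -> Decimal:
    """Estimate the cost in USD for a video of the given length.

    Assumes linear per-second billing with no rounding (an assumption,
    not confirmed by the source).
    """
    if resolution not in RATES_USD_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    # Convert via str so floats like 7.5 become exact decimals.
    duration = Decimal(str(seconds))
    if duration < 0:
        raise ValueError("duration must be non-negative")
    return RATES_USD_PER_SECOND[resolution] * duration


if __name__ == "__main__":
    for res in ("720p", "580p", "480p"):
        print(res, estimate_cost(res, 10))
```

Under these assumptions, a 10-second clip comes to $2.00 at 720p, $1.50 at 580p, and $1.00 at 480p.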