fal-ai/wan/v2.2-14b/speech-to-video

Wan-S2V is a video model that generates high-quality videos from a static image and an audio track, with realistic facial expressions, body movements, and professional camera work for film and television applications.

Commercial use

Input

Prompt*

Number of Frames

Frames per Second

Image URL*

Accepted file types: jpg, jpeg, png, webp, gif, avif

Audio URL*

Accepted file types: mp3, ogg, wav, m4a, aac
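The inputs listed above can be assembled into a request payload before submitting a job. The following is a minimal sketch, not the service's official client code. The key names (`prompt`, `image_url`, `audio_url`, `num_frames`, `frames_per_second`) are assumptions inferred from the field labels on this page, and `build_arguments` is a hypothetical helper. The extension check is a client-side convenience only; URLs without a file extension may still be accepted by the service.

```python
import posixpath
from urllib.parse import urlparse

# File types taken from the "Accepted file types" lists on this page.
IMAGE_TYPES = {"jpg", "jpeg", "png", "webp", "gif", "avif"}
AUDIO_TYPES = {"mp3", "ogg", "wav", "m4a", "aac"}


def _extension(url):
    """Return the lowercase file extension of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def build_arguments(prompt, image_url, audio_url,
                    num_frames=None, frames_per_second=None):
    """Build a payload dict; key names are assumed, not confirmed by the page."""
    if not prompt:
        raise ValueError("prompt is required")
    if _extension(image_url) not in IMAGE_TYPES:
        raise ValueError(f"unsupported image type: {image_url}")
    if _extension(audio_url) not in AUDIO_TYPES:
        raise ValueError(f"unsupported audio type: {audio_url}")

    arguments = {
        "prompt": prompt,
        "image_url": image_url,
        "audio_url": audio_url,
    }
    # Optional fields are only sent when the caller sets them,
    # so the service defaults apply otherwise.
    if num_frames is not None:
        arguments["num_frames"] = num_frames
    if frames_per_second is not None:
        arguments["frames_per_second"] = frames_per_second
    return arguments
```

Leaving the optional fields out of the payload, rather than sending guessed defaults, avoids hard-coding values this page does not state.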

Result

A generation takes approximately 5 minutes.

{
  "video": {
    "content_type": "application/octet-stream",
    "file_name": "2c7ab2540af44eceaf5ffde4e8d094ed.mp4",
    "url": "https://v3.fal.media/files/panda/f7tXRCjvwEcVlmxHuw8kO_2c7ab2540af44eceaf5ffde4e8d094ed.mp4",
    "file_size": 4685303
  }
}
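A response of this shape can be read with the standard library alone. The sketch below parses the sample output shown above and pulls out the fields a caller would typically need; `summarize_result` is a hypothetical helper name. Note that the sample reports `content_type` as `application/octet-stream`, so the sketch relies on the file name, not the content type, to recognize an MP4.

```python
import json

# Sample response copied from the output shown on this page.
SAMPLE_RESULT = """
{
  "video": {
    "content_type": "application/octet-stream",
    "file_name": "2c7ab2540af44eceaf5ffde4e8d094ed.mp4",
    "url": "https://v3.fal.media/files/panda/f7tXRCjvwEcVlmxHuw8kO_2c7ab2540af44eceaf5ffde4e8d094ed.mp4",
    "file_size": 4685303
  }
}
"""


def summarize_result(raw_json):
    """Extract the download URL, file name, and size (in MB) from a result."""
    video = json.loads(raw_json)["video"]
    return {
        "url": video["url"],
        "file_name": video["file_name"],
        # file_size is assumed to be in bytes, which matches a ~4.7 MB clip.
        "size_mb": video["file_size"] / 1_000_000,
        "is_mp4": video["file_name"].lower().endswith(".mp4"),
    }


summary = summarize_result(SAMPLE_RESULT)
print(summary["file_name"], summary["is_mp4"])
```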

Your request will cost $0.20 per video second for 720p, $0.15 per video second for 580p, and $0.10 per video second for 480p. Video seconds are calculated at 16 frames per second.
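The pricing rule above can be turned into a quick estimate. This is a sketch under one stated assumption: billed seconds are taken to be the number of frames divided by 16, with no rounding, since the page does not say how partial seconds are billed. `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point drift in currency values.

```python
from decimal import Decimal

# Per-second rates in USD, as listed on this page.
RATE_PER_SECOND = {
    "720p": Decimal("0.20"),
    "580p": Decimal("0.15"),
    "480p": Decimal("0.10"),
}

# Video seconds are calculated at 16 frames per second.
BILLING_FPS = 16


def estimate_cost(num_frames, resolution):
    """Estimate the USD cost of a video with num_frames at a given resolution."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Assumption: no rounding of partial seconds is applied.
    billed_seconds = Decimal(num_frames) / Decimal(BILLING_FPS)
    return billed_seconds * RATE_PER_SECOND[resolution]


# 80 frames is 5 billed seconds.
print(estimate_cost(80, "720p"))
print(estimate_cost(80, "480p"))
```

For example, 80 frames correspond to 5 billed seconds, which comes to $1.00 at 720p and $0.50 at 480p under this assumption.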
