
OmniHuman Image to Video

fal-ai/bytedance/omnihuman
OmniHuman generates video using an image of a human figure paired with an audio file. It produces vivid, high-quality videos where the character’s emotions and movements maintain a strong correlation with the audio.
Inference
Commercial use
Partner


OmniHuman | [image-to-video]

ByteDance's OmniHuman model generates audio-synchronized videos from a single reference image at $0.14 per second of output. Trained on 18,700 hours of human motion data, the model delivers specialized lip-sync precision and a tight correlation between audio input and character movement. Built for content creators who need realistic talking-head videos without motion-capture equipment.

Use Cases: Social Media Content | Product Demos with Presenters | Educational Video Production


Performance

At $0.14 per second, OmniHuman sits in the mid-range for image-to-video generation on fal, trading cost for specialized audio-sync capabilities that generic models don't prioritize.

Metric | Result | Context
Audio Sync Quality | Tight emotion/movement correlation | Trained on 18,700 hours of human motion data
Max Audio Duration | 30 seconds | Hard limit enforced at API level
Cost per Second | $0.14 | Billed on actual audio/video duration
Output Quality | High-fidelity video | Specialized for human figure animation
Related Endpoints | OmniHuman v1.5, Seedance Pro, Seedance Lite | ByteDance family variants for different quality/cost tradeoffs
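Because billing is proportional to output duration, the cost of a clip can be estimated before submitting a request. A minimal sketch, using the $0.14/second rate and the 30-second cap from the table above:

```python
OMNIHUMAN_RATE = 0.14    # USD per second of output video
MAX_AUDIO_SECONDS = 30   # hard limit enforced at the API level


def estimate_cost(audio_seconds: float) -> float:
    """Estimate the charge for one OmniHuman generation."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    if audio_seconds > MAX_AUDIO_SECONDS:
        raise ValueError(f"audio exceeds the {MAX_AUDIO_SECONDS}s limit")
    return round(audio_seconds * OMNIHUMAN_RATE, 2)


print(estimate_cost(10))  # a 10-second clip costs $1.40
```

The same arithmetic gives the maximum possible charge per request: 30 s × $0.14 = $4.20.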

Audio-First Video Generation

OmniHuman flips the standard image-to-video workflow by making audio the primary control signal rather than text prompts or motion parameters. Where most models animate based on text descriptions or keyframes, this architecture analyzes audio waveforms to drive facial expressions, lip movements, and body language simultaneously.

What this means for you:

  • Natural speech synchronization: Upload any audio file up to 30 seconds and get matching lip movements without manual keyframe adjustment or phoneme mapping

  • Emotion-driven animation: The model interprets audio tone and pacing to generate corresponding facial expressions and body gestures, not just mouth shapes

  • Single-image input: Start with one reference photo rather than building character rigs or providing multiple angles

  • Production-ready output: High-fidelity video maintains visual consistency across the full duration without drift or quality degradation
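In practice, the audio-first workflow reduces to a single request carrying two URLs. A hedged sketch using fal's Python client (`pip install fal-client`); the parameter names `image_url`/`audio_url` and the response shape are assumptions here, so check the endpoint's API documentation for the exact schema:

```python
import os

# Inputs: one reference image and one audio file, both as URLs.
# The key names below are assumptions, not confirmed against the schema.
arguments = {
    "image_url": "https://example.com/presenter.png",  # single reference photo
    "audio_url": "https://example.com/voiceover.mp3",  # up to 30 seconds
}

# Only submit when credentials are configured (FAL_KEY is fal's
# environment variable for API authentication).
if os.environ.get("FAL_KEY"):
    import fal_client  # pip install fal-client

    result = fal_client.subscribe(
        "fal-ai/bytedance/omnihuman",
        arguments=arguments,
    )
    print(result)  # response includes the generated MP4's URL
```

`subscribe` blocks until generation finishes; for 30-second clips a queued submission with polling may be preferable.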


Technical Specifications

Spec | Details
Architecture | OmniHuman
Input Formats | Single image (JPEG, PNG, WebP, GIF, AVIF) + audio file (MP3, OGG, WAV, M4A, AAC)
Output Formats | MP4 video
Max Audio Duration | 30 seconds
License | Commercial use allowed via fal partnership
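The accepted formats above can be checked client-side before uploading anything. A small sketch that uses file extensions as a proxy for format (extension lists are taken from the spec table, with `.jpg` assumed as the common JPEG extension; robust validation should inspect file contents rather than names):

```python
from pathlib import Path

IMAGE_EXTS = {".jpeg", ".jpg", ".png", ".webp", ".gif", ".avif"}
AUDIO_EXTS = {".mp3", ".ogg", ".wav", ".m4a", ".aac"}


def check_inputs(image_path: str, audio_path: str) -> None:
    """Raise ValueError if either input has an unsupported extension."""
    if Path(image_path).suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"unsupported image format: {image_path}")
    if Path(audio_path).suffix.lower() not in AUDIO_EXTS:
        raise ValueError(f"unsupported audio format: {audio_path}")


check_inputs("presenter.png", "voiceover.mp3")  # passes silently
```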

API Documentation | Quickstart Guide | Enterprise Pricing


How It Stacks Up

ByteDance OmniHuman v1.5 ($0.14/sec) – OmniHuman v1.5 offers enhanced audio processing and improved motion quality at the same $0.14 per second price point. The original OmniHuman remains viable for projects where the current quality threshold meets requirements without needing v1.5's refinements.

Seedance 1.0 Pro ($0.14/sec) – OmniHuman prioritizes audio-driven human animation with specialized lip-sync capabilities. Seedance Pro trades audio control for broader creative motion generation across any subject type, ideal for product animations or scene transitions where audio sync isn't critical.

Seedance 1.0 Lite ($0.08/sec) – Seedance Lite delivers 43% cost savings ($0.08 vs $0.14 per second) by simplifying motion generation without audio processing. Best for budget-conscious projects that can handle reduced quality or don't require speech synchronization.