Omnihuman 1.5 costs $0.16/second versus $0.14/second for the original, but delivers configurable resolution (720p/1080p), turbo mode for faster iteration, and 60-second audio support at 720p.
Choosing Between ByteDance's Talking Head Models
ByteDance's Omnihuman 1.5 represents a significant architectural advancement over its predecessor, building on the Diffusion Transformer framework that has become the dominant paradigm for high-fidelity video synthesis [1]. Both models available on fal transform a static portrait and audio file into lip-synced video, but they diverge substantially in capability, cost structure, and production utility.
The decision between these models extends beyond simple quality comparisons. Version 1.5 introduces resolution configurability, accelerated generation modes, and extended audio duration support that fundamentally changes how developers can integrate talking head generation into production workflows.
API Schema Comparison
Both models share the same output structure, returning a video URL and duration value used for billing:
```json
{
  "video": { "url": "https://..." },
  "duration": 10.5
}
```
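For typed integrations, that shared payload maps onto a small interface. A minimal sketch; the name OmnihumanOutput is ours for illustration, not part of the API:

```typescript
// Shape of the shared output payload described above.
interface OmnihumanOutput {
  video: { url: string }; // URL of the generated lip-synced video
  duration: number;       // output length in seconds, used for billing
}
```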
The input schemas differ in available parameters:
| Parameter | Original | Version 1.5 | Notes |
|---|---|---|---|
| image_url | Required | Required | Publicly accessible URL or base64 data URI |
| audio_url | Required | Required | Max 30s (original); 30s at 1080p, 60s at 720p (v1.5) |
| resolution | N/A | Optional | Enum: "720p" or "1080p" (default: "1080p") |
| turbo_mode | N/A | Optional | Boolean for faster generation with a quality tradeoff |
| prompt | N/A | Optional | Text guidance for stylistic influence |
Accepted image formats include jpg, jpeg, png, webp, gif, and avif. Audio formats include mp3, ogg, wav, m4a, and aac. Both models accept publicly accessible URLs or base64 data URIs for file inputs.
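When source files are not already hosted, a base64 data URI is one way to supply them. A minimal Node.js sketch; the file names are placeholders, and the MIME type must match the actual file format:

```typescript
import { readFileSync } from "node:fs";

// Encode a local file as a base64 data URI for use as image_url or audio_url.
function toDataUri(path: string, mimeType: string): string {
  const base64 = readFileSync(path).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

const input = {
  image_url: toDataUri("portrait.png", "image/png"),
  audio_url: toDataUri("voiceover.mp3", "audio/mpeg"),
};
```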
Version 1.5 Enhancements
Resolution flexibility represents the most consequential upgrade. Version 1.5 allows explicit selection between 720p and 1080p output via the resolution parameter, enabling developers to balance quality against generation speed. The original model generates at fixed resolution without user control.
Audio length handling ties directly to resolution selection. At 720p, version 1.5 processes audio up to 60 seconds. At 1080p, the limit is 30 seconds. The original model enforces a 30-second maximum regardless of other settings.
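Because the audio ceiling depends on the chosen resolution, it can help to validate the pair before submitting a request. A small sketch of that rule; the 30-second and 60-second limits are the documented values, while the helper itself is illustrative:

```typescript
type Resolution = "720p" | "1080p";

// Maximum supported audio length per resolution for version 1.5, in seconds.
const MAX_AUDIO_SECONDS: Record<Resolution, number> = {
  "720p": 60,
  "1080p": 30,
};

function assertAudioFits(resolution: Resolution, audioSeconds: number): void {
  const limit = MAX_AUDIO_SECONDS[resolution];
  if (audioSeconds > limit) {
    throw new Error(
      `Audio is ${audioSeconds}s but ${resolution} supports at most ${limit}s; ` +
        "shorten the clip or, at 1080p, switch to 720p."
    );
  }
}
```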
Turbo mode (turbo_mode: true) accelerates generation with marginal quality reduction, proving valuable during development phases when rapid iteration outweighs pixel-perfect output.
Pricing and Cost Structure
| Video Duration | Original ($0.14/s) | Version 1.5 ($0.16/s) | Difference |
|---|---|---|---|
| 10 seconds | $1.40 | $1.60 | $0.20 |
| 30 seconds | $4.20 | $4.80 | $0.60 |
| 60 seconds | $8.40 | $9.60 | $1.20 |
| 1,000 min/month | $8,400 | $9,600 | $1,200 |
Both models charge based on the duration field in the response, which matches input audio length.
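Since billing follows the returned duration, projected cost is simple arithmetic. A sketch using the published per-second rates; the helper and its names are illustrative:

```typescript
// Published per-second rates at the time of writing.
const RATE_PER_SECOND = {
  original: 0.14,
  v1_5: 0.16,
} as const;

// Estimate cost for a clip; duration matches the input audio length in seconds.
function estimateCostUsd(
  model: keyof typeof RATE_PER_SECOND,
  durationSeconds: number
): number {
  return durationSeconds * RATE_PER_SECOND[model];
}

// 60 seconds on version 1.5: 60 * 0.16 = $9.60, matching the table above.
console.log(estimateCostUsd("v1_5", 60).toFixed(2));
```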
Implementation
Migration from the original to version 1.5 requires changing the endpoint identifier:
```javascript
import { fal } from "@fal-ai/client";

// Original
const result = await fal.subscribe("fal-ai/bytedance/omnihuman", {
  input: { image_url: "...", audio_url: "..." },
});

// Version 1.5 with optional parameters
const result15 = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "...",
    audio_url: "...",
    resolution: "720p",
    turbo_mode: true,
  },
});
```
The fal.subscribe method handles queue management automatically. For long-running requests or webhook-based architectures, use fal.queue.submit instead and poll status via fal.queue.status.
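A queue-based flow might look roughly like the sketch below, assuming the current @fal-ai/client package; verify the exact method signatures and status values against the client documentation before relying on them.

```typescript
import { fal } from "@fal-ai/client";

// Submit without blocking; optionally register a webhook for completion callbacks.
const { request_id } = await fal.queue.submit("fal-ai/bytedance/omnihuman/v1.5", {
  input: { image_url: "...", audio_url: "...", resolution: "720p" },
  webhookUrl: "https://example.com/fal-webhook", // placeholder URL
});

// Poll status until the job completes, then fetch the result.
let status = await fal.queue.status("fal-ai/bytedance/omnihuman/v1.5", {
  requestId: request_id,
});
while (status.status !== "COMPLETED") {
  await new Promise((resolve) => setTimeout(resolve, 5_000));
  status = await fal.queue.status("fal-ai/bytedance/omnihuman/v1.5", {
    requestId: request_id,
  });
}

const { data } = await fal.queue.result("fal-ai/bytedance/omnihuman/v1.5", {
  requestId: request_id,
});
console.log(data.video.url, data.duration);
```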
Performance Considerations
Both models operate on fal's serverless infrastructure, which manages cold starts and scales automatically under load. Generation time varies based on audio duration and, for version 1.5, resolution selection and turbo mode activation.
Version 1.5 demonstrates improved robustness for challenging inputs:
- Images with partial face occlusions
- Audio containing background noise
- Unusual lighting conditions in source images
- Profile or three-quarter angle portraits
The original model occasionally requires multiple generation attempts for edge cases that version 1.5 handles more reliably on first pass.
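Where edge-case inputs are expected on the original model, a bounded retry wrapper is one pragmatic mitigation. A generic sketch; the retry count, backoff, and wrapper are illustrative rather than part of the API:

```typescript
// Retry a generation call a limited number of times before giving up.
async function withRetries<T>(run: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await run();
    } catch (error) {
      lastError = error;
      // Brief backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 2_000 * (i + 1)));
    }
  }
  throw lastError;
}

// Usage: wrap a subscribe call for an edge-case input.
// const result = await withRetries(() =>
//   fal.subscribe("fal-ai/bytedance/omnihuman", {
//     input: { image_url: "...", audio_url: "..." },
//   })
// );
```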
Model Selection Guidelines
Select the original Omnihuman when:
- Operating under strict per-video cost constraints where $0.02/second savings compounds meaningfully
- Audio content stays under 30 seconds consistently
- Building proof-of-concept applications to validate feasibility
- Processing high volumes where individual quality differences matter less than aggregate cost
Select Omnihuman 1.5 when:
- Output quality directly impacts business value or brand presentation
- Production workflow benefits from 720p drafts before final 1080p renders
- Content requires audio segments between 30 and 60 seconds
- Iterative development demands rapid experimentation via turbo mode
- Input images include characteristics that stress model robustness
Architecture Overview
Both Omnihuman variants employ end-to-end multimodal conditioning, accepting a single reference image alongside an audio signal to generate temporally coherent video. The underlying architecture leverages a Diffusion Transformer (DiT) that processes motion-related conditions during training, enabling the model to learn natural motion patterns from large-scale datasets [1].
What distinguishes these models from conventional lip-sync tools is their capacity to generate nuanced micro-expressions and emotional dynamics that correlate meaningfully with audio characteristics, rather than mechanical mouth movements overlaid on a static face [2].
Strategic Considerations
For most new projects, version 1.5 represents the superior choice. Quality improvements, resolution flexibility, and extended audio support provide meaningful advantages that outweigh the modest cost increase. The ability to generate 720p videos faster while maintaining superior quality makes version 1.5 particularly attractive for production workflows.
The original remains viable for cost-sensitive applications where output quality meets requirements and feature limitations do not constrain the use case.
The advantage of fal's implementation is the absence of lock-in. Switching between models based on the requirements of each request allows using the original for high-volume operations while reserving version 1.5 for premium content. This flexibility enables optimizing the quality-cost tradeoff at a granular, per-request level.
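One way to express that per-request routing, as a sketch; the selection thresholds and field names are illustrative, while the endpoint identifiers are the ones used above:

```typescript
interface JobRequirements {
  audioSeconds: number;
  premium: boolean; // e.g. customer-facing or brand-sensitive content
}

// Route each request: v1.5 for premium work or audio over 30 seconds,
// the original for high-volume, cost-sensitive jobs under the 30-second limit.
function pickEndpoint(job: JobRequirements): string {
  if (job.premium || job.audioSeconds > 30) {
    return "fal-ai/bytedance/omnihuman/v1.5";
  }
  return "fal-ai/bytedance/omnihuman";
}
```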
Both models deliver results that transform static portraits into dynamic, emotionally expressive video within seconds. Whether selecting the original Omnihuman for proven reliability or version 1.5 for enhanced capabilities, developers access sophisticated image-to-video generation practical for production applications.
References
1. Lin, G., Jiang, J., Yang, J., Zheng, Z., & Liang, C. (2025). OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models. arXiv preprint arXiv:2502.01061. https://arxiv.org/abs/2502.01061
2. Rakesh, V. K., Mazumdar, S., Maity, R. P., Pal, S., Das, A., & Samanta, T. (2025). Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions. arXiv preprint arXiv:2507.02900. https://arxiv.org/abs/2507.02900