OmniHuman 1.5 interprets prompts semantically through multimodal reasoning, not simple audio synchronization. Structure prompts as mini-screenplays: camera direction, emotional arc, speaking state, and sequential actions. Use 720p for faster generation (up to 60s audio) or 1080p for quality (up to 30s audio).
From Audio Sync to Semantic Understanding
Avatar generation has long operated on a mechanical principle: detect audio, move lips. OmniHuman 1.5 abandons this reactive approach entirely. Built on a dual-system cognitive architecture inspired by psychological research on deliberative and intuitive thinking, the model synthesizes semantic guidance from text prompts, audio input, and source images simultaneously.[1]
This architectural shift means your prompts carry substantial weight. Rather than decorative descriptions, they function as semantic scaffolding that shapes how the model interprets emotion, camera movement, and character behavior. The difference between adequate output and professional-grade animation often comes down to prompt construction.
API Integration
The model accepts three primary inputs: image_url, audio_url, and prompt. The prompt parameter guides semantic interpretation of how the character should behave.
```javascript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "https://example.com/portrait.png",
    audio_url: "https://example.com/speech.mp3",
    prompt:
      "The camera slowly pushes in. She speaks thoughtfully, pausing mid-sentence with a slight smile.",
    resolution: "1080p",
    turbo_mode: false,
  },
});
```
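The subscribe call resolves once generation finishes. The sketch below shows one way to read the response; the field names (data.video.url, data.duration) are assumptions for illustration, so verify them against the model's output schema.

```javascript
// Field names below are assumptions for illustration; check the model's
// output schema for the exact response shape.
const videoUrl = result.data.video.url; // URL of the generated clip
const seconds = result.data.duration;   // generated duration in seconds
console.log(`Generated ${seconds}s of video: ${videoUrl}`);
```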
Input Requirements and Parameters
| Parameter | Type | Description |
|---|---|---|
| image_url | string (required) | Portrait image. Accepts: jpg, jpeg, png, webp, gif, avif |
| audio_url | string (required) | Speech/audio file. Accepts: mp3, ogg, wav, m4a, aac |
| prompt | string | Text guidance for character behavior and camera movement |
| resolution | enum | 720p (up to 60s audio, faster) or 1080p (up to 30s audio, default) |
| turbo_mode | boolean | Faster generation with slight quality tradeoff |
Billing is $0.16 per second of generated video. A 30-second 1080p video costs approximately $4.80. The API returns a duration field in the response for precise cost tracking.
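As a quick sketch, the duration field makes cost estimation a one-line calculation (the rate below is the $0.16-per-second figure quoted above):

```javascript
const PRICE_PER_SECOND = 0.16; // $0.16 per second of generated video

// Estimate cost from the duration reported in the API response.
function estimateCost(durationSeconds) {
  return durationSeconds * PRICE_PER_SECOND;
}

console.log(estimateCost(30)); // ~4.8, i.e. about $4.80 for a 30-second video
```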
The Fundamental Prompt Structure
Effective prompts follow a sequential structure mirroring how a director describes a scene. Consider your prompt a mini-screenplay rather than a static description.
The recommended formula: [Camera movement] + [Emotion/mood] + [Speaking state] + [Specific actions]
This structure works because the model processes your prompt left-to-right, establishing context before details. Leading with camera movement sets the visual framework. Emotion colors everything that follows. Speaking state helps synchronize lip movements. Specific actions provide the behavioral blueprint.
Example: "The camera slowly pushes in from a wide shot to a medium close-up. A woman sits at a cafe table, thoughtful and slightly melancholic, speaking softly to someone off-camera. She pauses mid-sentence, glances down at her coffee, then looks back up with a faint smile."
Camera Movement Vocabulary
Camera instructions substantially affect professional quality. The model supports sophisticated choreography but requires explicit instruction.
| Camera Type | Example Prompt Language | Best Use Case |
|---|---|---|
| Push-in | "The camera slowly dollies in from a medium shot to a close-up" | Dialogue, emotional moments |
| Static | "A static medium shot holds on the subject" | Talking heads, presentations |
| Orbital | "The camera orbits from left profile to three-quarter view" | Music videos, dynamic content |
| Angle shift | "The camera shifts from low angle to eye level" | Dramatic reveals |
Avoid vague instructions like "the camera moves around." Specify starting position, ending position, and movement quality (slow, smooth, handheld-style).
Emotional Direction and Progressions
The model understands nuanced emotional states. Describe emotions with specificity rather than generic terms.
Instead of "She looks happy," write: "She appears genuinely delighted, with bright eyes and a warm, natural smile that reaches her cheeks."
The model responds well to emotional progressions within a single prompt. Research on audio-driven portrait animation demonstrates that long-term motion dependency modeling produces more natural results.[2] Leverage this by describing transitions:
"A man starts with a neutral, focused expression while listening, then gradually softens into an empathetic smile as he begins speaking."
Speaking State and Lip-Sync
While OmniHuman 1.5 automatically synchronizes lip movements to audio, your prompt influences synchronization quality. High-impact speaking verbs:
- "talks directly to the camera"
- "speaks passionately while gesturing"
- "sings along with the music"
- "whispers conspiratorially"
- "delivers dialogue with dramatic pauses"
When audio includes singing, mention it explicitly. For silent moments, state: "She listens intently without speaking, responding with subtle nods."
Action Choreography
The model excels at multi-step action sequences when you provide clear progressions. Use sequence words like "First... then... finally" or "Initially... as the audio continues... by the end."
Example: "The character first looks directly at the camera with a serious expression while speaking. Then, midway through, she glances to her right as if noticing something. Finally, she returns her gaze to the camera with a knowing smile."
Avoid contradictory instructions. For emotional complexity, describe transitions: "She attempts to smile despite visible sadness in her eyes."
Resolution Strategy
Choose resolution based on your use case:
720p: Faster generation, supports up to 60 seconds of audio. Use for iteration, drafts, and longer-form content. Despite lower resolution, quality remains high.
1080p: Default resolution, limited to 30 seconds of audio. Use for final outputs where visual fidelity is paramount.
Enable turbo_mode: true for faster generation when iterating on prompts. The quality tradeoff is minimal for draft work.
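In practice this maps to two request configurations, sketched below with placeholder inputs: a fast 720p draft pass for prompt iteration and a 1080p pass for the final render.

```javascript
// Shared inputs (placeholder URLs).
const input = {
  image_url: "https://example.com/portrait.png",
  audio_url: "https://example.com/speech.mp3",
  prompt:
    "The camera slowly pushes in. She speaks thoughtfully, pausing mid-sentence with a slight smile.",
};

// Draft pass: 720p with turbo_mode for fast iteration (audio up to 60s).
const draft = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: { ...input, resolution: "720p", turbo_mode: true },
});

// Final pass: 1080p without turbo_mode for maximum fidelity (audio up to 30s).
const finalRender = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: { ...input, resolution: "1080p", turbo_mode: false },
});
```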
Multi-Character Scenes
OmniHuman 1.5 supports multi-character scenes. Establish clear spatial relationships and specify who performs which actions.
Example: "A woman sits in the foreground, speaking directly to the camera with animated gestures. Behind her, a man stands slightly out of focus, listening attentively and occasionally nodding."
Key principles: establish foreground/background hierarchy, specify which character speaks at any moment, and use spatial language (beside, behind, to the left of) for positioning.
Troubleshooting Common Issues
Character not matching audio emotion: Your prompt may conflict with audio content. Ensure emotional direction aligns with the audio's tone.
Unnatural lip movements: Add explicit speaking verbs to your prompt. Specify whether the character is speaking, singing, or listening.
Static or minimal movement: Prompts lacking action verbs produce static results. Include sequential behaviors and camera movement.
Generation timeout: For audio approaching duration limits, use the Queue API with webhooks rather than blocking requests.
Complete Prompt Examples
Professional dialogue: "The camera slowly pushes in from a medium shot to a medium close-up. A businesswoman sits at a conference table, initially serious and focused as she speaks to colleagues off-camera. Midway through, her expression softens slightly, showing confidence mixed with approachability."
Music video: "A static medium shot with slight handheld-style movement. The singer performs directly to camera with passionate intensity, expressions shifting dynamically with the music's emotional peaks. She sways gently to the rhythm."
Narrative: "The camera begins with a wide shot and gradually dollies in. A young man sits on a park bench, initially gazing into the distance with a contemplative expression. As he begins speaking, he turns toward the camera, his expression warming into a nostalgic smile."
Implementation Strategy
Start with the fundamental structure: camera + emotion + speaking + action. Generate at 720p to iterate quickly. Once satisfied with the prompt, switch to 1080p for final output.
Build a personal prompt library. Save prompts that produce excellent results for different scenarios. These become templates for future projects.
For production applications with longer generation times, use fal.queue.submit() with webhooks instead of blocking on fal.subscribe(). The Queue API documentation covers implementation patterns for handling asynchronous generation.
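A minimal sketch of that queue-based flow, assuming you host your own webhook endpoint (the URL below is a placeholder):

```javascript
import { fal } from "@fal-ai/client";

// Submit without blocking; fal calls the webhook when generation finishes.
const { request_id } = await fal.queue.submit("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "https://example.com/portrait.png",
    audio_url: "https://example.com/speech.mp3",
    prompt: "A static medium shot holds on the subject as she speaks calmly.",
    resolution: "720p",
  },
  webhookUrl: "https://your-app.example.com/webhooks/omnihuman", // placeholder endpoint
});

// Optionally check progress later using the request id.
const status = await fal.queue.status("fal-ai/bytedance/omnihuman/v1.5", {
  requestId: request_id,
  logs: true,
});
```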
References
1. Jiang, J., Zeng, W., Zheng, Z., Yang, J., Liang, C., Liao, W., Liang, H., Zhang, Y., & Gao, M. "OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation." arXiv, 2025. https://arxiv.org/abs/2508.19209
2. Jiang, J., Liang, C., Yang, J., Lin, G., Zhong, T., & Zheng, Y. "Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency." International Conference on Learning Representations (ICLR), 2025. https://arxiv.org/abs/2409.02634
