Omnihuman 1.5 Prompt Guide: Character-Driven Video Generation


Omnihuman 1.5 interprets prompts semantically through multimodal reasoning, not simple audio synchronization. Structure prompts as mini-screenplays: camera direction, emotional arc, speaking state, and sequential actions. Use 720p for faster generation (up to 60s audio) or 1080p for quality (up to 30s audio).

Last updated: 1/11/2026
Edited by: Zachary Roth
Read time: 7 minutes

From Audio Sync to Semantic Understanding

Avatar generation has long operated on a mechanical principle: detect audio, move lips. Omnihuman 1.5 abandons this reactive approach entirely. Built on a dual-system cognitive architecture inspired by psychological research on deliberative and intuitive thinking, the model synthesizes semantic guidance from text prompts, audio input, and source images simultaneously [1].

This architectural shift means your prompts carry substantial weight. Rather than decorative descriptions, they function as semantic scaffolding that shapes how the model interprets emotion, camera movement, and character behavior. The difference between adequate output and professional-grade animation often comes down to prompt construction.

API Integration

The model accepts three primary inputs: image_url, audio_url, and prompt. The prompt parameter guides semantic interpretation of how the character should behave.

import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "https://example.com/portrait.png",
    audio_url: "https://example.com/speech.mp3",
    prompt:
      "The camera slowly pushes in. She speaks thoughtfully, pausing mid-sentence with a slight smile.",
    resolution: "1080p",
    turbo_mode: false,
  },
});

console.log(result.data); // output payload: generated video and metadata


Input Requirements and Parameters

Parameter   | Type              | Description
image_url   | string (required) | Portrait image. Accepts: jpg, jpeg, png, webp, gif, avif
audio_url   | string (required) | Speech/audio file. Accepts: mp3, ogg, wav, m4a, aac
prompt      | string            | Text guidance for character behavior and camera movement
resolution  | enum              | 720p (up to 60s audio, faster) or 1080p (up to 30s audio, default)
turbo_mode  | boolean           | Faster generation with slight quality tradeoff

Billing is $0.16 per second of generated video. A 30-second 1080p video costs approximately $4.80. The API returns a duration field in the response for precise cost tracking.
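
A minimal cost-tracking sketch, assuming the duration is exposed on the output payload (e.g. result.data.duration in the JS client); verify the exact field path for your client version:

const PRICE_PER_SECOND = 0.16;

// Estimate spend in USD from the duration reported in the response (seconds).
function estimateCostUSD(durationSeconds: number): number {
  return Math.round(durationSeconds * PRICE_PER_SECOND * 100) / 100;
}

// A 30-second clip: estimateCostUSD(30) === 4.8
// With the subscribe example above: estimateCostUSD(result.data.duration)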

The Fundamental Prompt Structure

Effective prompts follow a sequential structure mirroring how a director describes a scene. Consider your prompt a mini-screenplay rather than a static description.

The recommended formula: [Camera movement] + [Emotion/mood] + [Speaking state] + [Specific actions]

This structure works because the model processes your prompt left-to-right, establishing context before details. Leading with camera movement sets the visual framework. Emotion colors everything that follows. Speaking state helps synchronize lip movements. Specific actions provide the behavioral blueprint.

Example: "The camera slowly pushes in from a wide shot to a medium close-up. A woman sits at a cafe table, thoughtful and slightly melancholic, speaking softly to someone off-camera. She pauses mid-sentence, glances down at her coffee, then looks back up with a faint smile."

Camera Movement Vocabulary

Camera instructions have an outsized effect on how professional the output looks. The model supports sophisticated camera choreography, but only when you describe it explicitly.

Camera Type | Example Prompt Language                                         | Best Use Case
Push-in     | "The camera slowly dollies in from a medium shot to a close-up" | Dialogue, emotional moments
Static      | "A static medium shot holds on the subject"                     | Talking heads, presentations
Orbital     | "The camera orbits from left profile to three-quarter view"     | Music videos, dynamic content
Angle shift | "The camera shifts from low angle to eye level"                 | Dramatic reveals

Avoid vague instructions like "the camera moves around." Specify starting position, ending position, and movement quality (slow, smooth, handheld-style).
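
One way to keep camera language explicit and reusable is to store the table's phrasing as named templates and feed them into a prompt builder like the sketch above; the object below is purely illustrative:

// Reusable camera phrases drawn from the table above.
const cameraMoves = {
  pushIn: "The camera slowly dollies in from a medium shot to a close-up.",
  static: "A static medium shot holds on the subject.",
  orbital: "The camera orbits from left profile to three-quarter view.",
  angleShift: "The camera shifts from low angle to eye level.",
} as const;

// e.g. buildPrompt({ camera: cameraMoves.pushIn, emotion: ..., speaking: ..., actions: ... })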

Emotional Direction and Progressions

The model understands nuanced emotional states. Describe emotions with specificity rather than generic terms.

Instead of "She looks happy," write: "She appears genuinely delighted, with bright eyes and a warm, natural smile that reaches her cheeks."

The model responds well to emotional progressions within a single prompt. Research on audio-driven portrait animation demonstrates that long-term motion dependency modeling produces more natural results [2]. Leverage this by describing transitions:

"A man starts with a neutral, focused expression while listening, then gradually softens into an empathetic smile as he begins speaking."

Speaking State and Lip-Sync

While Omnihuman 1.5 automatically synchronizes lip movements to audio, your prompt influences synchronization quality. High-impact speaking verbs:

  • "talks directly to the camera"
  • "speaks passionately while gesturing"
  • "sings along with the music"
  • "whispers conspiratorially"
  • "delivers dialogue with dramatic pauses"

When audio includes singing, mention it explicitly. For silent moments, state: "She listens intently without speaking, responding with subtle nods."

Action Choreography

The model excels at multi-step action sequences when you provide clear progressions. Use sequence words like "First... then... finally" or "Initially... as the audio continues... by the end."

Example: "The character first looks directly at the camera with a serious expression while speaking. Then, midway through, she glances to her right as if noticing something. Finally, she returns her gaze to the camera with a knowing smile."

Avoid contradictory instructions. For emotional complexity, describe transitions: "She attempts to smile despite visible sadness in her eyes."

Resolution Strategy

Choose resolution based on your use case:

720p: Faster generation, supports up to 60 seconds of audio. Use for iteration, drafts, and longer-form content. Despite lower resolution, quality remains high.

1080p: Default resolution, limited to 30 seconds of audio. Use for final outputs where visual fidelity is paramount.

Enable turbo_mode: true for faster generation when iterating on prompts. The quality tradeoff is minimal for draft work.
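
One practical pattern is to keep two presets, a fast draft pass and a final render, and swap them once the prompt is locked. A sketch, reusing the same parameters shown in the API example above:

import { fal } from "@fal-ai/client";

// Draft pass: 720p + turbo for quick prompt iteration (audio up to 60 seconds).
const draftSettings = { resolution: "720p", turbo_mode: true } as const;

// Final pass: 1080p, turbo off, for delivery (audio limited to 30 seconds).
const finalSettings = { resolution: "1080p", turbo_mode: false } as const;

const result = await fal.subscribe("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "https://example.com/portrait.png",
    audio_url: "https://example.com/speech.mp3",
    prompt: "A static medium shot. She speaks calmly and directly to the camera.",
    ...draftSettings, // swap in finalSettings for the final render
  },
});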

Multi-Character Scenes

Omnihuman 1.5 supports multi-character scenes. Establish clear spatial relationships and specify who performs which actions.

Example: "A woman sits in the foreground, speaking directly to the camera with animated gestures. Behind her, a man stands slightly out of focus, listening attentively and occasionally nodding."

Key principles: establish foreground/background hierarchy, specify which character speaks at any moment, and use spatial language (beside, behind, to the left of) for positioning.

Troubleshooting Common Issues

Character not matching audio emotion: Your prompt may conflict with audio content. Ensure emotional direction aligns with the audio's tone.

Unnatural lip movements: Add explicit speaking verbs to your prompt. Specify whether the character is speaking, singing, or listening.

Static or minimal movement: Prompts lacking action verbs produce static results. Include sequential behaviors and camera movement.

Generation timeout: For audio approaching duration limits, use the Queue API with webhooks rather than blocking requests.

Complete Prompt Examples

Professional dialogue: "The camera slowly pushes in from a medium shot to a medium close-up. A businesswoman sits at a conference table, initially serious and focused as she speaks to colleagues off-camera. Midway through, her expression softens slightly, showing confidence mixed with approachability."

Music video: "A static medium shot with slight handheld-style movement. The singer performs directly to camera with passionate intensity, expressions shifting dynamically with the music's emotional peaks. She sways gently to the rhythm."

Narrative: "The camera begins with a wide shot and gradually dollies in. A young man sits on a park bench, initially gazing into the distance with a contemplative expression. As he begins speaking, he turns toward the camera, his expression warming into a nostalgic smile."

Implementation Strategy

Start with the fundamental structure: camera + emotion + speaking + action. Generate at 720p to iterate quickly. Once satisfied with the prompt, switch to 1080p for final output.

Build a personal prompt library. Save prompts that produce excellent results for different scenarios. These become templates for future projects.

For production applications with longer generation times, use fal.queue.submit() with webhooks instead of blocking on fal.subscribe(). The Queue API documentation covers implementation patterns for handling asynchronous generation.
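
A minimal sketch of that non-blocking pattern (the webhookUrl option and response shape follow the fal JS client's queue documentation; verify against your client version):

import { fal } from "@fal-ai/client";

// Submit without blocking; fal calls the webhook when generation finishes.
const submission = await fal.queue.submit("fal-ai/bytedance/omnihuman/v1.5", {
  input: {
    image_url: "https://example.com/portrait.png",
    audio_url: "https://example.com/speech.mp3",
    prompt: "A static medium shot. She speaks calmly and directly to the camera.",
    resolution: "720p",
  },
  webhookUrl: "https://example.com/webhooks/omnihuman", // your endpoint (illustrative URL)
});

// submission.request_id can also be used to poll fal.queue.status / fal.queue.result.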


References

  1. Jiang, J., Zeng, W., Zheng, Z., Yang, J., Liang, C., Liao, W., Liang, H., Zhang, Y., & Gao, M. "OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation." arXiv, 2025. https://arxiv.org/abs/2508.19209

  2. Jiang, J., Liang, C., Yang, J., Lin, G., Zhong, T., & Zheng, Y. "Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency." International Conference on Learning Representations (ICLR), 2025. https://arxiv.org/abs/2409.02634

About the Author
Zachary Roth
A generative media engineer with a focus on growth, Zach has deep expertise in building RAG architecture for complex content systems.
