Hunyuan Avatar Image to Video

fal-ai/hunyuan-avatar
HunyuanAvatar is a High-Fidelity Audio-Driven Human Animation model for Multiple Characters.

Hunyuan Avatar - High-Fidelity Audio-Driven Human Animation

Transform audio and images into high-quality AI avatar videos with Hunyuan Avatar, an advanced audio-driven human animation model designed for creating dynamic, emotion-controllable, and multi-character dialogue videos.

Overview

HunyuanAvatar is a High-Fidelity Audio-Driven Human Animation model for Multiple Characters. The model excels at generating highly dynamic videos while preserving character consistency, achieving precise emotion alignment between characters and audio, and enabling multi-character audio-driven animation through innovative multimodal diffusion transformer (MM-DiT) architecture.

Key Capabilities

Create production-ready avatar videos with:

Character Consistency Preservation

  • Generate dynamic videos while maintaining strong character consistency
  • Character image injection module eliminates condition mismatch between training and inference
  • Fine-tune facial characteristics across different poses and expressions

Audio-Driven Animation

  • High-fidelity audio-driven human animation capabilities
  • Audio Emotion Module (AEM) extracts and transfers emotional cues from reference images
  • Face-Aware Audio Adapter (FAA) enables independent audio injection for multi-character scenarios

Multi-Character Support

  • Generate multi-character dialogue videos from single inputs
  • Independent audio injection via cross-attention for multiple characters
  • Realistic avatars in dynamic, immersive scenarios

Getting Started

First, install the fal.ai client library:
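
Assuming the Python client, which is published on PyPI as fal-client:

```shell
pip install fal-client
```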


Set up authentication:
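
The client reads your API key from the FAL_KEY environment variable (replace the placeholder with your own key):

```shell
export FAL_KEY="YOUR_API_KEY"
```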


Generate your first avatar video:
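
A minimal sketch using the Python client's subscribe call, which submits the request and blocks until the generation completes. The media URLs below are placeholders; replace them with your own publicly reachable files.

```python
# Placeholder inputs -- swap in your own audio and reference image.
ARGUMENTS = {
    "audio_url": "https://example.com/speech.wav",
    "image_url": "https://example.com/avatar.png",
    "text": "A person speaking to the camera.",
    "num_frames": 129,  # ~5 seconds of video at 25 FPS
}

def generate(arguments):
    """Submit a generation and block until the video is ready."""
    import fal_client  # pip install fal-client; reads FAL_KEY from the environment
    result = fal_client.subscribe("fal-ai/hunyuan-avatar", arguments=arguments)
    return result["video"]["url"]

if __name__ == "__main__":
    print(generate(ARGUMENTS))
```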


API Parameters

Required Parameters

  • audio_url: The URL of the audio file (supported formats: mp3, ogg, wav, m4a, aac)
  • image_url: The URL of the reference image (supported formats: jpg, jpeg, png, webp, gif, avif)

Optional Parameters

  • text: Text prompt describing the scene (default: "A cat is singing.")
  • num_frames: Number of video frames to generate at 25 FPS (default: 129, roughly 5 seconds of video)
  • num_inference_steps: Number of inference steps for sampling (default: 30)
  • turbo_mode: Enable faster processing (default: true)

Input Example
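
A complete request payload exercising every documented parameter (the URLs are placeholders):

```python
# Full payload: two required fields plus all optional fields at their defaults.
payload = {
    "audio_url": "https://example.com/speech.wav",  # required
    "image_url": "https://example.com/avatar.png",  # required
    "text": "A woman is speaking at a podium.",     # optional scene prompt
    "num_frames": 129,          # optional; ~5 s of video at 25 FPS
    "num_inference_steps": 30,  # optional sampling steps
    "turbo_mode": True,         # optional faster processing
}
```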

File Handling

Hunyuan Avatar supports multiple input methods:

URL Input

Publicly accessible URLs can be passed directly as audio_url and image_url; no upload step is needed.

File Upload via fal Storage
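
One way to handle both input methods is a small helper that passes URLs through unchanged and uploads local files via the client's upload_file, which returns a hosted URL (a sketch; the helper name is our own):

```python
def resolve_media(source):
    """Return a URL for `source`: URLs pass through unchanged;
    local file paths are uploaded to fal storage and replaced
    with the hosted URL."""
    if source.startswith(("http://", "https://")):
        return source
    import fal_client  # pip install fal-client; requires FAL_KEY
    return fal_client.upload_file(source)
```

For example, resolve_media("./avatar.png") uploads the file and returns its storage URL, while resolve_media("https://example.com/avatar.png") is returned as-is.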

Queue Management

For production applications, use the queue API:
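
A polling sketch assuming the Python client's submit/status/result helpers and its Completed status class; adjust to your client version:

```python
import time

def queue_generation(arguments, poll_seconds=30):
    """Submit to the fal queue and poll until the result is ready."""
    import fal_client  # pip install fal-client; requires FAL_KEY
    handler = fal_client.submit("fal-ai/hunyuan-avatar", arguments=arguments)
    request_id = handler.request_id  # persist this to resume polling later
    while True:
        status = fal_client.status(
            "fal-ai/hunyuan-avatar", request_id, with_logs=True
        )
        if isinstance(status, fal_client.Completed):
            return fal_client.result("fal-ai/hunyuan-avatar", request_id)
        time.sleep(poll_seconds)
```

If your client version supports it, passing a webhook URL at submit time avoids polling entirely, which suits these multi-minute generations.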


Output Format
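
The response contains the generated video file. A representative shape is sketched below; the field names follow fal's common file schema and are assumptions, not guaranteed:

```python
# Illustrative response shape (values are placeholders).
example_result = {
    "video": {
        "url": "https://fal.media/files/example/output.mp4",
        "content_type": "video/mp4",
        "file_name": "output.mp4",
        "file_size": 4404019,
    }
}
```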

Best Practices for Optimal Results

Audio Quality Optimization

  • Use clear, high-quality audio files for better lip-sync results
  • Supported audio formats: mp3, ogg, wav, m4a, aac
  • Ensure audio length matches desired video duration

Image Quality Optimization

  • Provide high-resolution reference images showing clear facial features
  • Use well-lit images with the subject facing the camera
  • Supported image formats: jpg, jpeg, png, webp, gif, avif

Technical Implementation

  • Implement proper error handling for API responses
  • Monitor processing time (approximately 8 minutes per generation)
  • Handle rate limits appropriately in production environments
  • Use webhooks for long-running requests
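
The error-handling advice above can be sketched as a retry wrapper with exponential backoff. Exception types vary by client version, so this catches broadly; a production version should narrow the except clause:

```python
import time

def safe_generate(arguments, retries=3):
    """Call subscribe() with simple exponential backoff on failure."""
    import fal_client  # pip install fal-client; requires FAL_KEY
    for attempt in range(retries):
        try:
            return fal_client.subscribe(
                "fal-ai/hunyuan-avatar", arguments=arguments
            )
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt)  # 1 s, 2 s, 4 s between attempts
```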

Technical Specifications

Model Architecture:

  • Base: Multimodal Diffusion Transformer (MM-DiT)
  • Innovations: Character Image Injection Module, Audio Emotion Module (AEM), Face-Aware Audio Adapter (FAA)
  • Processing Time: ~8 minutes average
  • Frame Rate: 25 FPS

Key Innovations:

  • Character image injection module for consistency
  • Audio Emotion Module for emotion alignment
  • Face-Aware Audio Adapter for multi-character scenarios

Pricing and Usage

  • Cost: $1.40 per 5-second video
  • Processing Time: approximately 8 minutes per generation
  • Commercial Use: generated content can be used commercially
  • Billing: pay only for successful generations

Applications

HunyuanAvatar supports various downstream tasks:

  • E-commerce product demonstrations
  • Online streaming and content creation
  • Social media video production
  • Video content creation and editing
  • Multi-character dialogue videos
  • Talking avatar videos

Support and Resources

For help and further reading, see the fal.ai documentation and the model page for fal-ai/hunyuan-avatar.

Model Variants

Related Hunyuan models available:

  • fal-ai/hunyuan-video: Text-to-video generation
  • fal-ai/hunyuan-custom: Custom video generation with identity consistency
  • fal-ai/hunyuan3d: Image-to-3D generation
  • fal-ai/hunyuan-video-lora-training: LoRA training for Hunyuan Video

Start building with Hunyuan Avatar today to create dynamic, audio-driven avatar videos. Sign up for a free API key at fal.ai to begin experimenting with the service.