Hunyuan Avatar Image to Video
Hunyuan Avatar - High-Fidelity Audio-Driven Human Animation
Transform audio and images into high-quality AI avatar videos with Hunyuan Avatar, an advanced audio-driven human animation model designed for creating dynamic, emotion-controllable, and multi-character dialogue videos.
Overview
HunyuanAvatar is a high-fidelity, audio-driven human animation model for multiple characters. It generates highly dynamic videos while preserving character consistency, achieves precise emotion alignment between characters and audio, and enables multi-character audio-driven animation through its multimodal diffusion transformer (MM-DiT) architecture.
Key Capabilities
Create production-ready avatar videos with:
Character Consistency Preservation
- Generate dynamic videos while maintaining strong character consistency
- Character image injection module eliminates condition mismatch between training and inference
- Preserves fine facial detail across different poses and expressions
Audio-Driven Animation
- High-fidelity audio-driven human animation capabilities
- Audio Emotion Module (AEM) extracts emotional cues from a reference image and transfers them to the generated video
- Face-Aware Audio Adapter (FAA) enables independent audio injection for multi-character scenarios
Multi-Character Support
- Generate multi-character dialogue videos from a single set of inputs
- Independent audio injection via cross-attention for multiple characters
- Realistic avatars in dynamic, immersive scenarios
Getting Started
First, install the fal.ai client library:
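The snippets in this readme use the Python client (the JavaScript client follows the same pattern):

```bash
pip install fal-client
```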
Set up authentication:
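The client reads your API key from the FAL_KEY environment variable:

```python
import os

# Equivalent to `export FAL_KEY="your-api-key"` in your shell.
os.environ["FAL_KEY"] = "your-api-key"
```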
Generate your first avatar video:
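A minimal sketch: the endpoint ID fal-ai/hunyuan-avatar and the video.url field in the response are assumptions here, so match them to what the model page actually shows:

```python
import fal_client


def on_queue_update(update):
    # Print log lines as the request moves through the queue.
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])


result = fal_client.subscribe(
    "fal-ai/hunyuan-avatar",  # assumed endpoint ID; check the model page
    arguments={
        "audio_url": "https://example.com/speech.mp3",
        "image_url": "https://example.com/avatar.png",
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)

# Assumed response shape: fal video endpoints typically return {"video": {"url": ...}}.
print(result["video"]["url"])
```

Note that subscribe blocks until the generation finishes, which for this model can take around 8 minutes.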
API Parameters
Required Parameters
- audio_url: The URL of the audio file (supported formats: mp3, ogg, wav, m4a, aac)
- image_url: The URL of the reference image (supported formats: jpg, jpeg, png, webp, gif, avif)
Optional Parameters
- text: Text prompt describing the scene (default: "A cat is singing.")
- num_frames: Number of video frames to generate at 25 FPS (default: 129; see the duration sketch after this list)
- num_inference_steps: Number of inference steps for sampling (default: 30)
- turbo_mode: Enable faster processing (default: true)
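Because output is fixed at 25 FPS, num_frames determines the clip length: the default of 129 frames yields roughly 5.2 seconds. A quick sketch for computing the frame count for a target duration:

```python
FPS = 25  # Hunyuan Avatar renders at a fixed 25 frames per second


def frames_for_duration(seconds: float) -> int:
    """Frame count for a target duration, rounded to the nearest frame."""
    return round(seconds * FPS)


print(frames_for_duration(5.0))  # 125
print(129 / FPS)                 # 5.16 -> the default num_frames gives ~5.2 s
```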
Input Example
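A representative payload built from the parameters above (the URLs and prompt are placeholders):

```json
{
  "audio_url": "https://example.com/speech.mp3",
  "image_url": "https://example.com/avatar.png",
  "text": "A woman is speaking at a podium.",
  "num_frames": 129,
  "num_inference_steps": 30,
  "turbo_mode": true
}
```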
File Handling
Hunyuan Avatar accepts inputs in two ways:
- URL Input: pass any publicly accessible URL directly as audio_url or image_url
- File Upload via fal Storage: upload a local file and use the returned hosted URL, as shown below
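For local files, a sketch using the Python client's upload helper, which stores the file on fal storage and returns a hosted URL you can pass as audio_url or image_url:

```python
import fal_client

# Upload local assets to fal storage; each call returns a hosted URL.
audio_url = fal_client.upload_file("speech.mp3")
image_url = fal_client.upload_file("avatar.png")

result = fal_client.subscribe(
    "fal-ai/hunyuan-avatar",  # assumed endpoint ID; check the model page
    arguments={"audio_url": audio_url, "image_url": image_url},
)
```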
Queue Management
For production applications, use the queue API:
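A sketch of the submit/status/result flow, again assuming the fal-ai/hunyuan-avatar endpoint ID. For generations this long, passing a webhook_url lets your server be notified instead of polling:

```python
import time

import fal_client

# Submit without blocking the caller.
handler = fal_client.submit(
    "fal-ai/hunyuan-avatar",  # assumed endpoint ID; check the model page
    arguments={
        "audio_url": "https://example.com/speech.mp3",
        "image_url": "https://example.com/avatar.png",
    },
    # webhook_url="https://your-server.example.com/fal-webhook",
)
request_id = handler.request_id

# Poll until the job completes; generations take roughly 8 minutes.
while True:
    status = fal_client.status("fal-ai/hunyuan-avatar", request_id, with_logs=True)
    if isinstance(status, fal_client.Completed):
        break
    time.sleep(10)

result = fal_client.result("fal-ai/hunyuan-avatar", request_id)
```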
Output Format
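The exact schema lives in the model's API reference; the field names below are an assumption based on other fal video endpoints. A successful generation returns the video as a hosted file:

```json
{
  "video": {
    "url": "https://fal.media/files/<request-id>/output.mp4",
    "content_type": "video/mp4",
    "file_name": "output.mp4"
  }
}
```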
Best Practices for Optimal Results
Audio Quality Optimization
- Use clear, high-quality audio files for better lip-sync results
- Supported audio formats: mp3, ogg, wav, m4a, aac
- Ensure the audio length matches the requested video duration (choose num_frames accordingly)
Image Quality Optimization
- Provide high-resolution reference images showing clear facial features
- Use well-lit images with the subject facing the camera
- Supported image formats: jpg, jpeg, png, webp, gif, avif
Technical Implementation
- Implement proper error handling for API responses
- Monitor processing time (approximately 8 minutes per generation)
- Handle rate limits appropriately in production environments
- Use webhooks for long-running requests
Technical Specifications
Model Architecture:
- Base: Multimodal Diffusion Transformer (MM-DiT)
- Innovations: Character Image Injection Module, Audio Emotion Module (AEM), Face-Aware Audio Adapter (FAA)
- Processing Time: ~8 minutes average
- Frame Rate: 25 FPS
Key Innovations:
- Character image injection module for consistency
- Audio Emotion Module for emotion alignment
- Face-Aware Audio Adapter for multi-character scenarios
Pricing and Usage
- Cost: $1.40 per 5-second video
- Processing Time: approximately 8 minutes per generation
- Commercial Use: Generated content can be used commercially
- Billing: Pay only for successful generations
Applications
HunyuanAvatar supports various downstream tasks:
- E-commerce product demonstrations
- Online streaming and content creation
- Social media video production
- Video content creation and editing
- Multi-character dialogue videos
- Talking avatar videos
Support and Resources
Get help and learn more:
- Technical Documentation: docs.fal.ai
- Model Information: GitHub Repository
- Research Paper: arXiv:2505.20156
- Support: support@fal.ai
Model Variants
Related Hunyuan models available:
- fal-ai/hunyuan-video: Text-to-video generation
- fal-ai/hunyuan-custom: Custom video generation with identity consistency
- fal-ai/hunyuan3d: Image-to-3D generation
- fal-ai/hunyuan-video-lora-training: LoRA training for Hunyuan Video
Start building with Hunyuan Avatar today to create dynamic, audio-driven avatar videos. Sign up for a free API key at fal.ai to begin experimenting with the service.