Hunyuan Avatar Image to Video
Hunyuan Avatar - High-Fidelity Audio-Driven Human Animation
Transform audio and images into high-quality AI avatar videos with Hunyuan Avatar, an advanced audio-driven human animation model designed for creating dynamic, emotion-controllable, and multi-character dialogue videos.
Overview
HunyuanAvatar is a high-fidelity, audio-driven human animation model for multiple characters. It excels at generating highly dynamic videos while preserving character consistency, achieves precise emotion alignment between characters and audio, and enables multi-character audio-driven animation through an innovative multimodal diffusion transformer (MM-DiT) architecture.
Key Capabilities
Create production-ready avatar videos with:
Character Consistency Preservation
- Generate dynamic videos while maintaining strong character consistency
- Character image injection module eliminates condition mismatch between training and inference
- Preserves fine-grained facial characteristics across different poses and expressions
Audio-Driven Animation
- High-fidelity audio-driven human animation capabilities
- Audio Emotion Module (AEM) extracts and transfers emotional cues from reference images
- Face-Aware Audio Adapter (FAA) enables independent audio injection for multi-character scenarios
Multi-Character Support
- Generate multi-character dialogue videos from single inputs
- Independent audio injection via cross-attention for multiple characters
- Realistic avatars in dynamic, immersive scenarios
Getting Started
First, install the fal.ai client library:
```bash
# Using npm
npm install --save @fal-ai/client

# Using pip
pip install fal-client
```
Set up authentication:
```javascript
import { fal } from "@fal-ai/client";

fal.config({ credentials: "YOUR_FAL_KEY" });
```
Generate your first avatar video:
```javascript
const result = await fal.subscribe("fal-ai/hunyuan-avatar", {
  input: {
    audio_url: "https://your-audio-file.com/audio.wav",
    image_url: "https://your-image.com/person.jpg"
  }
});

console.log(result.data.video.url);
```
API Parameters
Required Parameters
- audio_url: The URL of the audio file (supported formats: mp3, ogg, wav, m4a, aac)
- image_url: The URL of the reference image (supported formats: jpg, jpeg, png, webp, gif, avif)
Optional Parameters
- text: Text prompt describing the scene (default: "A cat is singing.")
- num_frames: Number of video frames to generate at 25 FPS (default: 129, roughly 5.2 seconds; see the duration sketch after this list)
- num_inference_steps: Number of inference steps for sampling (default: 30)
- turbo_mode: Enable faster processing (default: true)
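Because the model renders at a fixed 25 FPS, num_frames controls the video length: the default of 129 frames corresponds to about 5.2 seconds. A minimal sketch for deriving num_frames from a target duration; the helper is hypothetical and not part of the fal.ai API:
```javascript
// Hypothetical helper (not part of the fal.ai client API):
// convert a target duration in seconds to a frame count at 25 FPS.
const FPS = 25;

const framesForDuration = (seconds) => Math.round(seconds * FPS);

framesForDuration(5); // 125 frames; the default 129 ≈ 5.16 seconds
```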
Input Example
```json
{
  "audio_url": "https://v3.fal.media/files/koala/80RpP2FOhXZUV3NRKUWZu_2.WAV",
  "image_url": "https://fal.media/files/tiger/Y8EgvVqxORBCqWC1OlX3D_3c4c8bbe7f3941a2aea93e278ba14803.jpg",
  "text": "A professional speaking confidently",
  "num_frames": 129,
  "num_inference_steps": 30,
  "turbo_mode": true
}
```
File Handling
Hunyuan Avatar supports multiple input methods:
URL Input
```javascript
input: {
  audio_url: "https://publicly-accessible-audio.com/file.wav",
  image_url: "https://publicly-accessible-image.com/person.jpg"
}
```
File Upload via fal Storage
```javascript
const audioFile = new File([audioData], "audio.wav", { type: "audio/wav" });
const imageFile = new File([imageData], "image.jpg", { type: "image/jpeg" });

const audioUrl = await fal.storage.upload(audioFile);
const imageUrl = await fal.storage.upload(imageFile);

const result = await fal.subscribe("fal-ai/hunyuan-avatar", {
  input: { audio_url: audioUrl, image_url: imageUrl }
});
```
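In Node.js (20+, where File is available as a global), the audioData and imageData buffers above can be read straight from disk. A minimal sketch; the file paths are placeholders:
```javascript
// Read local media into buffers for the File constructor above.
// Assumes Node 20+ (global File); the paths are placeholders.
import { readFile } from "node:fs/promises";

const audioData = await readFile("./assets/audio.wav");
const imageData = await readFile("./assets/person.jpg");
```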
Queue Management
For production applications, use the queue API:
```javascript
// Submit request
const { request_id } = await fal.queue.submit("fal-ai/hunyuan-avatar", {
  input: {
    audio_url: "https://your-audio-file.com/audio.wav",
    image_url: "https://your-image.com/person.jpg"
  },
  webhookUrl: "https://optional.webhook.url/for/results"
});

// Check status
const status = await fal.queue.status("fal-ai/hunyuan-avatar", {
  requestId: request_id,
  logs: true
});

// Get result when complete
const result = await fal.queue.result("fal-ai/hunyuan-avatar", {
  requestId: request_id
});
```
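If you are not using a webhook, poll the status endpoint until the request completes. A minimal sketch, assuming the queue reports a "COMPLETED" status on success; verify the exact status strings against docs.fal.ai for your client version:
```javascript
// Poll the queue until the request completes, then fetch the result.
// The "COMPLETED" status string and 10s interval are assumptions to
// verify against the current fal.ai queue documentation.
async function waitForResult(requestId, intervalMs = 10_000) {
  for (;;) {
    const { status } = await fal.queue.status("fal-ai/hunyuan-avatar", {
      requestId,
    });
    if (status === "COMPLETED") break;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return fal.queue.result("fal-ai/hunyuan-avatar", { requestId });
}

const finished = await waitForResult(request_id);
```
Given the roughly 8-minute generation time, a 10-second polling interval keeps request volume low without adding noticeable latency.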
Output Format
```json
{
  "video": {
    "url": "https://v3.fal.media/files/monkey/3ODbdqHHQL3SvgRXEJXQ-_hunava_8333d613-d4e3-42ff-be36-1e97775621ba_audio.mp4",
    "content_type": "video/mp4",
    "file_name": "output_with_audio.mp4",
    "file_size": 1646349
  }
}
```
Best Practices for Optimal Results
Audio Quality Optimization
- Use clear, high-quality audio files for better lip-sync results
- Supported audio formats: mp3, ogg, wav, m4a, aac
- Ensure the audio length matches the desired video duration (num_frames ÷ 25 FPS)
Image Quality Optimization
- Provide high-resolution reference images showing clear facial features
- Use well-lit images with the subject facing the camera
- Supported image formats: jpg, jpeg, png, webp, gif, avif
Technical Implementation
- Implement proper error handling for API responses (see the sketch after this list)
- Monitor processing time (approximately 8 minutes per generation)
- Handle rate limits appropriately in production environments
- Use webhooks for long-running requests
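A minimal error-handling sketch around fal.subscribe; the retry count and backoff delay are illustrative assumptions, not client-library features:
```javascript
// Wrap a generation call with basic error handling and one retry.
// The retry count and 30s backoff are illustrative assumptions.
async function generateAvatar(input, retries = 1) {
  try {
    return await fal.subscribe("fal-ai/hunyuan-avatar", { input });
  } catch (err) {
    if (retries > 0) {
      console.warn("Generation failed, retrying:", err?.message ?? err);
      await new Promise((resolve) => setTimeout(resolve, 30_000));
      return generateAvatar(input, retries - 1);
    }
    throw err; // surface the error once retries are exhausted
  }
}
```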
Technical Specifications
Model Architecture:
- Base: Multimodal Diffusion Transformer (MM-DiT)
- Innovations: Character Image Injection Module, Audio Emotion Module (AEM), Face-Aware Audio Adapter (FAA)
- Processing Time: ~8 minutes average
- Frame Rate: 25 FPS
Key Innovations:
- Character image injection module for consistency
- Audio Emotion Module for emotion alignment
- Face-Aware Audio Adapter for multi-character scenarios
Pricing and Usage
- Cost: $1.40 per 5-second video
- Processing Time: approximately 8 minutes per generation
- Commercial Use: Generated content can be used commercially
- Billing: Pay only for successful generations
Applications
HunyuanAvatar supports various downstream tasks:
- E-commerce product demonstrations
- Online streaming and content creation
- Social media video production
- Video content creation and editing
- Multi-character dialogue videos
- Talking avatar videos
Support and Resources
Get help and learn more:
- Technical Documentation: docs.fal.ai
- Model Information: GitHub Repository
- Research Paper: arXiv:2505.20156
- Support: [email protected]
Model Variants
Related Hunyuan models available:
- fal-ai/hunyuan-video: Text-to-video generation
- fal-ai/hunyuan-custom: Custom video generation with identity consistency
- fal-ai/hunyuan3d: Image-to-3D generation
- fal-ai/hunyuan-video-lora-training: LoRA training for Hunyuan Video
Start building with Hunyuan Avatar today to create dynamic, audio-driven avatar videos. Sign up for a free API key at fal.ai to begin experimenting with the service.