Optimal Aurora prompts describe visuals, not speech. Structure them as: resolution, framing, background, lighting, then body constraints. Keep backgrounds simple and avoid hand gestures.
Directing Avatar Videos Through Text
Creatify Aurora generates high-fidelity avatar videos from a single image and audio file, producing synchronized lip movements, natural expressions, and full-body gestures. Creatify describes the model as a diffusion transformer with audio-driven temporal alignment, separating identity preservation from audio synchronization so each component can optimize independently[^1].
The prompt field operates differently than in text-to-image models. Rather than describing what the avatar should say (that information comes from your audio file), prompts describe the visual context: lighting, framing, background, and presentation style. Effective prompting requires understanding this distinction.
API Integration
The model accepts three inputs: image_url (your avatar), audio_url (speech or singing), and an optional prompt guiding visual generation. A minimal integration:
```typescript
import * as fal from "@fal-ai/serverless-client";

const result = await fal.subscribe("fal-ai/creatify/aurora", {
  input: {
    image_url: "https://example.com/avatar.png",   // single reference frame
    audio_url: "https://example.com/speech.mp3",   // duration sets output length
    prompt: "4K studio interview, medium close-up, soft key-light",
  },
});

console.log(result.video.url); // download URL for the generated video
```
The response returns a `video` object containing `url`, `file_name`, and `content_type` fields. Common errors include a `422` status for invalid input parameters (unsupported file formats, malformed URLs) and `401` for authentication failures. Validate that your image and audio URLs are publicly accessible before submitting requests.
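Error handling depends on the client library version; as an illustration, the sketch below assumes failed requests surface as thrown exceptions carrying an HTTP `status` field (an assumption, not a documented contract):

```typescript
import * as fal from "@fal-ai/serverless-client";

// Illustrative error handling; the shape of thrown errors (a `status` field)
// is an assumption about the client, not a documented contract.
try {
  const result = await fal.subscribe("fal-ai/creatify/aurora", {
    input: {
      image_url: "https://example.com/avatar.png",
      audio_url: "https://example.com/speech.mp3",
    },
  });
  console.log(result.video.url);
} catch (err: any) {
  if (err?.status === 422) {
    console.error("Invalid input: check file formats and URL accessibility.");
  } else if (err?.status === 401) {
    console.error("Authentication failed: check your fal credentials.");
  } else {
    throw err;
  }
}
```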
Input Requirements
| Input | Accepted Formats | Notes |
|---|---|---|
| Image | jpg, jpeg, png, webp, gif, avif | Single reference frame for avatar |
| Audio | mp3, ogg, wav, m4a, aac | Duration determines output length |
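One way to catch `422` errors before they cost a round trip is a pre-flight check of file extensions against the table above. A minimal sketch (the helper is illustrative, and an extension check cannot catch every invalid file):

```typescript
const IMAGE_EXTS = ["jpg", "jpeg", "png", "webp", "gif", "avif"];
const AUDIO_EXTS = ["mp3", "ogg", "wav", "m4a", "aac"];

// Illustrative helper: checks a URL's file extension against an allow-list.
function hasAcceptedExtension(url: string, accepted: string[]): boolean {
  const ext = new URL(url).pathname.split(".").pop()?.toLowerCase() ?? "";
  return accepted.includes(ext);
}

hasAcceptedExtension("https://example.com/avatar.png", IMAGE_EXTS);  // true
hasAcceptedExtension("https://example.com/speech.flac", AUDIO_EXTS); // false
```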
Pricing
Video generation costs $0.10 per second at 480p or $0.14 per second at 720p. Seconds are rounded upward for billing, so a 9.4-second generation costs the same as 10 seconds. Use 480p for iteration and reserve 720p for final renders.
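The billing arithmetic is simple enough to sanity-check in code. A quick sketch with the rates above hard-coded:

```typescript
// Estimate billing for one generation: seconds round up, rate depends on resolution.
function estimateCostUSD(durationSeconds: number, resolution: "480p" | "720p"): number {
  const centsPerSecond = resolution === "480p" ? 10 : 14; // $0.10 vs $0.14
  return (Math.ceil(durationSeconds) * centsPerSecond) / 100;
}

estimateCostUSD(9.4, "480p"); // billed as 10s -> $1.00
estimateCostUSD(9.4, "720p"); // billed as 10s -> $1.40
```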
Prompt Architecture
Prompts that produce professional results follow a consistent structure; a sketch assembling these components in code follows the list:
- Quality specifications establishing technical standards (resolution, sharpness)
- Shot type and framing defining avatar positioning
- Background and environment description
- Lighting direction and characteristics
- Subject behavior constraints limiting unnecessary movement
- Camera details including focus and stability
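Keeping this structure explicit in code helps prompts stay consistent across a project. A minimal sketch; the `PromptSpec` shape and `buildPrompt` helper are illustrative, not part of any SDK:

```typescript
// Illustrative prompt builder mirroring the six-part structure above.
interface PromptSpec {
  quality: string;    // resolution, sharpness
  framing: string;    // shot type, crop
  background: string; // environment
  lighting: string;   // direction, consistency
  behavior: string;   // movement constraints
  camera: string;     // focus, stability
}

function buildPrompt(spec: PromptSpec): string {
  return (
    [spec.quality, spec.framing, spec.background, spec.lighting, spec.behavior, spec.camera]
      .join(". ") + "."
  );
}

const prompt = buildPrompt({
  quality: "4K studio interview",
  framing: "medium close-up (shoulders-up crop)",
  background: "Solid light-grey backdrop",
  lighting: "uniform soft key-light, no lighting change",
  behavior: "Hands remain below frame, body perfectly still",
  camera: "Locked camera, ultra-sharp focus",
});
```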
Consider the difference between these prompts:
Weak: "Professional video, good lighting, person talking"
Strong: "4K studio interview, medium close-up (shoulders-up crop). Solid light-grey backdrop, uniform soft key-light, no lighting change. Presenter faces lens, steady eye-contact. Hands remain below frame, body perfectly still. Ultra-sharp."
The stronger prompt succeeds because each phrase provides actionable direction. "Solid light-grey backdrop" implies professional studio conditions. "Hands remain below frame" prevents the model from attempting hand generation, where avatar models commonly struggle[^2]. "No lighting change" prevents temporal inconsistencies.
Effective Prompt Patterns
Professional Presentations
Corporate and educational content requires stability and controlled delivery:
"High-definition corporate headshot, tight framing from mid-chest up. Pure white backdrop, soft beauty lighting from 45-degree angle. Subject maintains direct eye contact, shoulders square to camera. No hand gestures, minimal body sway. Crisp focus, professional polish."
This pattern constrains the model to focus on perfecting facial animation rather than generating body movement.
Conversational Content
Creator-style videos benefit from prompts that permit natural movement while maintaining clarity:
"Natural vlogger aesthetic, medium close-up framing. Soft-focus background suggesting home office environment, warm golden hour lighting quality. Relaxed posture, occasional subtle head tilts during emphasis. Authentic energy while maintaining clear facial features."
Marketing Videos
Sales and promotional content balances energy with precision:
"Premium product launch aesthetic, shoulders-up framing. Gradient background transitioning from deep blue to lighter tone. Professional key light with subtle rim lighting for depth. Confident direct address to camera. Broadcast television quality."
Common Failures and Corrections
Complex backgrounds: Prompts requesting detailed environments ("bustling office with visible coworkers") produce artifacts that distract from the avatar. Specify "solid," "soft-focus," or simple gradient backgrounds instead.
Missing stability cues: Without explicit instructions like "body perfectly still" or "locked position," the model may generate subtle movements that appear unnatural. Always include stillness constraints for professional content.
Lighting inconsistency: Prompts that omit lighting stability phrases can produce flickering or shifting shadows. Include "no lighting change" or "uniform illumination" to maintain temporal consistency.
Hand gesture requests: Current audio-driven avatar models handle facial animation far more reliably than full-body movement[^2]. Prompts asking for "expressive hand gestures" often produce unnatural results. Specify "hands below frame" or "hands out of shot."
Vague quality terms: "Good quality" provides no actionable guidance. Replace with specific terms: "4K resolution," "ultra-sharp focus," "broadcast quality."
Audio-visual mismatch: Energetic audio paired with a prompt describing "calm, meditative presence" creates jarring disconnects. Match prompt tone to audio content.
Workflow Optimization
Begin with a straightforward prompt and generate your first video at 480p. Evaluate the output against three criteria: Does the lip sync match the audio? Does the visual style match your prompt description? Are there any temporal artifacts like flickering or unnatural movements?
If lip synchronization feels imprecise, simplify your prompt to reduce competing directives. Complex prompts that specify many visual elements can sometimes interfere with audio alignment. If visual output diverges from your description, add more specific terms. "Soft lighting" is less actionable than "soft key-light from 45-degree angle, no harsh shadows."
Document what works for different avatar images and audio types. Some avatars respond better to specific prompt structures. High-contrast images with clear facial features typically produce more stable results than low-resolution or partially obscured faces.
Generation time varies based on queue depth and system load. For deadline-sensitive work, build in buffer time and generate critical assets during off-peak hours when possible. Once you have validated your prompt at 480p, regenerate final deliverables at 720p.
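Put together, the draft-then-final workflow might look like the sketch below. The `resolution` input name is an assumption for illustration; consult the model's schema for the actual parameter:

```typescript
import * as fal from "@fal-ai/serverless-client";

// Illustrative two-pass workflow: iterate cheaply at 480p, finalize at 720p.
// The `resolution` parameter name is an assumption, not a documented input.
async function generateDraftThenFinal(imageUrl: string, audioUrl: string, prompt: string) {
  const draft = await fal.subscribe("fal-ai/creatify/aurora", {
    input: { image_url: imageUrl, audio_url: audioUrl, prompt, resolution: "480p" },
  });
  console.log("Review draft:", draft.video.url);

  // After manual review of lip sync, visual style, and temporal artifacts:
  const final = await fal.subscribe("fal-ai/creatify/aurora", {
    input: { image_url: imageUrl, audio_url: audioUrl, prompt, resolution: "720p" },
  });
  return final.video.url;
}
```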
Production Checklist
Before submitting a generation request, verify your prompt addresses each of the following (a rough automated check is sketched after the list):
- Resolution and quality specifications
- Exact framing and shot type
- Background simplicity (solid colors or soft gradients)
- Lighting direction and consistency
- Body movement constraints
- Camera stability instructions
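A crude keyword lint can flag categories a prompt never mentions. The keyword lists here are illustrative guesses, and the heuristic is no substitute for reading the prompt:

```typescript
// Rough heuristic lint: flag checklist categories with no matching keywords.
const CHECKLIST: Record<string, string[]> = {
  quality: ["4k", "hd", "sharp", "broadcast", "high-definition"],
  framing: ["close-up", "headshot", "shoulders-up", "mid-chest", "framing"],
  background: ["backdrop", "background", "gradient", "solid"],
  lighting: ["light", "lighting", "illumination"],
  movement: ["still", "no hand", "below frame", "minimal", "locked position"],
  camera: ["focus", "steady", "locked", "stable"],
};

function lintPrompt(prompt: string): string[] {
  const p = prompt.toLowerCase();
  return Object.entries(CHECKLIST)
    .filter(([, keywords]) => !keywords.some((k) => p.includes(k)))
    .map(([category]) => category);
}

lintPrompt("4K studio interview, medium close-up, soft key-light");
// => ["background", "movement", "camera"] - categories the prompt never mentions
```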
The distinction between amateur and professional results rarely stems from the underlying technology. It emerges from precision in communication. Aurora interprets your prompt as a production brief. The more specific your direction, the more closely the output matches your intent.
References
[^1]: Creatify. "Introducing the Aurora Model: Audio-Driven Ultra-Realistic Rendering of Reactive Avatars." Creatify, 2024. https://creatify.ai/introducing-aurora
[^2]: Zhang, H., et al. "LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis." arXiv:2411.16748, 2024. https://arxiv.org/abs/2411.16748