LTX Video 2.0 Pro transforms static images into professional video through structured prompts that describe motion, camera work, and environmental dynamics rather than what's already visible in your source image.
Prompting for Motion, Not Description
Image-to-video generation requires a fundamental shift in how you construct prompts. Unlike text-to-image models where you describe what you want to see, LTX Video 2.0 Pro interprets your prompt as instructions for temporal evolution. Your source image already contains the visual information. Your prompt directs what happens next.
This distinction separates professional results from artifacts. When you provide an image of a woman on a street, the model sees the woman and the street. Your prompt should specify what changes: how the camera moves, how light shifts, how elements within the frame respond to time passing. The underlying architecture achieves this through a transformer-based latent diffusion approach that integrates Video-VAE and denoising operations holistically, enabling full spatiotemporal self-attention across frames [1]. The Pro variant supports resolutions up to 2160p (4K), synchronized audio generation, frame rates of 25 or 50 FPS, and durations of 6, 8, or 10 seconds.
Quick Start
A minimal API call demonstrates the core pattern:
```typescript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
  input: {
    image_url: "https://your-image-url.jpg",
    prompt:
      "The camera slowly dollies in toward her face as city lights flicker behind her.",
    duration: 6,
    resolution: "1080p",
    fps: 25,
    generate_audio: true,
  },
});
```
Notice that the prompt describes motion and camera behavior, not the subject. The image_url must be publicly accessible or a base64 data URI. Supported formats include PNG, JPEG, WebP, AVIF, and HEIF. The output is fixed at 16:9 aspect ratio regardless of input dimensions, so prepare source images accordingly to avoid unexpected cropping.
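If your source image is a local file rather than a hosted URL, one option is to upload it through the fal client's storage helper before calling the endpoint. The sketch below is illustrative: the animateLocalImage helper name is made up for this example, and the assumption that the response exposes the rendered clip at data.video.url should be confirmed against the endpoint's schema.

```typescript
import { fal } from "@fal-ai/client";

// Illustrative helper: upload a local image, then animate it.
async function animateLocalImage(file: File, prompt: string): Promise<string> {
  // fal.storage.upload returns a publicly accessible URL for the uploaded file.
  const imageUrl = await fal.storage.upload(file);

  const result = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
    input: {
      image_url: imageUrl,
      prompt,
      duration: 6,
      resolution: "1080p",
      fps: 25,
      generate_audio: true,
    },
  });

  // Assumed response shape; most fal video endpoints return { video: { url } }.
  return (result.data as unknown as { video: { url: string } }).video.url;
}
```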
Three-Layer Prompt Architecture
Effective prompts for image-to-video generation follow a consistent structure that separates concerns across three distinct layers:
Subject Action Layer: Define what moves and how. "The lighthouse keeper walks along the rocky shore" establishes subject motion clearly.
Camera Movement Layer: Describe perspective shifts using cinematographic terminology:
- Dolly: Camera physically moves toward or away from subject
- Track: Camera moves laterally alongside subject movement
- Pan: Camera rotates horizontally on a fixed axis
- Tilt: Camera rotates vertically on a fixed axis
- Orbit: Camera moves in an arc around the subject
Environmental Dynamics Layer: Add atmospheric elements that enhance temporal realism. "Waves crash against rocks in the foreground while seabirds circle overhead."
A complete prompt combining all three layers:
"A lighthouse keeper walks along the rocky shore at sunset. The camera tracks alongside him with smooth lateral movement. Waves crash against the rocks in the foreground while seabirds circle overhead, and the golden light gradually intensifies as the sun dips toward the horizon."
Each layer provides specific guidance without redundancy. The model responds to this structure because it separates spatial information (already in your image) from temporal evolution (what you are prompting for).
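The structure lends itself to a small helper that keeps the layers separate in code. This is a sketch: the layer names are editorial conventions from this guide, not fields the API understands, and the endpoint still receives a single prompt string.

```typescript
// Sketch: compose a prompt from the three layers described above.
// The interface is illustrative; the API receives only the joined string.
interface PromptLayers {
  subjectAction: string;          // what moves and how
  cameraMovement: string;         // dolly, track, pan, tilt, or orbit phrasing
  environmentalDynamics: string;  // atmosphere, light, and background motion
}

function buildPrompt(layers: PromptLayers): string {
  return [
    layers.subjectAction,
    layers.cameraMovement,
    layers.environmentalDynamics,
  ].join(" ");
}

const fullPrompt = buildPrompt({
  subjectAction: "A lighthouse keeper walks along the rocky shore at sunset.",
  cameraMovement: "The camera tracks alongside him with smooth lateral movement.",
  environmentalDynamics:
    "Waves crash against the rocks in the foreground while seabirds circle overhead, and the golden light gradually intensifies as the sun dips toward the horizon.",
});
```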
Parameter Configuration
LTX Video 2.0 Pro offers configurable parameters that affect output quality and cost. The Pro variant prioritizes fidelity over speed. For rapid iteration during development, consider using LTX Video 2.0 Fast at roughly half the cost, then switch to Pro for final output.
| Parameter | Options | Notes |
|---|---|---|
| duration | 6, 8, or 10 seconds | Durations beyond 10 s require the Fast variant |
| resolution | 1080p, 1440p, 2160p | Cost scales with resolution |
| fps | 25, 50 | 50 FPS doubles frame count |
| generate_audio | true, false | Generates ambient sound matching visual cues |
The generate_audio parameter produces synchronized ambient soundscapes and effects that match on-screen motion. This is not music generation; expect environmental sounds like footsteps, wind, or traffic that correspond to visual elements in your prompt.
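In practice these parameters tend to split into a cheap draft preset and a full-quality final preset. The values below are one reasonable pairing under the iteration strategy described later, not required settings.

```typescript
// Illustrative presets: iterate cheaply, then render the final output at full quality.
const draftParams = {
  duration: 6,
  resolution: "1080p",
  fps: 25,
  generate_audio: false, // skip audio while tuning motion and camera work
} as const;

const finalParams = {
  duration: 10,
  resolution: "2160p",   // 4K; cost scales with resolution
  fps: 50,               // doubles the frame count
  generate_audio: true,
} as const;
```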
Troubleshooting Common Artifacts
When generation produces unsatisfactory results, the cause typically maps to a specific prompt or parameter issue:
| Artifact | Likely Cause | Fix |
|---|---|---|
| Frozen regions | Empty areas lack motion prompts | Add explicit motion: "clouds drift across the sky" |
| Jittery motion | Conflicting movement directions | Simplify to single, clear motion vector |
| Subject distortion | Physics-violating prompt | Ground motion in realistic constraints |
| Inconsistent lighting | Vague environmental description | Specify light source and behavior |
Redundant Description: The most common mistake. "A red car parked on a street" wastes tokens when your image shows exactly that. Instead: "The camera pans across the car's profile as reflections shift across the polished surface."
Vague Camera Language: "The camera moves around" provides insufficient direction. Specify movement type: dolly, track, pan, tilt, orbit. Precision in cinematographic terminology yields predictable results.
Motion Intensity Control
Research on temporal consistency in video diffusion demonstrates that conditioning signals strongly influence frame coherence [2]. Use qualitative descriptors to calibrate animation intensity:
- Restrained: "subtle," "gentle," "slight," "imperceptible"
- Moderate: "steady," "gradual," "measured," "smooth"
- Dynamic: "dramatic," "rapid," "sweeping," "vigorous"
For 8-10 second durations, structure prompts with sequential phases: "Initially, the camera holds steady. After a moment, it begins a slow zoom. As the zoom continues, the subject turns their head toward the light source." This distributes motion across the timeline rather than front-loading all movement.
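A phased prompt is still a single string; one way to keep the phases readable in code is to hold them as an ordered list and join them, as in this small sketch.

```typescript
// Sketch: build a phased prompt for an 8-10 second clip.
// The model reads this as ordinary prose, not as timestamps.
const phases = [
  "Initially, the camera holds steady on the subject.",
  "After a moment, it begins a slow, gradual zoom toward her face.",
  "As the zoom continues, she turns her head toward the window light.",
];

const phasedPrompt = phases.join(" ");
```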
Prompt Examples by Content Type
Portrait Animation: "The subject's eyes slowly shift to look directly at the camera. A gentle breeze causes strands of hair to move softly across her face. Natural light from a window creates subtle shadows that shift imperceptibly."
Restraint matters with portraits. Small, natural movements enhance the composition without overwhelming it.
Product Visualization: "The camera orbits 180 degrees around the product in a smooth arc. Studio lights create evolving highlights and reflections across the surface. A subtle depth-of-field effect keeps the product sharp while the background gently blurs."
Architectural Walkthrough: "The camera glides forward through the entrance in a steady dolly movement. Sunlight streams through windows, casting dynamic shadows as the perspective shifts. Dust particles float visibly in the light beams."
Negative Space Animation: When your source image contains empty areas such as sky, water, or plain backgrounds, explicitly prompt for motion: "The empty sky fills with drifting clouds" or "Ripples spread across the still water surface." This prevents static regions from appearing frozen against animated elements.
Iteration Strategy
Start with baseline prompts using default parameters (6 seconds, 1080p, 25 FPS), then iterate systematically. Change one variable at a time: first motion description, then camera work, then environmental elements. This identifies which components most influence your specific image type.
For production workflows, use the Fast variant for initial iterations at lower cost, then render the final output with Pro. Reuse the seed value returned in the result to reproduce a specific output. For alternative approaches to image-to-video generation, explore Kling 2.1 Master or Luma Dream Machine.
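A minimal sketch of that draft-then-final flow, assuming the Fast variant is exposed at a separate endpoint id and that the response includes a seed field (both are assumptions to verify against the fal model catalog):

```typescript
import { fal } from "@fal-ai/client";

const input = {
  image_url: "https://your-image-url.jpg",
  prompt:
    "The camera slowly dollies in toward her face as city lights flicker behind her.",
  duration: 6,
  resolution: "1080p",
};

// Hypothetical Fast endpoint id; confirm the actual path in the model catalog.
const draft = await fal.subscribe("fal-ai/ltx-2/fast/image-to-video", { input });

// Reuse the returned seed (assumed response field) so the Pro render
// reproduces the draft's motion at final quality.
const seed = (draft.data as { seed?: number }).seed;

const final = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
  input: { ...input, resolution: "2160p", generate_audio: true, seed },
});
```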
References
1. HaCohen, Y., et al. "LTX-Video: Realtime Video Latent Diffusion." arXiv preprint arXiv:2501.00103, 2024. https://arxiv.org/abs/2501.00103
2. Blattmann, A., et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127, 2023. https://arxiv.org/abs/2311.15127
