LTX Video 2.0 Pro transforms static images into professional video through structured prompts that describe motion, camera work, and environmental dynamics rather than what's already visible in your source image.
Prompting for Motion, Not Description
Image-to-video generation requires a fundamental shift in how you construct prompts. Unlike text-to-image models where you describe what you want to see, LTX Video 2.0 Pro interprets your prompt as instructions for temporal evolution. Your source image already contains the visual information. Your prompt directs what happens next.
This distinction separates professional results from artifacts. When you provide an image of a woman on a street, the model sees the woman and the street. Your prompt should specify what changes: how the camera moves, how light shifts, how elements within the frame respond to time passing. The underlying architecture achieves this through a transformer-based latent diffusion approach that integrates Video-VAE and denoising operations holistically, enabling full spatiotemporal self-attention across frames [1]. The Pro variant supports resolutions up to 2160p (4K), synchronized audio generation, frame rates of 25 or 50 FPS, and durations of 6, 8, or 10 seconds.
Quick Start
A minimal API call demonstrates the core pattern:
```typescript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
  input: {
    image_url: "https://your-image-url.jpg",
    prompt:
      "The camera slowly dollies in toward her face as city lights flicker behind her.",
    duration: 6,
    resolution: "1080p",
    fps: 25,
    generate_audio: true,
  },
});
```
Notice that the prompt describes motion and camera behavior, not the subject. The image_url must be publicly accessible or a base64 data URI. Supported formats include PNG, JPEG, WebP, AVIF, and HEIF. The output is fixed at 16:9 aspect ratio regardless of input dimensions, so prepare source images accordingly to avoid unexpected cropping.
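If your source image is a local file rather than a hosted URL, one option is to upload it through the fal client's storage helper before calling the endpoint. The sketch below is illustrative: the animateLocalImage helper name is made up for this example, and the assumption that the response exposes the rendered clip at data.video.url should be confirmed against the endpoint's schema.

```typescript
import { fal } from "@fal-ai/client";

// Illustrative helper: upload a local image, then animate it.
async function animateLocalImage(file: File, prompt: string): Promise<string> {
  // fal.storage.upload returns a publicly accessible URL for the uploaded file.
  const imageUrl = await fal.storage.upload(file);

  const result = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
    input: {
      image_url: imageUrl,
      prompt,
      duration: 6,
      resolution: "1080p",
      fps: 25,
      generate_audio: true,
    },
  });

  // Assumed response shape; most fal video endpoints return { video: { url } }.
  return (result.data as unknown as { video: { url: string } }).video.url;
}
```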
Three-Layer Prompt Architecture
Effective prompts for image-to-video generation follow a consistent structure that separates concerns across three distinct layers:
Subject Action Layer: Define what moves and how. "The lighthouse keeper walks along the rocky shore" establishes subject motion clearly.
Camera Movement Layer: Describe perspective shifts using cinematographic terminology:
- Dolly: Camera physically moves toward or away from subject
- Track: Camera moves laterally alongside subject movement
- Pan: Camera rotates horizontally on a fixed axis
- Tilt: Camera rotates vertically on a fixed axis
- Orbit: Camera moves in an arc around the subject
Environmental Dynamics Layer: Add atmospheric elements that enhance temporal realism. "Waves crash against rocks in the foreground while seabirds circle overhead."
A complete prompt combining all three layers:
"A lighthouse keeper walks along the rocky shore at sunset. The camera tracks alongside him with smooth lateral movement. Waves crash against the rocks in the foreground while seabirds circle overhead, and the golden light gradually intensifies as the sun dips toward the horizon."
Each layer provides specific guidance without redundancy. The model responds to this structure because it separates spatial information (already in your image) from temporal evolution (what you are prompting for).
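The structure lends itself to a small helper that keeps the layers separate in code. This is a sketch: the layer names are editorial conventions from this guide, not fields the API understands, and the endpoint still receives a single prompt string.

```typescript
// Sketch: compose a prompt from the three layers described above.
// The interface is illustrative; the API receives only the joined string.
interface PromptLayers {
  subjectAction: string;          // what moves and how
  cameraMovement: string;         // dolly, track, pan, tilt, or orbit phrasing
  environmentalDynamics: string;  // atmosphere, light, and background motion
}

function buildPrompt(layers: PromptLayers): string {
  return [
    layers.subjectAction,
    layers.cameraMovement,
    layers.environmentalDynamics,
  ].join(" ");
}

const fullPrompt = buildPrompt({
  subjectAction: "A lighthouse keeper walks along the rocky shore at sunset.",
  cameraMovement: "The camera tracks alongside him with smooth lateral movement.",
  environmentalDynamics:
    "Waves crash against the rocks in the foreground while seabirds circle overhead, and the golden light gradually intensifies as the sun dips toward the horizon.",
});
```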
Parameter Configuration
LTX Video 2.0 Pro offers configurable parameters that affect output quality and cost. The Pro variant prioritizes fidelity over speed. For rapid iteration during development, consider using LTX Video 2.0 Fast at roughly half the cost, then switch to Pro for final output.
| Parameter | Options | Notes |
|---|---|---|
| duration | 6, 8, or 10 seconds | Durations beyond 10 s require the Fast variant |
| resolution | 1080p, 1440p, 2160p | Cost scales with resolution |
| fps | 25, 50 | 50 FPS doubles frame count |
| generate_audio | true, false | Generates ambient sound matching visual cues |
The generate_audio parameter produces synchronized ambient soundscapes and effects that match on-screen motion. This is not music generation; expect environmental sounds like footsteps, wind, or traffic that correspond to visual elements in your prompt.
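In practice these parameters tend to split into a cheap draft preset and a full-quality final preset. The values below are one reasonable pairing under the iteration strategy described later, not required settings.

```typescript
// Illustrative presets: iterate cheaply, then render the final output at full quality.
const draftParams = {
  duration: 6,
  resolution: "1080p",
  fps: 25,
  generate_audio: false, // skip audio while tuning motion and camera work
} as const;

const finalParams = {
  duration: 10,
  resolution: "2160p",   // 4K; cost scales with resolution
  fps: 50,               // doubles the frame count
  generate_audio: true,
} as const;
```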
Troubleshooting Common Artifacts
When generation produces unsatisfactory results, the cause typically maps to a specific prompt or parameter issue:
| Artifact | Likely Cause | Fix |
|---|---|---|
| Frozen regions | Empty areas lack motion prompts | Add explicit motion: "clouds drift across the sky" |
| Jittery motion | Conflicting movement directions | Simplify to single, clear motion vector |
| Subject distortion | Physics-violating prompt | Ground motion in realistic constraints |
| Inconsistent lighting | Vague environmental description | Specify light source and behavior |
Redundant Description: The most common mistake. "A red car parked on a street" wastes tokens when your image shows exactly that. Instead: "The camera pans across the car's profile as reflections shift across the polished surface."
Vague Camera Language: "The camera moves around" provides insufficient direction. Specify movement type: dolly, track, pan, tilt, orbit. Precision in cinematographic terminology yields predictable results.
Motion Intensity Control
Research on temporal consistency in video diffusion demonstrates that conditioning signals strongly influence frame coherence [2]. Use qualitative descriptors to calibrate animation intensity:
- Restrained: "subtle," "gentle," "slight," "imperceptible"
- Moderate: "steady," "gradual," "measured," "smooth"
- Dynamic: "dramatic," "rapid," "sweeping," "vigorous"
For 8-10 second durations, structure prompts with sequential phases: "Initially, the camera holds steady. After a moment, it begins a slow zoom. As the zoom continues, the subject turns their head toward the light source." This distributes motion across the timeline rather than front-loading all movement.
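A phased prompt is still a single string; one way to keep the phases readable in code is to hold them as an ordered list and join them, as in this small sketch.

```typescript
// Sketch: build a phased prompt for an 8-10 second clip.
// The model reads this as ordinary prose, not as timestamps.
const phases = [
  "Initially, the camera holds steady on the subject.",
  "After a moment, it begins a slow, gradual zoom toward her face.",
  "As the zoom continues, she turns her head toward the window light.",
];

const phasedPrompt = phases.join(" ");
```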
Prompt Examples by Content Type
Portrait Animation: "The subject's eyes slowly shift to look directly at the camera. A gentle breeze causes strands of hair to move softly across her face. Natural light from a window creates subtle shadows that shift imperceptibly."
Restraint matters with portraits. Small, natural movements enhance the composition without overwhelming it.
Product Visualization: "The camera orbits 180 degrees around the product in a smooth arc. Studio lights create evolving highlights and reflections across the surface. A subtle depth-of-field effect keeps the product sharp while the background gently blurs."
Architectural Walkthrough: "The camera glides forward through the entrance in a steady dolly movement. Sunlight streams through windows, casting dynamic shadows as the perspective shifts. Dust particles float visibly in the light beams."
Negative Space Animation: When your source image contains empty areas such as sky, water, or plain backgrounds, explicitly prompt for motion: "The empty sky fills with drifting clouds" or "Ripples spread across the still water surface." This prevents static regions from appearing frozen against animated elements.
Iteration Strategy
Start with baseline prompts using default parameters (6 seconds, 1080p, 25 FPS), then iterate systematically. Change one variable at a time: first motion description, then camera work, then environmental elements. This identifies which components most influence your specific image type.
For production workflows, use the Fast variant for initial iterations at lower cost, then render the final output with Pro. Reuse the seed value returned in the result to reproduce a specific output. For alternative approaches to image-to-video generation, explore Kling 2.1 Master or Luma Dream Machine.
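A minimal sketch of that draft-then-final flow, assuming the Fast variant is exposed at a separate endpoint id and that the response includes a seed field (both are assumptions to verify against the fal model catalog):

```typescript
import { fal } from "@fal-ai/client";

const input = {
  image_url: "https://your-image-url.jpg",
  prompt:
    "The camera slowly dollies in toward her face as city lights flicker behind her.",
  duration: 6,
  resolution: "1080p",
};

// Hypothetical Fast endpoint id; confirm the actual path in the model catalog.
const draft = await fal.subscribe("fal-ai/ltx-2/fast/image-to-video", { input });

// Reuse the returned seed (assumed response field) so the Pro render
// reproduces the draft's motion at final quality.
const seed = (draft.data as { seed?: number }).seed;

const final = await fal.subscribe("fal-ai/ltx-2/image-to-video", {
  input: { ...input, resolution: "2160p", generate_audio: true, seed },
});
```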
References
1. HaCohen, Y., et al. "LTX-Video: Realtime Video Latent Diffusion." arXiv preprint arXiv:2501.00103, 2024. https://arxiv.org/abs/2501.00103
2. Blattmann, A., et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127, 2023. https://arxiv.org/abs/2311.15127
