LTX-2 Video Trainer Prompt Guide

Train custom LTX-2 video models by treating captions as instructions rather than labels. Use distinctive trigger phrases, detailed cinematographic descriptions, and properly configured frame counts to teach specific styles, motion patterns, and visual transformations. Training costs $0.0048 per step, with 2000-step runs completing in minutes on fal.

Last updated: 1/14/2026
Edited by: Zachary Roth
Read time: 8 minutes

Training Video Models Through Prompts

The LTX-2 Video Trainer operates on a fundamentally different paradigm than standard text-to-video generation. Rather than synthesizing content from textual descriptions, you are instructing the model to recognize and reproduce specific transformations, styles, or visual patterns through video-to-video learning. This architectural distinction shapes every aspect of prompt engineering for custom model training.

LTX-2 is built on a Diffusion Transformer architecture with approximately 19 billion parameters, designed to generate synchronized audiovisual content [1]. When training a custom model, your captions function as teaching material rather than generation prompts. A sparse caption like "person walking" provides minimal training signal. A detailed description such as "medium shot of person walking confidently through urban street, cinematic depth of field, golden hour lighting, smooth tracking camera movement" gives the model rich, structured information to learn from.

Dataset Requirements

Your training data URL must point to a zip archive containing media files with corresponding caption text files. The following constraints apply due to LTX-2's VAE architecture:

| Requirement | Specification |
| --- | --- |
| Content type | Exclusively videos OR exclusively images (never mixed) |
| Spatial dimensions | Width and height must be multiples of 32 |
| Frame count | Must satisfy frames % 8 == 1 (valid: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121) |
| Caption pairing | Each media file requires a .txt file with matching filename |
| Minimum dataset size | 15-20 samples for style transfer; more for complex subjects |

Memory usage scales with both spatial and temporal dimensions. For detailed, high-quality output, use larger spatial dimensions (768x448) with fewer frames (89). For motion-focused training, use smaller spatial dimensions (512x512) with more frames (121).
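These constraints are easy to check locally before zipping and uploading. Below is a minimal validation sketch, assuming OpenCV (cv2) is available for reading clip metadata and that the dataset sits in a local folder of .mp4 files with matching .txt captions; the paths are illustrative, not part of the fal API.

```python
# Dataset sanity check before zipping and uploading.
from pathlib import Path
import cv2

DATASET_DIR = Path("dataset")  # hypothetical local folder

def validate_clip(video_path: Path) -> list[str]:
    problems = []
    cap = cv2.VideoCapture(str(video_path))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()

    # Spatial dimensions must be multiples of 32 for the VAE.
    if width % 32 or height % 32:
        problems.append(f"{video_path.name}: {width}x{height} not multiples of 32")
    # Frame count must satisfy frames % 8 == 1.
    if frames % 8 != 1:
        problems.append(f"{video_path.name}: {frames} frames violates frames % 8 == 1")
    # Every clip needs a caption file with the same basename.
    if not video_path.with_suffix(".txt").exists():
        problems.append(f"{video_path.name}: missing caption file")
    return problems

issues = [p for clip in sorted(DATASET_DIR.glob("*.mp4")) for p in validate_clip(clip)]
print("\n".join(issues) or "dataset looks consistent")
```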

Trigger Phrase Architecture

The trigger phrase parameter is prepended to every caption during training, establishing a unique activation key for your model's learned behaviors. The same phrase must appear in your inference prompts to activate the learned patterns.

Effective trigger phrases share several characteristics:

  • Distinctiveness with semantic meaning ("cinematic drone footage" rather than arbitrary terms)
  • Alignment with training objectives ("premium product showcase" for product videos)
  • Consistency between training and inference

The trigger phrase creates a namespace for your custom model's knowledge, preventing conflicts with base model training.
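Conceptually, the prepending behavior works like the sketch below; this is illustrative only, not the trainer's actual code, and the example trigger phrase is arbitrary.

```python
# Illustrative only: how a trigger phrase namespaces captions during training.
TRIGGER = "cinematic drone footage"  # example trigger phrase

def training_caption(raw_caption: str, trigger: str = TRIGGER) -> str:
    # The trainer prepends the trigger phrase to every caption,
    # binding the learned behavior to this activation key.
    return f"{trigger}, {raw_caption}"

def inference_prompt(description: str, trigger: str = TRIGGER) -> str:
    # The same phrase must lead the prompt at inference time.
    return f"{trigger}, {description}"

print(training_caption("aerial sweep over coastline at dawn, slow forward motion"))
print(inference_prompt("aerial sweep over a mountain ridge, morning haze"))
```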

Caption Strategies by Training Objective

Caption structure should align precisely with your training goals.

Style transfer training prioritizes visual aesthetics over scene content. When training on anime-style videos, emphasize "vibrant color palette, cel-shaded rendering, dynamic action lines, exaggerated motion blur" rather than describing narrative content.

Motion pattern training requires explicit description of camera and subject dynamics. Descriptors like "smooth dolly-in shot, subject remains centered, gradual background defocus" communicate temporal and spatial relationships the model should internalize.

Subject consistency training balances identity markers with contextual variety: "The red sports car accelerating on highway, low angle, sunset lighting" paired with "The red sports car parked in urban setting, eye-level shot, overcast day" teaches subject preservation across contexts.
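As a concrete sketch of that balance, a subject-consistency dataset might pair captions like the ones below; the filenames and the two extra captions are illustrative, keeping the identity markers fixed while varying context and camera.

```python
# Hypothetical caption set for subject-consistency training:
# identity markers stay constant, context and camera vary.
captions = {
    "car_highway.txt": "The red sports car accelerating on highway, low angle, sunset lighting",
    "car_city.txt": "The red sports car parked in urban setting, eye-level shot, overcast day",
    "car_garage.txt": "The red sports car idling in dim parking garage, slow orbit shot, cool fluorescent lighting",
    "car_rain.txt": "The red sports car driving through rain at night, rear tracking shot, neon reflections",
}
```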

Parameter Configuration and Costs

Training runs asynchronously on fal infrastructure. Submit jobs via the queue API with webhooks for completion notification rather than blocking on results. Training costs $0.0048 per step; a default 2000-step run costs $9.60 and typically completes in minutes.
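A minimal submission sketch using the fal Python client follows; the endpoint ID, argument names, and webhook URL are assumptions drawn from this guide, so check the LTX-2 trainer model page for the exact schema.

```python
# Sketch: queue a training job with a webhook for completion notification.
import fal_client

handle = fal_client.submit(
    "fal-ai/ltx-2/trainer",            # hypothetical endpoint ID
    arguments={
        "training_data_url": "https://example.com/dataset.zip",  # zip of media + .txt captions
        "trigger_phrase": "cinematic drone footage",
        "steps": 2000,                  # 2000 steps * $0.0048/step = $9.60
        "rank": 32,
    },
    webhook_url="https://example.com/webhooks/ltx2-training",  # completion callback
)
print("queued request:", handle.request_id)
```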

| Parameter | Range | Tradeoffs |
| --- | --- | --- |
| Training steps | 2000-4000 typical | Higher values improve fidelity but increase cost and overfitting risk |
| Rank | 16, 32, 64, 128 | Higher rank increases learning capacity and memory usage; 32 suits most style transfers, 64-128 for complex subjects |
| Resolution | low, medium, high | Higher resolution requires higher-quality source material and increases training time |
| Frame count | 49-121 (frames % 8 == 1) | Match to content: 49-65 for quick actions, 89-97 for cinematic motion |

Rank controls LoRA adaptation capacity. Low-Rank Adaptation reduces trainable parameters by learning pairs of rank-decomposition matrices while freezing original weights [2]. Higher rank values increase overfitting risk with limited training data.
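To make the capacity tradeoff concrete, here is the standard LoRA parameter count for a single frozen weight matrix; the matrix size used below is illustrative, not LTX-2's actual projection dimensions.

```python
# Trainable parameters LoRA adds for one frozen weight matrix W of shape (d, k):
# two low-rank factors A (d x r) and B (r x k), i.e. d*r + r*k extra parameters.
def lora_params(d: int, k: int, rank: int) -> int:
    return d * rank + rank * k

d = k = 4096  # illustrative transformer projection size
for rank in (16, 32, 64, 128):
    added = lora_params(d, k, rank)
    print(f"rank {rank:>3}: {added:,} trainable params vs {d * k:,} frozen")
```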

Practical Prompt Examples

Product visualization:

Trigger phrase: "premium product showcase"
Caption: "Premium product showcase of smartphone rotating on pedestal, studio lighting with soft key light from left, subtle rim lighting, clean white background, smooth 360-degree rotation, shallow depth of field"

Cinematic style:

Trigger phrase: "cinematic establishing shot"
Caption: "Cinematic establishing shot of mountain landscape at dawn, slow push-in camera movement, atmospheric haze in valleys, warm color grading with teal shadows, anamorphic lens characteristics, film grain texture"

Character animation:

Trigger phrase: "character animation style"
Caption: "Character animation style showing figure walking cycle, bouncy exaggerated motion, clear silhouette, anticipation and follow-through animation principles, vibrant saturated colors, smooth 24fps motion"

Diagnosing Training Issues

Training produces three distinct failure modes with observable symptoms:

Overfitting manifests as literal reproduction of training content, inability to handle novel subjects, or artifacts when prompts deviate from training captions. Reduce training steps or rank.

Underfitting produces weak style application, inconsistent outputs, or models that ignore learned patterns. Increase training steps, verify caption quality, or expand the training dataset.

Dataset problems appear as training failures or erratic outputs. Verify that spatial dimensions are multiples of 32, frame counts satisfy the modulo constraint, and all media files have corresponding caption files.

Successful training shows consistent style application across novel subjects, motion characteristics matching training data, and reliable trigger phrase activation.

Advanced Techniques

LLM-assisted prompt expansion eases the burden of captioning large datasets. A language model can expand basic descriptions into comprehensive captions, ensuring consistent detail density across the dataset.

Hierarchical detail structuring organizes caption information in a consistent sequence: shot type, subject, action, camera movement, lighting, style and mood. This predictable structure helps the model parse captions effectively.
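A small template sketch of that ordering follows; the field names and example values are illustrative.

```python
# Illustrative caption template: shot type -> subject -> action -> camera ->
# lighting -> style/mood, kept in the same order for every sample.
def build_caption(shot: str, subject: str, action: str,
                  camera: str, lighting: str, style: str) -> str:
    return ", ".join([shot, f"{subject} {action}", camera, lighting, style])

print(build_caption(
    shot="medium shot",
    subject="person",
    action="walking confidently through urban street",
    camera="smooth tracking camera movement",
    lighting="golden hour lighting",
    style="cinematic depth of field, warm color grading",
))
```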

Negative space description explicitly defines what is absent. Descriptors like "clean composition with minimal background elements, isolated subject, no distracting motion in background" teach intentional simplicity.

Common Mistakes

Several patterns consistently undermine training effectiveness:

  • Inconsistent caption detail: Variable caption depth creates conflicting training signals
  • Missing temporal information: "Person jumping" omits motion sequence; describe the full action
  • Generic descriptors: Words like "beautiful" provide zero training value; use specific terms
  • Mismatched parameters: Extracting 121-frame training samples from videos with rapid cuts confuses the model
  • Disabled auto_scale_input: Enable for training videos with varying lengths or frame rates

Inference After Training

Always include your trigger phrase at the start of generation prompts, and describe output with vocabulary matching your training captions. The model has learned associations between your caption vocabulary and visual outputs; consistent language activates learned patterns reliably.
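A generation sketch with the fal Python client is shown below; the inference endpoint ID and the shape of the "loras" argument are assumptions, so verify them against the LTX-2 inference endpoint's schema.

```python
# Sketch: generate with the trained LoRA, leading the prompt with the trigger phrase.
import fal_client

result = fal_client.subscribe(
    "fal-ai/ltx-2",                       # hypothetical inference endpoint
    arguments={
        "prompt": (
            "cinematic drone footage, aerial sweep over coastal cliffs at golden hour, "
            "slow forward motion, atmospheric haze, warm color grading"
        ),
        # Hypothetical argument shape for attaching trained LoRA weights.
        "loras": [{"path": "https://example.com/trained_lora.safetensors", "scale": 1.0}],
    },
)
print(result)
```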

Integration

Trained models integrate with production workflows through API calls, ComfyUI workflows, or the open-source LTX-2 codebase. Use webhooks for asynchronous processing: submit training jobs via the queue API, then poll status or receive webhook notifications on completion. The response includes URLs to trained LoRA weights for use with inference endpoints.
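A status-and-retrieval sketch follows, assuming the request_id returned when the job was queued; status() and result() are standard fal Python client helpers, while the endpoint ID and result field names are assumptions.

```python
# Sketch: check a queued training job and fetch the trained weights URL.
import fal_client

APP = "fal-ai/ltx-2/trainer"            # hypothetical endpoint ID
request_id = "your-request-id"           # returned when the job was queued

status = fal_client.status(APP, request_id, with_logs=False)
print(status)                            # Queued / InProgress / Completed

# Once the job reports completion (or your webhook fires), fetch the result.
result = fal_client.result(APP, request_id)
print(result.get("lora_weights_url"))    # hypothetical field holding the LoRA URL
```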

References

  1. HaCohen, Y., et al. "LTX-2: Efficient Joint Audio-Visual Foundation Model." arXiv preprint arXiv:2601.03233, 2026. https://arxiv.org/abs/2601.03233

  2. Hu, E.J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685

About the Author
Zachary Roth
A generative media engineer with a focus on growth, Zach has deep expertise in building RAG architecture for complex content systems.
