Train custom LTX-2 video models by treating captions as instructions rather than labels. Use distinctive trigger phrases, detailed cinematographic descriptions, and properly configured frame counts to teach specific styles, motion patterns, and visual transformations. Training costs $0.0048 per step, with 2000-step runs completing in minutes on fal.
Training Video Models Through Prompts
The LTX-2 Video Trainer operates on a fundamentally different paradigm than standard text-to-video generation. Rather than synthesizing content from textual descriptions, you are instructing the model to recognize and reproduce specific transformations, styles, or visual patterns through video-to-video learning. This architectural distinction shapes every aspect of prompt engineering for custom model training.
LTX-2 is built on a Diffusion Transformer architecture with approximately 19 billion parameters, designed to generate synchronized audiovisual content [1]. When training a custom model, your captions function as teaching material rather than generation prompts. A sparse caption like "person walking" provides minimal training signal. A detailed description such as "medium shot of person walking confidently through urban street, cinematic depth of field, golden hour lighting, smooth tracking camera movement" gives the model rich, structured information to learn from.
Dataset Requirements
Your training data URL must point to a zip archive containing media files with corresponding caption text files. The following constraints apply due to LTX-2's VAE architecture:
| Requirement | Specification |
|---|---|
| Content type | Exclusively videos OR exclusively images (never mixed) |
| Spatial dimensions | Width and height must be multiples of 32 |
| Frame count | Must satisfy frames % 8 == 1 (e.g. 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121) |
| Caption pairing | Each media file requires a .txt file with matching filename |
| Minimum dataset size | 15-20 samples for style transfer; more for complex subjects |
Memory usage scales with both spatial and temporal dimensions. For detailed, high-quality output, use larger spatial dimensions (768x448) with fewer frames (89). For motion-focused training, use smaller spatial dimensions (512x512) with more frames (121).
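Before uploading, it is worth checking these constraints locally. The sketch below is an illustrative script, not an official tool: it uses OpenCV to flag dimension, frame-count, and caption-pairing violations in a folder before you zip it.

```python
# Illustrative pre-upload check for an LTX-2 training folder (not an official tool).
# Verifies: width/height are multiples of 32, frame count satisfies frames % 8 == 1,
# and every video has a matching .txt caption file.
from pathlib import Path
import cv2  # pip install opencv-python


def validate_dataset(folder: str) -> list[str]:
    problems = []
    for video in Path(folder).glob("*.mp4"):
        cap = cv2.VideoCapture(str(video))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()

        if width % 32 or height % 32:
            problems.append(f"{video.name}: {width}x{height} is not a multiple of 32")
        if frames % 8 != 1:
            problems.append(f"{video.name}: {frames} frames violates frames % 8 == 1")
        if not video.with_suffix(".txt").exists():
            problems.append(f"{video.name}: missing caption file")
    return problems


if __name__ == "__main__":
    for issue in validate_dataset("./training_data"):
        print(issue)
```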
Trigger Phrase Architecture
The trigger phrase is prepended to every caption during training, establishing a unique activation key for your model's learned behaviors. The same phrase must appear during inference to activate the learned patterns.
Effective trigger phrases share several characteristics:
- Distinctiveness with semantic meaning ("cinematic drone footage" rather than arbitrary terms)
- Alignment with training objectives ("premium product showcase" for product videos)
- Consistency between training and inference
The trigger phrase creates a namespace for your custom model's knowledge, preventing conflicts with base model training.
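A minimal illustration of how the trigger phrase frames both sides of the workflow (the exact way the trainer joins phrase and caption is an assumption here):

```python
# Conceptual illustration of a trigger phrase leading both training captions and
# inference prompts. The exact concatenation the trainer performs is an assumption.
TRIGGER = "cinematic drone footage"

caption_on_disk = "aerial view of coastline at sunrise, slow forward glide, soft warm light"
effective_training_caption = f"{TRIGGER} {caption_on_disk}"

inference_prompt = f"{TRIGGER} sweeping pass over snow-covered mountain ridge, golden hour"

print(effective_training_caption)
print(inference_prompt)
```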
Caption Strategies by Training Objective
Caption structure should align precisely with your training goals.
Style transfer training prioritizes visual aesthetics over scene content. When training on anime-style videos, emphasize "vibrant color palette, cel-shaded rendering, dynamic action lines, exaggerated motion blur" rather than describing narrative content.
Motion pattern training requires explicit description of camera and subject dynamics. Descriptors like "smooth dolly-in shot, subject remains centered, gradual background defocus" communicate temporal and spatial relationships the model should internalize.
Subject consistency training balances identity markers with contextual variety: "The red sports car accelerating on highway, low angle, sunset lighting" paired with "The red sports car parked in urban setting, eye-level shot, overcast day" teaches subject preservation across contexts.
Parameter Configuration and Costs
Training runs asynchronously on fal infrastructure. Submit jobs via the queue API with webhooks for completion notification rather than blocking on results. Training costs $0.0048 per step; a default 2000-step run costs $9.60 and typically completes in minutes.
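A hedged sketch of a queue submission with the Python fal_client follows; the endpoint ID and argument names are assumptions, so verify them against the endpoint's schema before use.

```python
# Minimal sketch of submitting an asynchronous training job via the fal queue API.
# The endpoint ID and argument names below are assumptions -- check the endpoint's
# schema and current fal_client docs before relying on them.
import fal_client  # pip install fal-client

handle = fal_client.submit(
    "fal-ai/ltx-2-trainer",  # assumed endpoint ID
    arguments={
        "training_data_url": "https://example.com/dataset.zip",  # zip of media + .txt captions
        "trigger_phrase": "premium product showcase",
        "steps": 2000,  # 2000 steps * $0.0048/step = $9.60
        "rank": 32,
    },
    webhook_url="https://your-server.example.com/fal-webhook",  # completion callback
)
print(handle.request_id)  # keep this ID to poll status or fetch the result later
```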
| Parameter | Range | Tradeoffs |
|---|---|---|
| Training steps | 2000-4000 typical | Higher values improve fidelity but increase cost and overfitting risk |
| Rank | 16, 32, 64, 128 | Higher rank increases learning capacity and memory usage; 32 suits most style transfers, 64-128 for complex subjects |
| Resolution | low, medium, high | Higher resolution requires higher-quality source material and increases training time |
| Frame count | 49-121 (frames % 8 == 1) | Match to content: 49-65 for quick actions, 89-97 for cinematic motion |
Rank controls LoRA adaptation capacity. Low-Rank Adaptation reduces trainable parameters by learning pairs of rank-decomposition matrices while freezing original weights [2]. Higher rank values increase overfitting risk with limited training data.
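To make the rank tradeoff concrete: a LoRA update for a d x k weight matrix learns B (d x r) and A (r x k), so it adds r * (d + k) trainable parameters per adapted matrix. The hidden size below is an illustrative assumption, not LTX-2's actual layer width.

```python
# Trainable LoRA parameters per adapted d x k weight matrix: r * (d + k).
# The 4096 hidden size is an illustrative assumption, not LTX-2's real layer width.
d = k = 4096
for rank in (16, 32, 64, 128):
    params = rank * (d + k)
    print(f"rank {rank:>3}: {params:,} trainable params per adapted matrix")
# Doubling rank doubles capacity and memory -- and the overfitting risk on small datasets.
```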
Practical Prompt Examples
Product visualization:
Trigger phrase: "premium product showcase"
Caption: "Premium product showcase of smartphone rotating on pedestal, studio lighting with soft key light from left, subtle rim lighting, clean white background, smooth 360-degree rotation, shallow depth of field"
Cinematic style:
Trigger phrase: "cinematic establishing shot"
Caption: "Cinematic establishing shot of mountain landscape at dawn, slow push-in camera movement, atmospheric haze in valleys, warm color grading with teal shadows, anamorphic lens characteristics, film grain texture"
Character animation:
Trigger phrase: "character animation style"
Caption: "Character animation style showing figure walking cycle, bouncy exaggerated motion, clear silhouette, anticipation and follow-through animation principles, vibrant saturated colors, smooth 24fps motion"
Diagnosing Training Issues
Training produces three distinct failure modes with observable symptoms:
Overfitting manifests as literal reproduction of training content, inability to handle novel subjects, or artifacts when prompts deviate from training captions. Reduce training steps or rank.
Underfitting produces weak style application, inconsistent outputs, or models that ignore learned patterns. Increase training steps, verify caption quality, or expand the training dataset.
Dataset problems appear as training failures or erratic outputs. Verify that spatial dimensions are multiples of 32, frame counts satisfy the modulo constraint, and all media files have corresponding caption files.
Successful training shows consistent style application across novel subjects, motion characteristics matching training data, and reliable trigger phrase activation.
Advanced Techniques
LLM-assisted prompt expansion eases captioning for large datasets. Language models can expand basic descriptions into comprehensive captions, ensuring consistent detail density.
Hierarchical detail structuring organizes caption information in a consistent sequence: shot type, subject, action, camera movement, lighting, style and mood. This predictable structure helps the model parse captions effectively.
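One way to enforce that ordering is to assemble every caption from the same fields; the breakdown below is one possible convention, not a trainer requirement.

```python
# Assemble captions in a fixed order: shot type, subject, action, camera, lighting, style/mood.
# The field breakdown is one possible convention, not something the trainer requires.
def build_caption(shot, subject, action, camera, lighting, style):
    return ", ".join([f"{shot} of {subject} {action}", camera, lighting, style])


print(build_caption(
    shot="medium shot",
    subject="barista",
    action="pouring latte art in slow motion",
    camera="locked-off camera, shallow depth of field",
    lighting="soft window light from the right",
    style="warm color grading, calm mood",
))
```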
Negative space description explicitly defines what is absent. Descriptors like "clean composition with minimal background elements, isolated subject, no distracting motion in background" teach intentional simplicity.
Common Mistakes
Several patterns consistently undermine training effectiveness:
- Inconsistent caption detail: Variable caption depth creates conflicting training signals
- Missing temporal information: "Person jumping" omits motion sequence; describe the full action
- Generic descriptors: Words like "beautiful" provide zero training value; use specific terms
- Mismatched parameters: Training 121-frame samples on videos with rapid cuts confuses the model
- Disabled auto_scale_input: Keep it enabled when training videos vary in length or frame rate
Inference After Training
Always include your trigger phrase at the start of generation prompts, and describe output with vocabulary matching your training captions. The model has learned associations between your caption vocabulary and visual outputs; consistent language activates learned patterns reliably.
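A sketch of what that looks like in practice follows; the inference endpoint ID and the LoRA argument format are assumptions to verify against the endpoint's schema.

```python
# Sketch of running inference with a trained LoRA. The endpoint ID and the exact
# "loras" argument format are assumptions -- verify against the inference endpoint's schema.
import fal_client

result = fal_client.subscribe(
    "fal-ai/ltx-2",  # assumed inference endpoint
    arguments={
        # Lead with the trigger phrase, then reuse training-caption vocabulary.
        "prompt": (
            "premium product showcase of wireless earbuds rotating on pedestal, "
            "studio lighting with soft key light, clean white background, "
            "smooth 360-degree rotation, shallow depth of field"
        ),
        "loras": [{"path": "https://example.com/your_trained_lora.safetensors"}],  # assumed shape
    },
)
print(result)
```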
Integration
Trained models integrate with production workflows through API calls, ComfyUI workflows, or the open-source LTX-2 codebase. Use webhooks for asynchronous processing: submit training jobs via the queue API, then poll status or receive webhook notifications on completion. The response includes URLs to trained LoRA weights for use with inference endpoints.
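If you poll rather than rely on webhooks, a sketch like the following retrieves status and result by request ID; the field holding the LoRA weights URL is an assumption, so inspect the actual response payload.

```python
# Retrieve a finished training job's output by request ID. The result field that
# holds the LoRA weights URL ("lora_file" below) is an assumption -- inspect the
# actual response payload for your endpoint.
import fal_client

ENDPOINT = "fal-ai/ltx-2-trainer"  # assumed endpoint ID, as above
REQUEST_ID = "your-request-id"     # returned when the job was submitted

status = fal_client.status(ENDPOINT, REQUEST_ID, with_logs=True)
print(status)  # Queued / InProgress / Completed

result = fal_client.result(ENDPOINT, REQUEST_ID)  # fetch once the job has completed
lora_url = result.get("lora_file")                # assumed field name
print("Trained LoRA weights:", lora_url)
```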
References
1. HaCohen, Y., et al. "LTX-2: Efficient Joint Audio-Visual Foundation Model." arXiv preprint arXiv:2601.03233, 2026. https://arxiv.org/abs/2601.03233
2. Hu, E.J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685
