LTX-2 Trainer User Manual
This guide explains how to use the LTX-2 trainer to fine-tune LoRA adapters for text-to-video generation with optional audio support.
Overview
The LTX-2 trainer fine-tunes LoRA (Low-Rank Adaptation) adapters on the Lightricks LTX-2 model, a 19-billion parameter audio-video generation model. Key features include:
- Audio-video joint training: Train on videos with synchronized audio
- First-frame conditioning: Improve image-to-video generation capabilities
- Scene splitting: Automatically split long videos into trainable clips
- Validation sampling: Generate sample videos during training to monitor progress
Input Parameters Reference
Dataset Configuration
`training_data_url` (required)
Type: `string`
URL to a zip archive containing your training media.
Supported formats:
- Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
- Images: `.png`, `.jpg`, `.jpeg`
Important: Your dataset must contain ONLY videos OR ONLY images. Mixed datasets are not supported.
Captions: Include a `.txt` file with the same base name as each media file to provide captions:
```
my_video.mp4
my_video.txt   # Contains: "A dog running through a field"
```
Limit: Maximum 1000 media files per training run. For most use cases, a dataset of 15-30 media files is recommended.
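If you prepare datasets programmatically, the pairing convention above is easy to script. The following is a minimal sketch, not part of the trainer itself (the `build_dataset_zip` helper and the `.mp4`-only glob are illustrative assumptions):

```python
import zipfile
from pathlib import Path

def build_dataset_zip(media_dir: str, output_zip: str) -> None:
    """Zip every video plus its same-named .txt caption, if present."""
    media_dir = Path(media_dir)
    with zipfile.ZipFile(output_zip, "w") as zf:
        for video in sorted(media_dir.glob("*.mp4")):   # simplified: videos only
            zf.write(video, arcname=video.name)
            caption = video.with_suffix(".txt")
            if caption.exists():                        # missing captions fall back to empty
                zf.write(caption, arcname=caption.name)

# Example: build_dataset_zip("clips/", "training_data.zip")
```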
LoRA Parameters
`rank`
Type: `8 | 16 | 32 | 64 | 128`
Default: `32`
The rank of the LoRA adaptation matrices. Controls the capacity (expressiveness) of the adapter.
| Rank | Best For |
|---|---|
| 8-16 | Simple style transfer, subtle adjustments |
| 32 | General purpose, recommended starting point |
| 64-128 | Complex concepts if lower ranks don't give good results |
`number_of_steps`
Type: `integer` (100 - 20,000)
Default: `2000`
Total number of training optimization steps.
| Steps | Use Case |
|---|---|
| 1000-2000 | Style transfer with 10-20 videos |
| 2000-6000 | Learning specific subjects/characters |
| 6000+ | Large datasets only (risk of overfitting otherwise) |
`learning_rate`
Type: `float` (1e-6 to 1.0)
Default: `2e-4`
Initial learning rate for the optimizer. The default value works well for most use cases.
Video Configuration
`number_of_frames`
Type: `integer` (9 - 121)
Default: `89`
Number of frames per training sample.
Constraint: Must satisfy `frames % 8 == 1` (valid values: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121). Invalid values are automatically adjusted.
| Frames | Trade-off |
|---|---|
| 9-33 | Faster training, lower memory, less motion learning |
| 65-121 | Better motion learning, more memory, longer training |
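The exact adjustment rule for invalid values is not documented; one plausible sketch clamps to the supported range and rounds down to the nearest `n*8 + 1` value:

```python
def snap_frame_count(frames: int) -> int:
    # Clamp to the supported range, then round down to the nearest value
    # satisfying frames % 8 == 1 (9, 17, ..., 121). The service may round differently.
    frames = max(9, min(121, frames))
    return ((frames - 1) // 8) * 8 + 1

# snap_frame_count(100) -> 97, snap_frame_count(89) -> 89
```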
`frame_rate`
Type: `integer` (8 - 60)
Default: `25`
Target frames per second. Source videos are resampled to this rate.
- 24-30 FPS: Standard for most content
- 8-15 FPS: Slow-motion or artistic effects
- 48-60 FPS: Smooth, high-quality motion
`resolution`
Type: `"low" | "medium" | "high"`
Default: `"medium"`
Training resolution preset. Combined with `aspect_ratio` to determine pixel dimensions.
| Preset | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
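For client-side validation or previewing crop dimensions, the table above maps directly to a lookup (the dictionary name is illustrative, not part of the API):

```python
# (width, height) in pixels per resolution preset and aspect ratio, from the table above.
RESOLUTION_PRESETS = {
    ("low", "16:9"): (512, 288),    ("low", "1:1"): (512, 512),    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448), ("medium", "1:1"): (768, 768), ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),   ("high", "1:1"): (960, 960),   ("high", "9:16"): (544, 960),
}

width, height = RESOLUTION_PRESETS[("medium", "16:9")]  # (768, 448)
```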
`aspect_ratio`
Type: `"16:9" | "1:1" | "9:16"`
Default: `"1:1"`
Aspect ratio for training. Videos are center-cropped to match. Train with the aspect ratio you plan to generate at.
Audio Configuration
`with_audio`
Type: `boolean | null`
Default: `null` (auto-detect)
Enable joint audio-video training.
- `null`: Automatically detects if input videos have audio
- `true`: Forces audio training (fails if videos lack audio)
- `false`: Disables audio even if videos have it
Requirement: Training uses stereo audio; mono tracks are automatically converted to stereo.
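The behavior described above can be summarized in a short sketch (the function name is hypothetical):

```python
def resolve_with_audio(with_audio: bool | None, videos_have_audio: bool) -> bool:
    if with_audio is None:                      # auto-detect from the input videos
        return videos_have_audio
    if with_audio and not videos_have_audio:    # forcing audio without any audio tracks fails
        raise ValueError("with_audio=True but the input videos have no audio")
    return with_audio
```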
`generate_audio_in_validation`
Type: `boolean`
Default: `true`
Whether validation samples should include generated audio.
`audio_normalize`
Type: `boolean`
Default: `true`
Normalize audio peak amplitude to a consistent level (~-0.01 dBFS).
When enabled, all audio tracks are normalized so their peak amplitude reaches approximately 0.999 (just below clipping). This ensures consistent audio levels across your training dataset, preventing loud videos from dominating the training signal.
When to disable:
- If your dataset intentionally has varying volume levels you want to preserve
- If you have clips with intentional silence that should remain quiet
Note: Silent or near-silent audio is not amplified to avoid amplifying background noise.
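Conceptually, the normalization step behaves roughly like the sketch below. This is not the trainer's actual code, and the silence threshold value is an assumption rather than a documented constant:

```python
import numpy as np

def normalize_peak(audio: np.ndarray, target: float = 0.999,
                   silence_threshold: float = 1e-4) -> np.ndarray:
    """Scale audio so its peak reaches ~target; leave near-silent clips untouched."""
    peak = float(np.max(np.abs(audio)))
    if peak < silence_threshold:   # don't amplify silence / background noise
        return audio
    return audio * (target / peak)
```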
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
When audio duration doesn't match video duration (after video preprocessing), stretch or compress the audio to match without changing pitch.
- `true`: Audio is time-stretched using a phase vocoder algorithm (preserves pitch, sounds natural)
- `false`: Audio is trimmed (if longer) or padded with silence (if shorter)
| Scenario | With Preservation | Without Preservation |
|---|---|---|
| Audio longer than video | Compressed, pitch unchanged | Trimmed, content lost |
| Audio shorter than video | Stretched, pitch unchanged | Padded with silence |
| Speech/music | Sounds natural (faster/slower) | May cut off or have silence gaps |
Recommendation: Keep enabled for best quality, especially if your videos have varying frame rates or if video duration changes during preprocessing (e.g., due to scene splitting or frame count adjustments).
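The trainer's implementation is not exposed, but the two behaviors can be sketched with librosa's phase-vocoder time stretch. The helper name, the use of librosa, and the mono (1-D) audio assumption are all illustrative:

```python
import numpy as np
import librosa

def fit_audio_to_video(audio: np.ndarray, sr: int, video_duration_s: float,
                       preserve_pitch: bool = True) -> np.ndarray:
    """Make a mono audio track match the video duration."""
    target_len = int(round(video_duration_s * sr))
    if preserve_pitch:
        # Phase-vocoder stretch: rate > 1 speeds audio up (shortens it), rate < 1 slows it down.
        rate = (len(audio) / sr) / video_duration_s
        return librosa.effects.time_stretch(audio, rate=rate)
    if len(audio) >= target_len:
        return audio[:target_len]                           # trim if longer
    return np.pad(audio, (0, target_len - len(audio)))      # pad with silence if shorter
```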
First Frame Conditioning
`first_frame_conditioning_p`
Type: `float` (0.0 - 1.0)
Default: `0.5`
Probability of conditioning on the first frame during training. Improves image-to-video generation.
| Value | Effect |
|---|---|
| 0.0 | Pure text-to-video, no I2V capability |
| 0.5 | Balanced T2V and I2V (recommended) |
| 0.8-1.0 | Strong I2V focus, may reduce T2V quality |
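In effect, each training sample makes an independent coin flip; a simplified sketch of the idea:

```python
import random

def use_first_frame_conditioning(first_frame_conditioning_p: float) -> bool:
    # With probability p, this step conditions on the clip's first frame (I2V-style);
    # otherwise it trains pure text-to-video on that sample.
    return random.random() < first_frame_conditioning_p
```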
Trigger Phrase
`trigger_phrase`
Type: `string`
Default: `""` (empty)
A phrase prepended to all captions that activates the LoRA style during inference.
Best practices:
- Use unique phrases like `"in the style of MYSTYLE"`
- Keep it short (1-4 words)
- Avoid common words that conflict with other content
Video Processing
`auto_scale_input`
Type: `boolean`
Default: `false`
Automatically scale videos to match the target frame count and FPS. Short videos are extended and long videos are compressed; audio is preserved.
Enable when your videos have varying lengths/FPS.
`split_input_into_scenes`
Type: `boolean`
Default: `true`
Automatically split long videos into scenes using scene detection. Prevents training on clips with abrupt scene changes and increases effective dataset size.
`split_input_duration_threshold`
Type: `float` (1.0 - 60.0)
Default: `30.0`
Duration in seconds above which videos are split into scenes. Only applies when `split_input_into_scenes` is enabled.
Validation Settings
`validation`
Type: `array`
Default: `[]` (empty)
List of validation prompts to generate during training (max 2). Each entry has a `prompt` field and an optional `image_url` for image-to-video validation.
json[ {"prompt": "A cat walking on a beach at sunset"}, {"prompt": "A bird flying over mountains", "image_url": "https://example.com/start.png"} ]
If any entry has `image_url`, all entries must have one.
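A quick client-side check for these two constraints might look like this (hypothetical helper, not part of the API):

```python
def check_validation_entries(entries: list[dict]) -> None:
    if len(entries) > 2:
        raise ValueError("At most 2 validation prompts are allowed")
    with_image = sum("image_url" in e for e in entries)
    if 0 < with_image < len(entries):
        raise ValueError("If any entry has image_url, all entries must have one")
```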
`validation_negative_prompt`
Type: `string`
Default: `"worst quality, inconsistent motion, blurry, jittery, distorted"`
Negative prompt applied to all validation samples.
`validation_number_of_frames`
Type: `integer` (9 - 121)
Default: `89`
Frame count for validation videos. Must satisfy `frames % 8 == 1`.
`validation_frame_rate`
Type: `integer` (8 - 60)
Default: `25`
Target frames per second for validation videos. Use this to generate validation videos at a different FPS than the training `frame_rate`.
`validation_resolution`
Type: `"low" | "medium" | "high"`
Default: `"high"`
Resolution preset for validation videos.
`validation_aspect_ratio`
Type: `"16:9" | "1:1" | "9:16"`
Default: `"1:1"`
Aspect ratio for validation videos.
`stg_scale`
Type: `float` (0.0 - 3.0)
Default: `1.0`
Spatio-Temporal Guidance scale for validation. Enhances video quality through attention perturbation.
- `0.0`: Disable STG
- `1.0`: Recommended balance
- `>1.0`: Stronger guidance, may reduce diversity
How the Training Works
Pipeline Overview
Phase 1: Preprocessing
- Download & extract training data archive
- Detect media type (video/image) and audio presence
- Optional: Split long videos into scenes (each scene inherits the original video's caption)
- Resize and crop videos spatially (scale to fill target resolution, then center-crop to exact dimensions) and temporally (extract the first N frames at target FPS, discarding the rest)
Phase 2: Training
For each training step:
- Load a batch of training samples (videos/images with their captions)
- The model learns to generate videos matching the training data
- With probability `first_frame_conditioning_p`, the model is trained to continue from a given first frame rather than generate from scratch
- At regular intervals, validation videos are generated so you can monitor progress
Phase 3: Output
- Save final LoRA weights (.safetensors)
- Save config with trigger phrase
- Combine validation videos with annotations
What Happens to Your Data
Archive handling: Your zip file is downloaded and fully extracted, including any nested zip files inside.
Caption matching: For each video or image, the trainer looks for a `.txt` file with the same name (e.g., `clip.mp4` pairs with `clip.txt`). If no caption file exists, an empty caption is used.
Scene splitting: When enabled, videos longer than `split_input_duration_threshold` are automatically split at scene boundaries. Each resulting clip uses the same caption as the original video. This means a 2-minute video with multiple scenes might become 5-10 separate training samples.
Video fitting: Each video is resized to fill the target resolution (maintaining aspect ratio), then center-cropped to the exact dimensions. Temporally, the video is resampled to the target FPS and the first `number_of_frames` frames are kept—anything beyond that is discarded. If your video is shorter than `number_of_frames` at the target FPS, it will be skipped unless `auto_scale_input` is enabled.
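The spatial part of this fitting step is simple arithmetic; a sketch of how the dimensions work out (function name is hypothetical):

```python
def fit_dimensions(src_w: int, src_h: int, target_w: int, target_h: int):
    """Scale to fill the target (preserving aspect ratio), then center-crop."""
    scale = max(target_w / src_w, target_h / src_h)
    scaled_w, scaled_h = round(src_w * scale), round(src_h * scale)
    x0 = (scaled_w - target_w) // 2
    y0 = (scaled_h - target_h) // 2
    return (scaled_w, scaled_h), (x0, y0, x0 + target_w, y0 + target_h)

# e.g. a 1920x1080 source at "medium" 1:1 (768x768) is scaled to ~1365x768,
# then cropped to the central 768x768 region.
```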
Trigger phrase: If specified, your trigger phrase is prepended to every caption before training (e.g., caption "A dog running" becomes "mystyle A dog running").
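Put together, caption matching and trigger-phrase handling behave like this sketch (the `load_caption` helper is hypothetical):

```python
from pathlib import Path

def load_caption(media_path: Path, trigger_phrase: str = "") -> str:
    # Pair each media file with a .txt caption of the same base name;
    # fall back to an empty caption, then prepend the trigger phrase.
    caption_path = media_path.with_suffix(".txt")
    caption = caption_path.read_text().strip() if caption_path.exists() else ""
    return f"{trigger_phrase} {caption}".strip() if trigger_phrase else caption

# With clip.txt containing "A dog running":
# load_caption(Path("clip.mp4"), "mystyle") -> "mystyle A dog running"
```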
LoRA Training
Instead of modifying all 19 billion parameters in the model, LoRA training adds small adapter layers that learn your specific style or concept. The `rank` parameter controls the capacity of these adapters—higher rank means more learning capacity but also more risk of overfitting.
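As a generic illustration (not LTX-2's exact implementation; the `alpha / r` scaling convention is a common default assumed here), a LoRA layer adds a trainable low-rank update on top of a frozen weight:

```python
import torch

def lora_linear(x: torch.Tensor, W: torch.Tensor,
                A: torch.Tensor, B: torch.Tensor, alpha: float = 32.0) -> torch.Tensor:
    # W: frozen (d_out, d_in) base weight; A: (r, d_in) and B: (d_out, r) are trained.
    # Only r * (d_in + d_out) extra parameters per layer instead of d_out * d_in.
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```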
First-Frame Conditioning
When `first_frame_conditioning_p` is greater than 0, some training steps teach the model to continue a video from a given starting image rather than generating from scratch. This improves image-to-video quality when using your trained LoRA.
Audio-Video Joint Training
When audio training is enabled, the model learns to generate synchronized audio and video together. The same LoRA adapters affect both audio and video generation.
Audio preprocessing: Before training, audio is automatically processed to match the video:
- Duration matching: Audio is stretched or compressed to match video duration. With `audio_preserve_pitch` enabled (default), this uses a phase vocoder to change speed without altering pitch, so voices still sound natural, just faster or slower.
- Level normalization: With `audio_normalize` enabled (default), all audio tracks are normalized to a consistent peak level. This ensures the model learns from consistent audio regardless of the original recording levels.
Validation Sampling
At regular intervals during training, the trainer generates sample videos using your validation prompts. These samples show how the LoRA is progressing and are combined into a single output video with step numbers and prompts annotated. Use these to monitor for overfitting or underfitting.
Tips for Getting Good Results
Creating a Quality Dataset
Recommended dataset sizes:
| Goal | Size |
|---|---|
| Style transfer | 10-30 videos |
| Character/subject learning | 10-30 videos |
| Complex concept | up to 100 videos |
Video quality checklist:
- Consistent style across all videos
- Clear visibility of subjects
- Good lighting (avoid extreme dark/bright)
- Stable footage (minimize camera shake)
- Relevant content only
Caption best practices:
- Be descriptive but concise (1-3 sentences)
- Describe what's visible: subjects, actions, setting, style
- Include motion descriptions: "walking", "camera pans left"
- Note visual style: "cinematic lighting", "anime style"
- Use consistent terminology across captions
Good caption:
A golden retriever running through a sunlit meadow, wildflowers swaying. The camera follows at a low angle. Warm golden hour lighting.
Poor caption:
dog running
Using trigger phrases:
For subject/object LoRAs (e.g., a specific character or product), use a unique token in your captions to represent the subject, and set it as your trigger phrase. For example, if training on videos of a specific person, use a made-up token like "sks" or "ohwx" in your captions: "A sks person walking down the street." This helps the model associate the learned appearance with a specific word you can use at inference time.
For style LoRAs (e.g., anime style, vintage film look), use a phrase that describes the style as your trigger, such as "in mystyle" or "vintage8mm style". Your captions would then be: "in mystyle A cat sitting on a windowsill" so the model learns to associate the visual style with that phrase.
Captions and scene splitting:
Be careful when using scene splitting with detailed captions. Since each scene inherits the original video's caption, the caption may no longer accurately describe the content after splitting. For example, if your caption says "A woman walks through a park, then sits on a bench" and the video is split into two scenes, both scenes will have this caption—but the first scene might only show walking, and the second only sitting. Consider using more general captions that apply to all scenes, or disable scene splitting if your captions are very specific.
Caption format at inference:
When using your trained LoRA, write prompts in the same style as your training captions. If your training captions were detailed and descriptive, use detailed prompts. If they were short, use short prompts. The model learns to associate your style/subject with the caption patterns it saw during training, so matching that format helps activate the LoRA effectively.
Recommended Starting Configuration
json{ "rank": 32, "number_of_steps": 2000, "learning_rate": 2e-4, "number_of_frames": 89, "resolution": "medium", "first_frame_conditioning_p": 0.5, "split_input_into_scenes": true, "audio_normalize": true, "audio_preserve_pitch": true }
Diagnosing Issues
Overfitting signs:
- Validation videos look exactly like training data
- Artifacts or "copied" training frames
Solutions: Reduce steps, reduce rank, increase dataset diversity
Underfitting signs:
- LoRA has minimal effect on generated videos
- Style/subject not learned
Solutions: Increase steps, increase rank, improve captions
Training failures or slow training:
- OOM crashes
- Very slow training
Solutions: Reduce resolution to "low" or "medium", reduce frame count, reduce rank
Validation Prompt Tips
Write prompts that:
- Use the trigger phrase
- Test different scenarios
- Mix simple and complex compositions
Example for a style LoRA:
json[ {"prompt": "in mystyle A woman walking through a city at night, neon reflections"}, {"prompt": "in mystyle A bird flying over a mountain lake at sunrise"} ]
Audio Processing Tips
Dataset preparation for audio:
- Ensure source videos have clear audio without excessive background noise
- Audio should be properly synced with video content
- Stereo audio works best (mono is automatically converted)
Understanding audio preprocessing:
- Videos with varying durations are time-stretched to match the target frame count
- With `audio_preserve_pitch=true`, a 3-second clip stretched to 4 seconds will sound 25% slower but at the same pitch
- With `audio_normalize=true`, a quiet recording and a loud recording will both have similar peak levels after preprocessing
When to adjust defaults:
- Disable `audio_normalize` if volume variations are intentional (e.g., ASMR content with whispers)
- Disable `audio_preserve_pitch` only if you don't care about natural audio quality or if your clips already match the target duration exactly
Common Pitfalls
- Too few samples: Fewer than 10 videos rarely work well
- Inconsistent captions: Generic/random captions hurt learning
- Mixed content: Unrelated videos dilute the learned concept
- Too many steps: Watch validation—more steps can mean overfitting
- Wrong aspect ratio: Train with the ratio you'll generate at
- Ignoring validation: Always monitor validation outputs
- Inconsistent audio levels: If your dataset has videos recorded at vastly different volumes, training may be inconsistent. Enable `audio_normalize` (default) to fix this.
- Unnatural audio speed: If videos are resized/trimmed and `audio_preserve_pitch` is disabled, speech may sound chipmunk-like or have awkward silence. Keep it enabled (default) for natural-sounding audio.