LTX-2 Trainer User Manual
This guide explains how to use the LTX-2 trainer to fine-tune LoRA adapters for text-to-video generation with optional audio support.
Overview
The LTX-2 trainer fine-tunes LoRA (Low-Rank Adaptation) adapters on the Lightricks LTX-2 model, a 19-billion parameter audio-video generation model. Key features include:
- Audio-video joint training: Train on videos with synchronized audio
- First-frame conditioning: Improve image-to-video generation capabilities
- Scene splitting: Automatically split long videos into trainable clips
- Validation sampling: Generate sample videos during training to monitor progress
Input Parameters Reference
Dataset Configuration
`training_data_url` (required)
Type: `string`
URL to a zip archive containing your training media.
Supported formats:
- Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
- Images: `.png`, `.jpg`, `.jpeg`
Important: Your dataset must contain ONLY videos OR ONLY images. Mixed datasets are not supported.
Captions: Include a `.txt` file with the same base name as each media file to provide captions:
```
my_video.mp4
my_video.txt   # Contains: "A dog running through a field"
```
Limit: Maximum 1000 media files per training run. For most use cases, a dataset of 15-30 media files is recommended.
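If you prepare datasets programmatically, the pairing convention above is easy to script. The following is a minimal sketch, not part of the trainer itself (the `build_dataset_zip` helper and the `.mp4`-only glob are illustrative assumptions):

```python
import zipfile
from pathlib import Path

def build_dataset_zip(media_dir: str, output_zip: str) -> None:
    """Zip every video plus its same-named .txt caption, if present."""
    media_dir = Path(media_dir)
    with zipfile.ZipFile(output_zip, "w") as zf:
        for video in sorted(media_dir.glob("*.mp4")):   # simplified: videos only
            zf.write(video, arcname=video.name)
            caption = video.with_suffix(".txt")
            if caption.exists():                        # missing captions fall back to empty
                zf.write(caption, arcname=caption.name)

# Example: build_dataset_zip("clips/", "training_data.zip")
```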
LoRA Parameters
`rank`
Type: `8 | 16 | 32 | 64 | 128`
Default: `32`
The rank of the LoRA adaptation matrices. Controls the capacity (expressiveness) of the adapter.
| Rank | Best For |
|---|---|
| 8-16 | Simple style transfer, subtle adjustments |
| 32 | General purpose, recommended starting point |
| 64-128 | Complex concepts if lower ranks don't give good results |
`number_of_steps`
Type: `integer` (100 - 20,000)
Default: `2000`
Total number of training optimization steps.
| Steps | Use Case |
|---|---|
| 1000-2000 | Style transfer with 10-20 videos |
| 2000-6000 | Learning specific subjects/characters |
| 6000+ | Large datasets only (risk of overfitting otherwise) |
`learning_rate`
Type: `float` (1e-6 to 1.0)
Default: `2e-4`
Initial learning rate for the optimizer. The default value works well for most use cases.
Video Configuration
`number_of_frames`
Type: `integer` (9 - 121)
Default: `89`
Number of frames per training sample.
Constraint: Must satisfy `frames % 8 == 1` (valid values: 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121). Invalid values are automatically adjusted.
| Frames | Trade-off |
|---|---|
| 9-33 | Faster training, lower memory, less motion learning |
| 65-121 | Better motion learning, more memory, longer training |
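The exact adjustment rule for invalid values is not documented; one plausible sketch clamps to the supported range and rounds down to the nearest `n*8 + 1` value:

```python
def snap_frame_count(frames: int) -> int:
    # Clamp to the supported range, then round down to the nearest value
    # satisfying frames % 8 == 1 (9, 17, ..., 121). The service may round differently.
    frames = max(9, min(121, frames))
    return ((frames - 1) // 8) * 8 + 1

# snap_frame_count(100) -> 97, snap_frame_count(89) -> 89
```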
`frame_rate`
Type: `integer` (8 - 60)
Default: `25`
Target frames per second. Source videos are resampled to this rate.
- 24-30 FPS: Standard for most content
- 8-15 FPS: Slow-motion or artistic effects
- 48-60 FPS: Smooth, high-quality motion
`resolution`
Type: `"low" | "medium" | "high"`
Default: `"medium"`
Training resolution preset. Combined with `aspect_ratio` to determine pixel dimensions.
| Preset | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
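For client-side validation or previewing crop dimensions, the table above maps directly to a lookup (the dictionary name is illustrative, not part of the API):

```python
# (width, height) in pixels per resolution preset and aspect ratio, from the table above.
RESOLUTION_PRESETS = {
    ("low", "16:9"): (512, 288),    ("low", "1:1"): (512, 512),    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448), ("medium", "1:1"): (768, 768), ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),   ("high", "1:1"): (960, 960),   ("high", "9:16"): (544, 960),
}

width, height = RESOLUTION_PRESETS[("medium", "16:9")]  # (768, 448)
```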
`aspect_ratio`
Type: `"16:9" | "1:1" | "9:16"`
Default: `"1:1"`
Aspect ratio for training. Videos are center-cropped to match. Train with the aspect ratio you plan to generate at.
Audio Configuration
`with_audio`
Type: `boolean | null`
Default: `null` (auto-detect)
Enable joint audio-video training.
- `null`: Automatically detects if input videos have audio
- `true`: Forces audio training (fails if videos lack audio)
- `false`: Disables audio even if videos have it
Requirement: Training uses stereo audio; mono tracks are automatically converted to stereo.
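The behavior described above can be summarized in a short sketch (the function name is hypothetical):

```python
def resolve_with_audio(with_audio: bool | None, videos_have_audio: bool) -> bool:
    if with_audio is None:                      # auto-detect from the input videos
        return videos_have_audio
    if with_audio and not videos_have_audio:    # forcing audio without any audio tracks fails
        raise ValueError("with_audio=True but the input videos have no audio")
    return with_audio
```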
`generate_audio_in_validation`
Type: `boolean`
Default: `true`
Whether validation samples should include generated audio.
`audio_normalize`
Type: `boolean`
Default: `true`
Normalize audio peak amplitude to a consistent level (~-0.01 dBFS).
When enabled, all audio tracks are normalized so their peak amplitude reaches approximately 0.999 (just below clipping). This ensures consistent audio levels across your training dataset, preventing loud videos from dominating the training signal.
When to disable:
- If your dataset intentionally has varying volume levels you want to preserve
- If you have clips with intentional silence that should remain quiet
Note: Silent or near-silent audio is not amplified to avoid amplifying background noise.
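Conceptually, the normalization step behaves roughly like the sketch below. This is not the trainer's actual code, and the silence threshold value is an assumption rather than a documented constant:

```python
import numpy as np

def normalize_peak(audio: np.ndarray, target: float = 0.999,
                   silence_threshold: float = 1e-4) -> np.ndarray:
    """Scale audio so its peak reaches ~target; leave near-silent clips untouched."""
    peak = float(np.max(np.abs(audio)))
    if peak < silence_threshold:   # don't amplify silence / background noise
        return audio
    return audio * (target / peak)
```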
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
When audio duration doesn't match video duration (after video preprocessing), stretch or compress the audio to match without changing pitch.
- `true`: Audio is time-stretched using a phase vocoder algorithm (preserves pitch, sounds natural)
- `false`: Audio is trimmed (if longer) or padded with silence (if shorter)
| Scenario | With Preservation | Without Preservation |
|---|---|---|
| Audio longer than video | Compressed, pitch unchanged | Trimmed, content lost |
| Audio shorter than video | Stretched, pitch unchanged | Padded with silence |
| Speech/music | Sounds natural (faster/slower) | May cut off or have silence gaps |
Recommendation: Keep enabled for best quality, especially if your videos have varying frame rates or if video duration changes during preprocessing (e.g., due to scene splitting or frame count adjustments).
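The trainer's implementation is not exposed, but the two behaviors can be sketched with librosa's phase-vocoder time stretch. The helper name, the use of librosa, and the mono (1-D) audio assumption are all illustrative:

```python
import numpy as np
import librosa

def fit_audio_to_video(audio: np.ndarray, sr: int, video_duration_s: float,
                       preserve_pitch: bool = True) -> np.ndarray:
    """Make a mono audio track match the video duration."""
    target_len = int(round(video_duration_s * sr))
    if preserve_pitch:
        # Phase-vocoder stretch: rate > 1 speeds audio up (shortens it), rate < 1 slows it down.
        rate = (len(audio) / sr) / video_duration_s
        return librosa.effects.time_stretch(audio, rate=rate)
    if len(audio) >= target_len:
        return audio[:target_len]                           # trim if longer
    return np.pad(audio, (0, target_len - len(audio)))      # pad with silence if shorter
```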
First Frame Conditioning
`first_frame_conditioning_p`
Type: `float` (0.0 - 1.0)
Default: `0.5`
Probability of conditioning on the first frame during training. Improves image-to-video generation.
| Value | Effect |
|---|---|
| 0.0 | Pure text-to-video, no I2V capability |
| 0.5 | Balanced T2V and I2V (recommended) |
| 0.8-1.0 | Strong I2V focus, may reduce T2V quality |
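In effect, each training sample makes an independent coin flip; a simplified sketch of the idea:

```python
import random

def use_first_frame_conditioning(first_frame_conditioning_p: float) -> bool:
    # With probability p, this step conditions on the clip's first frame (I2V-style);
    # otherwise it trains pure text-to-video on that sample.
    return random.random() < first_frame_conditioning_p
```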
Trigger Phrase
`trigger_phrase`
Type: `string`
Default: `""` (empty)
A phrase prepended to all captions that activates the LoRA style during inference.
Best practices:
- Use unique phrases like `"in the style of MYSTYLE"`
- Keep it short (1-4 words)
- Avoid common words that conflict with other content
Video Processing
`auto_scale_input`
Type: `boolean`
Default: `false`
Automatically scale videos to match the target frame count and FPS. Short videos are extended and long videos are compressed; audio is preserved.
Enable when your videos have varying lengths/FPS.
`split_input_into_scenes`
Type: `boolean`
Default: `true`
Automatically split long videos into scenes using scene detection. Prevents training on clips with abrupt scene changes and increases effective dataset size.
`split_input_duration_threshold`
Type: `float` (1.0 - 60.0)
Default: `30.0`
Duration in seconds above which videos are split into scenes. Only applies when `split_input_into_scenes` is enabled.
Validation Settings
`validation`
Type: `array`
Default: `[]` (empty)
List of validation prompts to generate during training (max 2). Each entry has a `prompt` field and an optional `image_url` for image-to-video validation.
json[ {"prompt": "A cat walking on a beach at sunset"}, {"prompt": "A bird flying over mountains", "image_url": "https://example.com/start.png"} ]
If any entry has `image_url`, all entries must have one.
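A quick client-side check for these two constraints might look like this (hypothetical helper, not part of the API):

```python
def check_validation_entries(entries: list[dict]) -> None:
    if len(entries) > 2:
        raise ValueError("At most 2 validation prompts are allowed")
    with_image = sum("image_url" in e for e in entries)
    if 0 < with_image < len(entries):
        raise ValueError("If any entry has image_url, all entries must have one")
```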
`validation_negative_prompt`
Type: `string`
Default: `"worst quality, inconsistent motion, blurry, jittery, distorted"`
Negative prompt applied to all validation samples.
`validation_number_of_frames`
Type: `integer` (9 - 121)
Default: `89`
Frame count for validation videos. Must satisfy `frames % 8 == 1`.
`validation_frame_rate`
Type: `integer` (8 - 60)
Default: `25`
Target frames per second for validation videos. Use this to generate validation videos at a different FPS than the training `frame_rate`.
`validation_resolution`
Type: `"low" | "medium" | "high"`
Default: `"high"`
Resolution preset for validation videos.
`validation_aspect_ratio`
Type: `"16:9" | "1:1" | "9:16"`
Default: `"1:1"`
Aspect ratio for validation videos.
`stg_scale`
Type: `float` (0.0 - 3.0)
Default: `1.0`
Spatio-Temporal Guidance scale for validation. Enhances video quality through attention perturbation.
- `0.0`: Disable STG
- `1.0`: Recommended balance
- `>1.0`: Stronger guidance, may reduce diversity
How the Training Works
Pipeline Overview
Phase 1: Preprocessing
- Download & extract training data archive
- Detect media type (video/image) and audio presence
- Optional: Split long videos into scenes (each scene inherits the original video's caption)
- Resize and crop videos spatially (scale to fill target resolution, then center-crop to exact dimensions) and temporally (extract the first N frames at target FPS, discarding the rest)
Phase 2: Training
For each training step:
- Load a batch of training samples (videos/images with their captions)
- The model learns to generate videos matching the training data
- With probability `first_frame_conditioning_p`, the model is trained to continue from a given first frame rather than generate from scratch
- At regular intervals, validation videos are generated so you can monitor progress
Phase 3: Output
- Save final LoRA weights (.safetensors)
- Save config with trigger phrase
- Combine validation videos with annotations
What Happens to Your Data
Archive handling: Your zip file is downloaded and fully extracted, including any nested zip files inside.
Caption matching: For each video or image, the trainer looks for a `.txt` file with the same name (e.g., `clip.mp4` pairs with `clip.txt`). If no caption file exists, an empty caption is used.
Scene splitting: When enabled, videos longer than `split_input_duration_threshold` are automatically split at scene boundaries. Each resulting clip uses the same caption as the original video. This means a 2-minute video with multiple scenes might become 5-10 separate training samples.
Video fitting: Each video is resized to fill the target resolution (maintaining aspect ratio), then center-cropped to the exact dimensions. Temporally, the video is resampled to the target FPS and the first `number_of_frames` frames are kept—anything beyond that is discarded. If your video is shorter than `number_of_frames` at the target FPS, it will be skipped unless `auto_scale_input` is enabled.
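The spatial part of this fitting step is simple arithmetic; a sketch of how the dimensions work out (function name is hypothetical):

```python
def fit_dimensions(src_w: int, src_h: int, target_w: int, target_h: int):
    """Scale to fill the target (preserving aspect ratio), then center-crop."""
    scale = max(target_w / src_w, target_h / src_h)
    scaled_w, scaled_h = round(src_w * scale), round(src_h * scale)
    x0 = (scaled_w - target_w) // 2
    y0 = (scaled_h - target_h) // 2
    return (scaled_w, scaled_h), (x0, y0, x0 + target_w, y0 + target_h)

# e.g. a 1920x1080 source at "medium" 1:1 (768x768) is scaled to ~1365x768,
# then cropped to the central 768x768 region.
```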
Trigger phrase: If specified, your trigger phrase is prepended to every caption before training (e.g., caption "A dog running" becomes "mystyle A dog running").
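Put together, caption matching and trigger-phrase handling behave like this sketch (the `load_caption` helper is hypothetical):

```python
from pathlib import Path

def load_caption(media_path: Path, trigger_phrase: str = "") -> str:
    # Pair each media file with a .txt caption of the same base name;
    # fall back to an empty caption, then prepend the trigger phrase.
    caption_path = media_path.with_suffix(".txt")
    caption = caption_path.read_text().strip() if caption_path.exists() else ""
    return f"{trigger_phrase} {caption}".strip() if trigger_phrase else caption

# With clip.txt containing "A dog running":
# load_caption(Path("clip.mp4"), "mystyle") -> "mystyle A dog running"
```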
LoRA Training
Instead of modifying all 19 billion parameters in the model, LoRA training adds small adapter layers that learn your specific style or concept. The `rank` parameter controls the capacity of these adapters—higher rank means more learning capacity but also more risk of overfitting.
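As a generic illustration (not LTX-2's exact implementation; the `alpha / r` scaling convention is a common default assumed here), a LoRA layer adds a trainable low-rank update on top of a frozen weight:

```python
import torch

def lora_linear(x: torch.Tensor, W: torch.Tensor,
                A: torch.Tensor, B: torch.Tensor, alpha: float = 32.0) -> torch.Tensor:
    # W: frozen (d_out, d_in) base weight; A: (r, d_in) and B: (d_out, r) are trained.
    # Only r * (d_in + d_out) extra parameters per layer instead of d_out * d_in.
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```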
First-Frame Conditioning
When `first_frame_conditioning_p` is greater than 0, some training steps teach the model to continue a video from a given starting image rather than generating from scratch. This improves image-to-video quality when using your trained LoRA.
Audio-Video Joint Training
When audio training is enabled, the model learns to generate synchronized audio and video together. The same LoRA adapters affect both audio and video generation.
Audio preprocessing: Before training, audio is automatically processed to match the video:
- Duration matching: Audio is stretched or compressed to match video duration. With `audio_preserve_pitch` enabled (default), this uses a phase vocoder to change speed without altering pitch, so voices still sound natural, just faster or slower.
- Level normalization: With `audio_normalize` enabled (default), all audio tracks are normalized to a consistent peak level. This ensures the model learns from consistent audio regardless of the original recording levels.
Validation Sampling
At regular intervals during training, the trainer generates sample videos using your validation prompts. These samples show how the LoRA is progressing and are combined into a single output video with step numbers and prompts annotated. Use these to monitor for overfitting or underfitting.
Tips for Getting Good Results
Creating a Quality Dataset
Recommended dataset sizes:
| Goal | Size |
|---|---|
| Style transfer | 10-30 videos |
| Character/subject learning | 10-30 videos |
| Complex concept | up to 100 videos |
Video quality checklist:
- Consistent style across all videos
- Clear visibility of subjects
- Good lighting (avoid extreme dark/bright)
- Stable footage (minimize camera shake)
- Relevant content only
Caption best practices:
- Be descriptive but concise (1-3 sentences)
- Describe what's visible: subjects, actions, setting, style
- Include motion descriptions: "walking", "camera pans left"
- Note visual style: "cinematic lighting", "anime style"
- Use consistent terminology across captions
Good caption:
A golden retriever running through a sunlit meadow, wildflowers swaying. The camera follows at a low angle. Warm golden hour lighting.
Poor caption:
dog running
Using trigger phrases:
For subject/object LoRAs (e.g., a specific character or product), use a unique token in your captions to represent the subject, and set it as your trigger phrase. For example, if training on videos of a specific person, use a made-up token like "sks" or "ohwx" in your captions: "A sks person walking down the street." This helps the model associate the learned appearance with a specific word you can use at inference time.
For style LoRAs (e.g., anime style, vintage film look), use a phrase that describes the style as your trigger, such as "in mystyle" or "vintage8mm style". Your captions would then be: "in mystyle A cat sitting on a windowsill" so the model learns to associate the visual style with that phrase.
Captions and scene splitting:
Be careful when using scene splitting with detailed captions. Since each scene inherits the original video's caption, the caption may no longer accurately describe the content after splitting. For example, if your caption says "A woman walks through a park, then sits on a bench" and the video is split into two scenes, both scenes will have this caption—but the first scene might only show walking, and the second only sitting. Consider using more general captions that apply to all scenes, or disable scene splitting if your captions are very specific.
Caption format at inference:
When using your trained LoRA, write prompts in the same style as your training captions. If your training captions were detailed and descriptive, use detailed prompts. If they were short, use short prompts. The model learns to associate your style/subject with the caption patterns it saw during training, so matching that format helps activate the LoRA effectively.
Recommended Starting Configuration
json{ "rank": 32, "number_of_steps": 2000, "learning_rate": 2e-4, "number_of_frames": 89, "resolution": "medium", "first_frame_conditioning_p": 0.5, "split_input_into_scenes": true, "audio_normalize": true, "audio_preserve_pitch": true }
Diagnosing Issues
Overfitting signs:
- Validation videos look exactly like training data
- Artifacts or "copied" training frames
Solutions: Reduce steps, reduce rank, increase dataset diversity
Underfitting signs:
- LoRA has minimal effect on generated videos
- Style/subject not learned
Solutions: Increase steps, increase rank, improve captions
Training failures or slow training:
- OOM crashes
- Very slow training
Solutions: Reduce resolution to "low" or "medium", reduce frame count, reduce rank
Validation Prompt Tips
Write prompts that:
- Use the trigger phrase
- Test different scenarios
- Mix simple and complex compositions
Example for a style LoRA:
json[ {"prompt": "in mystyle A woman walking through a city at night, neon reflections"}, {"prompt": "in mystyle A bird flying over a mountain lake at sunrise"} ]
Audio Processing Tips
Dataset preparation for audio:
- Ensure source videos have clear audio without excessive background noise
- Audio should be properly synced with video content
- Stereo audio works best (mono is automatically converted)
Understanding audio preprocessing:
- Videos with varying durations are time-stretched to match the target frame count
- With `audio_preserve_pitch=true`, a 3-second clip stretched to 4 seconds will sound 25% slower but at the same pitch
- With `audio_normalize=true`, a quiet recording and a loud recording will both have similar peak levels after preprocessing
When to adjust defaults:
- Disable `audio_normalize` if volume variations are intentional (e.g., ASMR content with whispers)
- Disable `audio_preserve_pitch` only if you don't care about natural audio quality or if your clips already match the target duration exactly
Common Pitfalls
- Too few samples: Fewer than 10 videos rarely work well
- Inconsistent captions: Generic/random captions hurt learning
- Mixed content: Unrelated videos dilute the learned concept
- Too many steps: Watch validation—more steps can mean overfitting
- Wrong aspect ratio: Train with the ratio you'll generate at
- Ignoring validation: Always monitor validation outputs
- Inconsistent audio levels: If your dataset has videos recorded at vastly different volumes, training may be inconsistent. Enable `audio_normalize` (default) to fix this.
- Unnatural audio speed: If videos are resized/trimmed and `audio_preserve_pitch` is disabled, speech may sound chipmunk-like or have awkward silence. Keep it enabled (default) for natural-sounding audio.