LTX 2.3 Trainer (V2) - Text-to-Video (Training) API on fal

LTX 2.3 Trainer — Text-to-Video (`/t2v`)

Overview

The `/t2v` endpoint trains a LoRA adapter for the LTX 2.3 video model on your own clips, so the model learns a new subject, character, object, or visual style that you can then summon at inference time with a text prompt. There is no conditioning asset — the model generates a full video from text alone.

Key features:

Learns a subject, object, or style from a small set of your own videos (or images).
Optional joint audio training: if your clips have sound, the LoRA can learn the matching soundtrack too. By default, audio is auto-detected and used only if every clip has an audio track.
Trains on videos or on still images.
Optional trigger phrase to activate the learned concept on demand.
Validation previews generated during training so you can watch the LoRA take shape.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) containing your clips and, optionally, captions:

Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
Images: `.png`, `.jpg`, `.jpeg`
Captions: a `.txt` file with the same base name as each media file (e.g. `clip01.mp4` + `clip01.txt`). Captions are optional but strongly recommended.

The archive must contain only videos OR only images — mixed datasets are rejected. Aim for at least 10 files; more is generally better. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.
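The archive rules above can be checked locally before uploading. The following is a minimal sketch, not the trainer's own validation: `check_dataset` is a hypothetical helper, and case-insensitive extension matching is an assumption.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def check_dataset(names):
    """Classify archive member names and list media files without a caption.

    Returns (kind, missing_captions), where kind is "video" or "image".
    Raises ValueError for an empty or mixed dataset.
    """
    # Skip macOS metadata folders; the trainer is documented to ignore them.
    paths = [PurePosixPath(n) for n in names]
    paths = [p for p in paths if "__MACOSX" not in p.parts]
    # Assumption: extensions are matched case-insensitively.
    videos = [p for p in paths if p.suffix.lower() in VIDEO_EXTS]
    images = [p for p in paths if p.suffix.lower() in IMAGE_EXTS]
    if videos and images:
        raise ValueError("archive mixes videos and images")
    media = videos or images
    if not media:
        raise ValueError("archive contains no supported media files")
    # A caption is a .txt file with the same folder and base name as the media file.
    captions = {p.with_suffix("") for p in paths if p.suffix.lower() == ".txt"}
    missing = sorted(str(p) for p in media if p.with_suffix("") not in captions)
    return ("video" if videos else "image"), missing
```

To run it against a real archive, pass `zipfile.ZipFile("my_dataset.zip").namelist()` as `names`.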

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails with HTTP 422 ("All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead. This applies to video datasets only; images are single-frame and always usable.
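The minimum-length rule comes down to simple arithmetic. This sketch assumes you already know each clip's frame count; the helper names are hypothetical, and the real trainer may count frames differently (for example, after frame-rate conversion).

```python
def min_duration_seconds(number_of_frames=89, frame_rate=24):
    """Shortest clip duration that still supplies number_of_frames frames."""
    return number_of_frames / frame_rate


def usable_clips(frame_counts, number_of_frames=89, auto_scale_input=False):
    """Return the clip names that would be kept, given a name -> frame count mapping."""
    if auto_scale_input:
        # Shorter clips are resampled to the target frame count instead of being skipped.
        return sorted(frame_counts)
    kept = sorted(name for name, count in frame_counts.items() if count >= number_of_frames)
    if not kept:
        # Mirrors the documented 422 error message.
        raise ValueError("All training videos are too short to be trainable")
    return kept
```

With the defaults, `min_duration_seconds()` is 89 / 24, about 3.7 seconds.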

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of training clips (and optional captions). See "Dataset Format" above.

`trigger_phrase`

Type: `string` Default: `""`

A phrase prepended to every caption during training. At inference, including this phrase activates the learned concept. Leave empty when teaching a general style you always want applied.
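As a rough illustration of how the effective training caption is formed, the sketch below prepends the trigger phrase. Joining with a single space is an assumption, since the separator is not specified here, and `build_caption` is a hypothetical helper.

```python
def build_caption(caption, trigger_phrase=""):
    """Prepend the trigger phrase to a caption; either part may be empty or None."""
    parts = [(trigger_phrase or "").strip(), (caption or "").strip()]
    # Assumption: trigger phrase and caption are joined by a single space.
    return " ".join(part for part in parts if part)
```

A media file without a caption ends up with just the trigger phrase, or an empty caption when no trigger phrase is set.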

Training Parameters

`rank`

Type: `integer` (one of `8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity. Higher values can capture more detail but use more memory and are more prone to overfitting on small datasets.

| Value | Use Case |
| --- | --- |
| 8–16 | Small datasets, subtle styles, lower risk of overfitting |
| 32 | Balanced default |
| 64–128 | Larger datasets or complex subjects with lots of variation |

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`, in increments of 100)

How many optimization steps to run. More steps means more learning but also more time and a higher chance of overfitting.

`learning_rate`

Type: `number` Default: `0.0002`

How aggressively the model updates each step. The default is a sensible starting point; raise it cautiously and lower it if results look unstable or degraded.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Valid counts satisfy `frames % 8 == 1` (e.g. 9, 17, 25, 33, 89, 121); any other value is automatically snapped down to the nearest valid count. Has no effect on image datasets.
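The snapping rule can be written in one line. In this sketch, clamping to the documented 9 to 121 range first is an assumption; the API may instead reject out-of-range values.

```python
def snap_frames(n, lo=9, hi=121):
    """Snap a frame count down to the nearest value with frames % 8 == 1."""
    # Assumption: out-of-range inputs are clamped rather than rejected.
    n = max(lo, min(hi, n))
    return ((n - 1) // 8) * 8 + 1
```

For example, `snap_frames(96)` gives 89 and `snap_frames(97)` stays at 97.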

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second for training clips. LTX 2.3's native rate is 24.

`resolution`

Type: `string` (one of `low`, `medium`, `high`) Default: `medium`

Training resolution bucket. Combined with `aspect_ratio` this picks the exact pixel size:

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (one of `16:9`, `1:1`, `9:16`) Default: `1:1`

Aspect ratio for training clips. See the table above.
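If you need the bucket size in your own tooling (for example, to pre-crop clips), the resolution table can be transcribed as a lookup. `BUCKETS` and `bucket_size` are hypothetical names; the values are copied from the table.

```python
# (width, height) in pixels for each resolution bucket and aspect ratio.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution="medium", aspect_ratio="1:1"):
    """Return the (width, height) the trainer targets for this combination."""
    return BUCKETS[resolution][aspect_ratio]
```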

`auto_scale_input`

Type: `boolean` Default: `false`

When true, videos are automatically fit to the target frame count and frame rate. No effect on image datasets.

`split_input_into_scenes`

Type: `boolean` Default: `true`

When true, videos longer than the duration threshold are automatically split into separate scenes (shots) before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Videos longer than this many seconds are eligible for scene splitting.

Audio Configuration

`with_audio`

Type: `boolean` or `null` Default: `null` (auto-detect)

Controls joint audio-video training, all-or-nothing across the dataset. With `null` (default, auto-detect), audio is enabled only when every clip has an audio track — if even one clip is silent, the whole run trains video-only. Set `true` to force audio training (the request fails with a 422 if any clip lacks an audio track), or `false` to ignore audio even when present.
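The three `with_audio` modes can be summarised as a small decision function. This is a sketch of the documented behaviour, not the trainer's code; `resolve_audio` and its inputs are hypothetical.

```python
def resolve_audio(with_audio, clips_have_audio):
    """Decide whether a run trains with audio.

    with_audio: True, False, or None (auto-detect).
    clips_have_audio: one boolean per clip, True if the clip has an audio track.
    """
    all_have_audio = bool(clips_have_audio) and all(clips_have_audio)
    if with_audio is None:
        # Auto-detect: audio is used only when every clip has a track.
        return all_have_audio
    if with_audio and not all_have_audio:
        # Forcing audio on with a silent clip is documented to fail with a 422.
        raise ValueError("with_audio=true but at least one clip has no audio track")
    return bool(with_audio)
```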

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

When fitting audio to video duration, keep the original pitch (time-stretch) instead of trimming or padding.

Validation

Validation generates preview videos at intervals during training so you can monitor progress. Previews are single-stage approximations and will not perfectly match final production-quality inference.

`validation`

Type: `array` Default: `[]` (max 2 entries)

A list of validation prompts. Each entry is an object with:

`prompt` (`string`) — the text prompt to preview.

`validation_negative_prompt`

Type: `string` Default: a long built-in quality negative prompt.

Negative prompt applied to all validation previews.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frame count for validation videos. Snapped to `frames % 8 == 1`.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Frame rate for validation videos.

`validation_resolution`

Type: `string` Default: `high`

Resolution bucket for validation videos (same buckets as `resolution`).

`validation_aspect_ratio`

Type: `string` Default: `1:1`

Aspect ratio for validation videos.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

Spatio-Temporal Guidance scale for validation previews. `0.0` disables it; `1.0` is recommended.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`). This is the main artifact.
`config_file` — a small JSON describing the trigger phrase and training type, for setting up inference.
`video` — a combined preview reel of all validation samples (when validation prompts were provided).
`debug_dataset` — a downloadable archive of your preprocessed data, only present when `debug_dataset` is enabled.

`debug_dataset`

Type: `boolean` Default: `false`

When enabled, returns an archive of the preprocessed training data so you can verify your videos, images, and captions were processed correctly before committing to a longer run.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
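As a quick way to estimate usage, the billing rule can be expressed directly. `billable_units` is a hypothetical helper, and the price per unit is not covered here.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for one run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0
```

Because `number_of_steps` has a documented minimum of 100, a successful run is in practice billed one unit per step.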

How the Training Works

Pipeline Overview

Preprocessing
1. Your archive is downloaded and extracted.
2. Media is matched to captions by file name.
3. Each clip is resized and cropped to the chosen resolution bucket and (optionally) split into scenes.
4. Audio is detected (or you force it on/off) and prepared.
Training — the LoRA is trained for `number_of_steps`, running validation previews at intervals.
Output — the final LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
File matching: each media file is paired with the `.txt` of the same base name. Files without a caption train on an empty caption (or just the trigger phrase, if set).
Video fitting: each clip is resized to fill the target resolution while keeping aspect ratio, then center-cropped to the exact bucket dimensions. With `auto_scale_input`, clips are resampled to the target frame rate and frame count.
Scene splitting: with `split_input_into_scenes`, clips longer than the threshold are cut into separate shots, each becoming its own training sample.
Captions: the trigger phrase (if any) is prepended to each caption.
Audio: when audio is enabled, each clip's soundtrack is prepared (normalized and pitch-fit as configured) and learned jointly with the video.
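The video-fitting step (resize to fill, then center-crop) can be sketched as plain geometry. The rounding choices below are assumptions and may differ by a pixel from what the trainer does; `fill_and_center_crop` is a hypothetical helper.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Compute the scaled size and crop box for a fill-then-center-crop fit.

    Returns ((scaled_w, scaled_h), (left, top, right, bottom)).
    """
    # Scale so that both dimensions cover the target, keeping aspect ratio.
    scale = max(dst_w / src_w, dst_h / src_h)
    # Assumption: round to the nearest pixel, never below the target size.
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the scaled frame.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)
```

A 1920×1080 source fitted to the 768×768 bucket loses a substantial strip on each side, so keep the subject near the center of the frame when training at `1:1`.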

LoRA Training

A LoRA is a small set of adapter weights layered on top of the frozen base model. Training only updates these adapters, so the result is a compact file you load alongside the base LTX 2.3 model at inference. The base model's general capabilities are preserved; the LoRA nudges it toward your subject or style.

Audio Support

If all your videos have sound and audio is enabled, the LoRA learns the audio-video relationship, so generated videos can come with a matching soundtrack. Audio is all-or-nothing across the dataset: if any clip is silent, the whole run trains video-only. Leave `with_audio` at its default to let the trainer enable audio only when every clip has a track.

Tips for Getting Good Results

Dataset Quality

Use at least 10 clips; 20–50 varied clips often work better for a robust concept.
Keep quality high: sharp, well-lit, representative footage. The LoRA reproduces whatever artifacts are common in your data.
For a subject/character/object, show it from multiple angles and in different contexts.
For a style, include varied content all sharing the same look.

Caption Best Practices

Describe what is actually in each clip, plainly and specifically.
Subject/object training: caption the subject with your trigger phrase plus a plain description, e.g. `tronlog0 a glowing blue logo spinning on a desk`.
Style training: describe the content and let the style be learned implicitly; optionally use a style trigger phrase.

Good caption: `a red sports car drives along a coastal highway at sunset`
Weak caption: `car` (too sparse to anchor the concept)

Trigger Phrases

For a specific subject/object/character, pick a rare, distinctive trigger token (e.g. `tronlog0`) so it does not collide with words the model already knows, and include it in every caption.
For an always-on style, you can skip the trigger phrase entirely.

Scene Splitting and Captions

If `split_input_into_scenes` is on, one long video becomes several shorter clips that all share the original caption. If different parts of the video show different things, the shared caption may not describe each split accurately. For precise captions, pre-split your clips and disable scene splitting.

Inference Format Matching

Prompt the trained LoRA the same way you captioned it. If you trained with a trigger phrase, include that phrase at inference. If your captions were short and descriptive, short descriptive prompts will behave most predictably.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/my_dataset.zip",
  "trigger_phrase": "tronlog0",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "tronlog0 spinning slowly on a dark background" }
  ]
}
```
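One way to use this configuration from Python is sketched below. It checks the values against the ranges documented above and then submits the job. It assumes the third-party `fal_client` package is installed with a `FAL_KEY` environment variable set, and that the endpoint ID is the one shown on this page; `validate_config` and `submit_training` are hypothetical helpers, and the local checks are not a substitute for the API's own validation.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/t2v"  # endpoint ID as shown on this page


def validate_config(cfg):
    """Return a list of problems found in cfg; an empty list means none were found."""
    errors = []
    if not cfg.get("training_data_url"):
        errors.append("training_data_url is required")
    if cfg.get("rank", 32) not in (8, 16, 32, 64, 128):
        errors.append("rank must be one of 8, 16, 32, 64, 128")
    steps = cfg.get("number_of_steps", 2000)
    if not 100 <= steps <= 20000 or steps % 100 != 0:
        errors.append("number_of_steps must be 100-20000 in increments of 100")
    if not 9 <= cfg.get("number_of_frames", 89) <= 121:
        errors.append("number_of_frames must be between 9 and 121")
    if not 8 <= cfg.get("frame_rate", 24) <= 60:
        errors.append("frame_rate must be between 8 and 60")
    if cfg.get("resolution", "medium") not in ("low", "medium", "high"):
        errors.append("resolution must be low, medium, or high")
    if cfg.get("aspect_ratio", "1:1") not in ("16:9", "1:1", "9:16"):
        errors.append("aspect_ratio must be 16:9, 1:1, or 9:16")
    if len(cfg.get("validation", [])) > 2:
        errors.append("validation accepts at most 2 entries")
    return errors


def submit_training(cfg):
    """Validate cfg locally, then run the training job and wait for the result."""
    # Assumption: fal_client is installed (pip install fal-client) and FAL_KEY is set.
    import fal_client

    errors = validate_config(cfg)
    if errors:
        raise ValueError("; ".join(errors))
    return fal_client.subscribe(ENDPOINT, arguments=cfg, with_logs=True)
```

The returned result should carry the artifacts listed under Outputs, such as `lora_file`; the exact response shape is not specified here, so inspect it before relying on particular keys.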

Diagnosing Issues

Overfitting (validation previews look like exact copies of training clips, or ignore the prompt): reduce `number_of_steps`, lower `rank`, or add more varied data.
Underfitting (the concept barely appears): increase `number_of_steps`, raise `rank`, or improve caption quality.
No effect at all: confirm the trigger phrase is in your captions and used in validation prompts.
Training fails on the dataset: check that the archive contains only videos or only images, and that captions match media base names.

Validation Prompt Tips

Use prompts in the same style as your captions, including the trigger phrase.
Pick prompts that exercise what you care about (e.g. the subject in a new setting) so you can judge generalization, not memorization.

Common Pitfalls

Mixing images and videos in one archive (not supported).
Expecting audio when your clips are silent.
Forgetting the trigger phrase at inference.

Endpoint ID: `fal-ai/ltx23-trainer-v2/t2v`
