LTX 2.3 Trainer (V2) - Forward Video Extension (Training) API on fal

LTX 2.3 Trainer — Forward Video Extension (`/extend-prefix`)

Overview

The `/extend-prefix` endpoint trains a LoRA for the LTX 2.3 model that continues a video forward in time. During training, the first N frames of each clip are kept as a clean "prefix" and the model learns to generate everything that follows. At inference you supply an opening clip and the model produces the continuation in the learned style.

Key features:

Learns to extend video forward from a clean opening (prefix) window.
Optional joint audio extension: when enabled, audio is continued in sync with the video from the same opening window.
Trains on plain videos (plus optional captions) — the prefix is carved from each clip automatically.
Validation previews continue a supplied clip forward.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:

Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (there is nothing to extend over time on a still). If you enable audio extension, every training clip — and every validation clip — must contain an audio track. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.
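
As a rough illustration of the layout described above, the following sketch packs a local folder into a dataset archive. The function name `build_dataset_zip` and the folder layout are hypothetical; only the accepted extensions and the same-base-name caption rule come from this page.

```python
import zipfile
from pathlib import Path

# Video extensions accepted by the trainer, as listed in Dataset Format.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def build_dataset_zip(source_dir: str, zip_path: str) -> list[str]:
    """Zip every supported video under source_dir plus its same-named .txt caption.

    Returns the archive member names in the order they were written.
    (Hypothetical helper, not part of any fal SDK.)
    """
    root = Path(source_dir)
    written: list[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for video in sorted(root.rglob("*")):
            if not video.is_file() or video.suffix.lower() not in VIDEO_EXTENSIONS:
                continue  # skips directories, images, and stray non-video files
            members = [video]
            caption = video.with_suffix(".txt")
            if caption.is_file():
                members.append(caption)  # caption is optional but recommended
            for member in members:
                # Keep the subfolder in the member name so same-named clips stay distinct.
                name = member.relative_to(root).as_posix()
                archive.write(member, name)
                written.append(name)
    return written
```

Videos are already compressed, so the sketch stores members without additional compression (`ZIP_STORED`); that choice is a convenience, not a requirement stated by the endpoint.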

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
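
The arithmetic behind the minimum clip length can be sketched as follows. Both helper names are hypothetical, and the skip/fail behavior is only mirrored from the description above, not taken from the service's implementation.

```python
def min_clip_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest clip duration that still holds number_of_frames frames."""
    return number_of_frames / frame_rate


def trainable_clips(frame_counts: dict[str, int],
                    number_of_frames: int = 89,
                    auto_scale_input: bool = False) -> list[str]:
    """Return the clip names that would be kept, per the rule described above.

    Assumption: with auto_scale_input on, every clip is resampled and kept.
    Raises ValueError when nothing is left, mirroring the 422 case.
    """
    if auto_scale_input:
        kept = list(frame_counts)
    else:
        kept = [name for name, count in frame_counts.items()
                if count >= number_of_frames]
    if not kept:
        raise ValueError("All training videos are too short to be trainable")
    return kept
```

With the defaults, `min_clip_seconds()` evaluates to 89 / 24, about 3.71 seconds.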

With `with_audio` enabled, every training clip and every validation clip must contain an audio track; if any clip lacks one, the request is rejected (HTTP 422). With `with_audio` off, audio is ignored and the output is silent.

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_frames`

Type: `integer` Default: `9` (range `9`–`121`)

Number of leading frames kept as the clean prefix the model continues from. Must be `≡ 1 (mod 8)` (e.g. 9, 17, 25). It must be short enough that there is a continuation left to generate — i.e. the prefix cannot cover the whole `number_of_frames` (or `validation_number_of_frames`) clip.

Value	Behavior
9	Short prefix — the model generates most of the clip (default)
17–25	Longer context window before the generated continuation
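
The constraints on `conditioning_frames` can be checked locally before submitting. This is a minimal sketch; the function name and the wording of the messages are hypothetical, and the server's own validation may differ in detail.

```python
def check_conditioning_frames(conditioning_frames: int,
                              number_of_frames: int = 89,
                              validation_number_of_frames: int = 89) -> list[str]:
    """Return a list of human-readable problems; an empty list means the value looks valid."""
    problems = []
    if not 9 <= conditioning_frames <= 121:
        problems.append("conditioning_frames must be in the range 9-121")
    if conditioning_frames % 8 != 1:
        problems.append("conditioning_frames must be congruent to 1 mod 8 (9, 17, 25, ...)")
    if conditioning_frames >= number_of_frames:
        problems.append("prefix covers the whole training clip; nothing left to generate")
    if conditioning_frames >= validation_number_of_frames:
        problems.append("prefix covers the whole validation clip; nothing left to preview")
    return problems
```

For example, `check_conditioning_frames(9)` reports no problems, while `check_conditioning_frames(10)` flags the mod-8 rule.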

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1` (other values are snapped down to the nearest valid count) and must be larger than the prefix window.
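
One way to read "snapped down to the nearest valid count" is rounding down to the closest value of the form 8k + 1. The sketch below assumes that interpretation; `snap_frames` is a hypothetical helper, not the service's code.

```python
def snap_frames(requested: int) -> int:
    """Round a frame count down to the nearest value satisfying frames % 8 == 1."""
    if not 9 <= requested <= 121:
        raise ValueError("number_of_frames must be in the range 9-121")
    return ((requested - 1) // 8) * 8 + 1
```

Under this reading, a request for 96 frames trains on 89, and a request for 97 is left unchanged.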

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

Resolution	16:9	1:1	9:16
low	512×288	512×512	288×512
medium	768×448	768×768	448×768
high	960×544	960×960	544×960

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
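
The resolution table can be expressed as a small lookup, which is handy when pre-cropping clips or estimating memory use. The name `bucket_size` is hypothetical; the pixel sizes are copied from the table above.

```python
# (width, height) for each resolution / aspect_ratio pair, from the table above.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> tuple[int, int]:
    """Return the (width, height) bucket for the given settings."""
    try:
        return BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(
            f"unsupported combination: resolution={resolution!r}, "
            f"aspect_ratio={aspect_ratio!r}"
        ) from None
```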

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Audio Configuration

`with_audio`

Type: `boolean` Default: `false`

Set `true` to jointly extend audio and video — the model continues both from the same opening window so they stay in sync. Requires every training clip and every validation clip to contain an audio track. Default (`false`) produces a video-only, silent extension.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness (used when audio extension is enabled).

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — a video to extend forward. Its first `conditioning_frames` frames are used as the prefix.

The validation clip must be at least `conditioning_frames` frames long (at the validation frame rate).
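
A small builder can enforce the two structural rules stated here (at most 2 entries, `video_url` required) before a request is sent. `make_validation` is a hypothetical helper and the URLs in the usage line are placeholders.

```python
def make_validation(samples: list[tuple[str, str]]) -> list[dict]:
    """Build the `validation` array from (prompt, video_url) pairs."""
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    entries = []
    for prompt, video_url in samples:
        if not video_url:
            raise ValueError("each validation entry requires a video_url")
        entries.append({"prompt": prompt, "video_url": video_url})
    return entries
```

Usage: `make_validation([("a car accelerating down a road", "https://example.com/opening.mp4")])`.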

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Must be larger than the prefix window so there is a continuation to preview.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed as `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
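
The billing rule reduces to a one-line function. The name `billable_units` and the `completed` flag are hypothetical; the formula is the one stated above.

```python
def billable_units(number_of_steps: int, completed: bool = True) -> int:
    """Units billed for a run: max(100, steps) on success, 0 if it failed early."""
    return max(100, number_of_steps) if completed else 0
```

The default configuration of 2000 steps therefore bills 2000 units on success.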

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted, clips are fit to the resolution bucket (and optionally scene-split), and audio is prepared when audio extension is enabled.
Training: for each clip, the first `conditioning_frames` frames are held clean as the prefix and the model learns to generate the rest. With audio extension on, the matching opening audio window conditions the audio so it continues in sync. Validation previews run at intervals.
Output: the LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
Prefix carving: the leading `conditioning_frames` frames of each clip are kept clean as conditioning; the remainder is the generation target. This happens automatically — you do not need to mark anything.
Audio window: when audio extension is on, the matching opening seconds of audio are used as the audio prefix so audio and video stay aligned.
Captions: the trigger phrase (if set) is prepended.
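
Prefix carving amounts to splitting each clip at `conditioning_frames`. The sketch below shows that split on a plain list of frames; it also assumes the audio prefix lasts `conditioning_frames / frame_rate` seconds, which is an inference from "matching opening seconds" rather than a documented formula. Both function names are hypothetical.

```python
def carve_prefix(frames: list, conditioning_frames: int = 9) -> tuple[list, list]:
    """Split a clip into (clean prefix, generation target)."""
    if conditioning_frames >= len(frames):
        raise ValueError("prefix covers the whole clip; nothing left to generate")
    return frames[:conditioning_frames], frames[conditioning_frames:]


def audio_prefix_seconds(conditioning_frames: int = 9, frame_rate: int = 24) -> float:
    """Assumed duration of the audio window aligned with the video prefix."""
    return conditioning_frames / frame_rate
```

With the defaults, an 89-frame clip yields a 9-frame prefix and an 80-frame target, and the assumed audio window is 0.375 seconds.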

Tips for Getting Good Results

Dataset Quality

Use at least 10 clips that contain the kind of forward motion/continuation you want learned.
Clips should be long enough that there is a meaningful continuation beyond the prefix window.
For audio extension, ensure every clip carries clean, in-sync audio.

Caption Best Practices

Describe the action/continuation plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a car accelerating down a straight road`
Weak caption: `car`

Trigger Phrases

Use a distinctive trigger phrase to invoke a particular continuation style; include it in every caption and at inference.

Inference Format Matching

At inference, supply an opening clip at least `conditioning_frames` frames long, use the same caption style and trigger phrase, and match the audio setting (audio-on extension expects an audio track).

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_frames": 9,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "with_audio": false,
  "validation": [
    { "prompt": "a car accelerating down a road", "video_url": "https://example.com/opening.mp4" }
  ]
}
```
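
A configuration like the one above could be submitted from Python roughly as follows. This is a sketch under assumptions: it uses the `fal-client` package's `subscribe` call, expects a `FAL_KEY` in the environment, and takes the endpoint ID `fal-ai/ltx23-trainer-v2/extend-prefix` from this page. `build_arguments` and `submit` are hypothetical helpers, and the shape of the returned result is not specified here beyond the fields listed under Outputs.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/extend-prefix"


def build_arguments(training_data_url: str, opening_clip_url: str,
                    prompt: str, **overrides) -> dict:
    """Start from the recommended configuration and apply any overrides."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_frames": 9,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "with_audio": False,
        "validation": [{"prompt": prompt, "video_url": opening_clip_url}],
    }
    arguments.update(overrides)
    return arguments


def submit(arguments: dict):
    """Send the training request and block until it finishes.

    Imported lazily so the builder above works without the SDK installed.
    Assumes `pip install fal-client` and FAL_KEY set in the environment.
    """
    import fal_client

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

Calling `submit(build_arguments(...))` starts a billable training run, so it is worth validating the arguments locally first.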

Diagnosing Issues

Continuation drifts or stops matching the prefix: add more representative clips; try a slightly longer `conditioning_frames` for more context.
Validation rejected for short clip: provide a validation clip at least `conditioning_frames` frames long, or lower `conditioning_frames`.
Audio extension rejected: when `with_audio` is true, every training and validation clip must have an audio track.
Overfitting: reduce `number_of_steps`, lower `rank`, or add more clips.

Validation Prompt Tips

Use a fresh opening clip to gauge generalization.
Keep the prompt aligned with the continuation you expect.

Common Pitfalls

`conditioning_frames` not `≡ 1 (mod 8)` (use 9, 17, 25, ...).
A prefix window so long it covers the whole clip (nothing left to generate).
Enabling audio extension with silent clips.
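
To catch silent clips before enabling `with_audio`, each file can be probed locally. The sketch below assumes the `ffprobe` command-line tool (part of FFmpeg) is installed; the helper names and the injectable `run` parameter are hypothetical and exist only to keep the function testable.

```python
import subprocess


def has_audio_track(path: str, run=subprocess.run) -> bool:
    """Return True if ffprobe reports at least one audio stream in the file."""
    result = run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=index", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    # ffprobe prints one line per matching stream; empty output means no audio.
    return bool(result.stdout.strip())


def clips_missing_audio(paths: list[str], probe=has_audio_track) -> list[str]:
    """List the clips that would cause an audio-extension request to be rejected."""
    return [path for path in paths if not probe(path)]
```

An empty result from `clips_missing_audio` suggests the dataset meets the audio requirement, although the service's own check remains authoritative.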

Endpoint ID: `fal-ai/ltx23-trainer-v2/extend-prefix`
