fal-ai/ltx23-trainer-v2/extend-prefix

Train a LoRA that continues a video forward in time — supply an opening clip at inference and the model generates what comes next.
Training
Commercial use

Training cost scales with the number of steps: cost (USD) = 0.0062 × steps. For example, 1000 steps cost $6.20.

LTX 2.3 Trainer — Forward Video Extension (`/extend-prefix`)

Overview

The `/extend-prefix` endpoint trains a LoRA for the LTX 2.3 model that continues a video forward in time. During training, the first N frames of each clip are kept as a clean "prefix" and the model learns to generate everything that follows. At inference you supply an opening clip and the model produces the continuation in the learned style.

Key features:

  • Learns to extend video forward from a clean opening (prefix) window.
  • Optional joint audio extension: when enabled, audio is continued in sync with the video from the same opening window.
  • Trains on plain videos (plus optional captions) — the prefix is carved from each clip automatically.
  • Validation previews continue a supplied clip forward.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:

  • Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
  • Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (there is nothing to extend over time on a still). If you enable audio extension, every training clip — and every validation clip — must contain an audio track. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
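
The length rule above can be sketched as a small pre-flight check. This is a minimal illustration, not the service's implementation: `partition_clips` and its dict-of-frame-counts input are hypothetical, and how you measure frame counts locally is left to you.

```python
def partition_clips(frame_counts, number_of_frames=89, auto_scale_input=False):
    """Return (kept, skipped) clip names under the documented length rule.

    frame_counts: dict mapping clip name -> frame count (hypothetical input shape).
    """
    if auto_scale_input:
        # With auto-scaling on, short clips are resampled rather than dropped.
        return sorted(frame_counts), []
    kept = sorted(name for name, count in frame_counts.items() if count >= number_of_frames)
    skipped = sorted(name for name, count in frame_counts.items() if count < number_of_frames)
    if not kept:
        # Mirrors the documented 422 failure when no clip is long enough.
        raise ValueError("All training videos are too short to be trainable")
    return kept, skipped
```

Running a check like this before upload makes silently skipped clips visible, since the service does not report them individually.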

With `with_audio` enabled, joint audio extension requires every training clip and every validation clip to contain an audio track; if any clip lacks one, the request is rejected (HTTP 422). With `with_audio` off, audio is ignored and the output is silent.
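
Before zipping a dataset, the format rules in this section can be checked locally. The sketch below is an assumption-laden helper, not part of the trainer: `audit_dataset` is a hypothetical name, the image extension list is a guess at what counts as a still, and audio tracks are not inspected (that would need a media probe such as ffprobe).

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}  # formats listed in this section
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp"}  # assumed list of stills


def audit_dataset(root):
    """Walk a dataset folder and report videos, stray images, and missing captions."""
    root = Path(root)
    videos, images, missing_captions = [], [], []
    for path in sorted(root.rglob("*")):
        # Skip directories and macOS metadata folders.
        if not path.is_file() or "__MACOSX" in path.parts:
            continue
        ext = path.suffix.lower()
        rel = path.relative_to(root).as_posix()
        if ext in VIDEO_EXTS:
            videos.append(rel)
            # A caption shares the video's base name and lives next to it.
            if not path.with_suffix(".txt").exists():
                missing_captions.append(rel)
        elif ext in IMAGE_EXTS:
            images.append(rel)
    return {"videos": videos, "images": images, "missing_captions": missing_captions}
```

A report with a non-empty `images` list or fewer than 10 `videos` is worth fixing before upload; missing captions are allowed but not recommended.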

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.
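
To keep inference prompts consistent with training captions, the prepending step can be mirrored on the client side. This is only a sketch: `with_trigger` is a hypothetical helper, and the single-space separator is an assumption, since the exact joining rule used by the trainer is not documented here.

```python
def with_trigger(caption, trigger_phrase=""):
    """Prepend the trigger phrase to a caption (separator is an assumed single space)."""
    trigger = trigger_phrase.strip()
    caption = caption.strip()
    if not trigger:
        # The default empty trigger phrase leaves the caption unchanged.
        return caption
    return f"{trigger} {caption}"
```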

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_frames`

Type: `integer` Default: `9` (range `9`–`121`)

Number of leading frames kept as the clean prefix the model continues from. Must be `≡ 1 (mod 8)` (e.g. 9, 17, 25). It must be short enough that there is a continuation left to generate — i.e. the prefix cannot cover the whole `number_of_frames` (or `validation_number_of_frames`) clip.
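
The constraints on `conditioning_frames` can be checked before submitting a request. The function below is a minimal sketch with a hypothetical name; it encodes only the rules stated in this reference (range, modulo-8 rule, and leaving a continuation to generate).

```python
def check_conditioning_frames(conditioning_frames, number_of_frames=89,
                              validation_number_of_frames=89):
    """Return a list of problems; an empty list means the prefix length looks valid."""
    problems = []
    if not 9 <= conditioning_frames <= 121:
        problems.append("conditioning_frames must be in the range 9..121")
    if conditioning_frames % 8 != 1:
        problems.append("conditioning_frames must be congruent to 1 modulo 8")
    if conditioning_frames >= number_of_frames:
        problems.append("prefix covers the whole training clip")
    if conditioning_frames >= validation_number_of_frames:
        problems.append("prefix covers the whole validation clip")
    return problems
```

For example, `check_conditioning_frames(10)` reports the modulo rule, while 89 passes the modulo rule but leaves nothing to generate at the default clip lengths.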

| Value | Behavior |
| --- | --- |
| 9 | Short prefix: the model generates most of the clip (default) |
| 17–25 | Longer context window before the generated continuation |

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1` (other values are snapped down to the nearest valid count) and must be larger than the prefix window.
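
The snapping rule can be reproduced with integer arithmetic. This is a sketch of the rule as described, under the assumption that "snapped down" means rounding down to the largest value `n` with `n % 8 == 1`; `snap_frames` is a hypothetical name, and the documented range is not enforced here.

```python
def snap_frames(requested):
    """Round a frame count down to the nearest value n with n % 8 == 1."""
    if requested < 1:
        raise ValueError("frame count must be at least 1")
    # Drop the +1 offset, floor to a multiple of 8, then restore the offset.
    return ((requested - 1) // 8) * 8 + 1
```

Under this rule a request for 96 frames trains on 89 frames, while 97 is already valid and stays at 97.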

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
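
The resolution buckets listed for `resolution` can be expressed as a lookup keyed by resolution and aspect ratio. This is an illustrative sketch: `BUCKETS` and `bucket_size` are hypothetical names, and reading each pair as width × height is an assumption based on the 16:9 entries being wider than tall.

```python
# (width, height) per resolution and aspect ratio, copied from the documented values.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution="medium", aspect_ratio="1:1"):
    """Look up the target frame size for a resolution/aspect-ratio pair."""
    try:
        return BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from None
```

Knowing the bucket in advance helps when preparing clips, because videos are resized to fill the bucket and center-cropped.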

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Audio Configuration
`with_audio`

Type: `boolean` Default: `false`

Set `true` to jointly extend audio and video — the model continues both from the same opening window so they stay in sync. Requires every training clip and every validation clip to contain an audio track. Default (`false`) produces a video-only, silent extension.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness (used when audio extension is enabled).

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `video_url` (`string`, required) — a video to extend forward. Its first `conditioning_frames` frames are used as the prefix.

The validation clip must be at least `conditioning_frames` frames long (at the validation frame rate).

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Must be larger than the prefix window so there is a continuation to preview.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
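
A rough cost estimate can be derived from the billing rule. The sketch below assumes that one billable unit is priced at the per-step rate from the pricing note near the top of the page (0.0062 per step, $6.20 for 1000 steps); that link between units and price is an assumption, and the function names are hypothetical.

```python
PRICE_PER_UNIT_USD = 0.0062  # assumed: per-step price applied to each billable unit


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in USD, rounded to cents."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 2)
```

With the default of 2000 steps this estimate comes to $12.40, and a request rejected with HTTP 422 costs nothing.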

How the Training Works

Pipeline Overview
  1. Preprocessing: the archive is extracted, clips are fitted to the resolution bucket (and optionally scene-split), and audio is prepared when audio extension is enabled.
  2. Training: for each clip, the first `conditioning_frames` frames are held clean as the prefix and the model learns to generate the rest. With audio extension on, the matching opening audio window conditions the audio so that it continues in sync. Validation previews run at intervals.
  3. Output: the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
  • Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
  • Prefix carving: the leading `conditioning_frames` frames of each clip are kept clean as conditioning; the remainder is the generation target. This happens automatically — you do not need to mark anything.
  • Audio window: when audio extension is on, the matching opening seconds of audio are used as the audio prefix so audio and video stay aligned.
  • Captions: the trigger phrase (if set) is prepended.
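
The prefix-carving and audio-window steps above can be pictured as a simple split of each clip. This is a conceptual sketch only: `carve_prefix` is a hypothetical name, frames are represented as a plain list, and computing the audio prefix as `conditioning_frames / frame_rate` seconds is an assumption about what "matching opening seconds" means.

```python
def carve_prefix(frames, conditioning_frames=9, frame_rate=24):
    """Split a clip into (prefix, target, audio_prefix_seconds)."""
    if conditioning_frames >= len(frames):
        raise ValueError("prefix covers the whole clip; nothing left to generate")
    prefix = frames[:conditioning_frames]   # kept clean as conditioning
    target = frames[conditioning_frames:]   # what the model learns to generate
    # Assumed: the audio prefix spans the same wall-clock time as the video prefix.
    audio_prefix_seconds = conditioning_frames / frame_rate
    return prefix, target, audio_prefix_seconds
```

With the defaults (89 frames, 9-frame prefix, 24 fps), the model sees 9 clean frames, learns to generate the remaining 80, and the audio prefix under this assumption is 0.375 s.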

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 clips that contain the kind of forward motion/continuation you want learned.
  • Clips should be long enough that there is a meaningful continuation beyond the prefix window.
  • For audio extension, ensure every clip carries clean, in-sync audio.
Caption Best Practices
  • Describe the action/continuation plainly, optionally with a trigger phrase.
  • Keep captions consistent across clips.

Good caption: `a car accelerating down a straight road` Weak caption: `car`

Trigger Phrases
  • Use a distinctive trigger phrase to invoke a particular continuation style; include it in every caption and at inference.
Inference Format Matching

At inference, supply an opening clip at least `conditioning_frames` frames long, use the same caption style and trigger phrase, and match the audio setting (audio-on extension expects an audio track).

```json
{
  "training_data_url": "https://example.com/clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_frames": 9,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "with_audio": false,
  "validation": [
    { "prompt": "a car accelerating down a road", "video_url": "https://example.com/opening.mp4" }
  ]
}
```
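
A request body like the one above can be assembled and sanity-checked in code before it is sent. The sketch below is hypothetical client-side glue, not an official SDK call: `build_request` and `DEFAULTS` are illustrative names, the defaults are copied from this reference, and only a few of the documented constraints are checked. How the payload is submitted is left out.

```python
import json

ALLOWED_RANKS = (8, 16, 32, 64, 128)

# Defaults as listed in the parameter reference.
DEFAULTS = {
    "trigger_phrase": "",
    "rank": 32,
    "number_of_steps": 2000,
    "learning_rate": 0.0002,
    "conditioning_frames": 9,
    "number_of_frames": 89,
    "frame_rate": 24,
    "resolution": "medium",
    "aspect_ratio": "1:1",
    "with_audio": False,
}


def build_request(training_data_url, validation=(), **overrides):
    """Merge overrides into the documented defaults and check a few constraints."""
    if not training_data_url:
        raise ValueError("training_data_url is required")
    request = {"training_data_url": training_data_url, **DEFAULTS, **overrides}
    if request["rank"] not in ALLOWED_RANKS:
        raise ValueError(f"rank must be one of {ALLOWED_RANKS}")
    if not 100 <= request["number_of_steps"] <= 20000:
        raise ValueError("number_of_steps must be in the range 100..20000")
    validation = list(validation)
    if len(validation) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for entry in validation:
        if not entry.get("video_url"):
            raise ValueError("each validation entry needs a video_url")
    request["validation"] = validation
    return request
```

Calling `json.dumps(build_request("https://example.com/clips.zip"))` yields a JSON body with the same keys as the example, ready to pass to whichever HTTP or queue client you use.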
Diagnosing Issues
  • Continuation drifts or stops matching the prefix: add more representative clips; try a slightly longer `conditioning_frames` for more context.
  • Validation rejected for short clip: provide a validation clip at least `conditioning_frames` frames long, or lower `conditioning_frames`.
  • Audio extension rejected: when `with_audio` is true, every training and validation clip must have an audio track.
  • Overfitting: use fewer steps, a lower `rank`, or more clips.
Validation Prompt Tips
  • Use a fresh opening clip to gauge generalization.
  • Keep the prompt aligned with the continuation you expect.
Common Pitfalls
  • `conditioning_frames` not `≡ 1 (mod 8)` (use 9, 17, 25, ...).
  • A prefix window so long it covers the whole clip (nothing left to generate).
  • Enabling audio extension with silent clips.