LTX 2.3 Trainer (V2) - Backward Video Extension (Training) API on fal

LTX 2.3 Trainer — Backward Video Extension (`/extend-suffix`)

Overview

The `/extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to a video — it extends a clip backward in time. During training, the last N frames of each clip are kept as a clean "suffix" and the model learns to generate the frames that lead up to them. At inference you supply a closing clip and the model produces the preceding section.

Key features:

Learns to extend video backward, generating a plausible lead-in to a clean closing (suffix) window.
Optional joint audio extension: when enabled, audio is generated in sync with the video from the same closing window.
Trains on plain videos (plus optional captions) — the suffix is carved from each clip automatically.
Validation previews extend a supplied clip backward.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:

Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (there is nothing to extend over time on a still). If you enable audio extension, every training clip — and every validation clip — must contain an audio track. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.
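The layout described above can be packaged locally before upload. The following is a minimal sketch, assuming a local folder of clips; `build_training_zip` and its arguments are hypothetical names for illustration and are not part of the fal API.

```python
import zipfile
from pathlib import Path

# Video extensions accepted by the trainer, as listed above.
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}


def build_training_zip(dataset_dir: str, zip_path: str) -> int:
    """Zip every video plus its same-named .txt caption, if one exists.

    Hypothetical helper. Returns the number of clips written.
    """
    root = Path(dataset_dir)
    written = set()
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for video in sorted(root.rglob("*")):
            if not video.is_file() or video.suffix.lower() not in VIDEO_EXTS:
                continue  # images and other files are left out
            # Keep the subfolder in the archive name so same-named clips stay distinct.
            zf.write(video, video.relative_to(root).as_posix())
            count += 1
            caption = video.with_suffix(".txt")
            name = caption.relative_to(root).as_posix()
            if caption.is_file() and name not in written:
                zf.write(caption, name)
                written.add(name)
    return count
```

The resulting archive still has to be hosted somewhere reachable so its URL can be passed as `training_data_url`.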

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
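The length rule above reduces to a single comparison. The sketch below uses hypothetical helper names, and it assumes that any non-empty clip is accepted once `auto_scale_input` is on; the source only says that shorter clips are resampled in that case.

```python
def is_trainable(frame_count: int, number_of_frames: int = 89,
                 auto_scale_input: bool = False) -> bool:
    """Hypothetical pre-check mirroring the minimum clip length rule."""
    if auto_scale_input:
        # Assumption: resampling can stretch any non-empty clip to the target count.
        return frame_count > 0
    return frame_count >= number_of_frames


def min_duration_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest usable clip duration at the given frame rate."""
    return number_of_frames / frame_rate
```

With the defaults, `min_duration_seconds()` gives 89 / 24, about 3.7 s, matching the figure quoted above.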

With `with_audio` enabled, every training clip and every validation clip must contain an audio track; if any clip lacks one, the request is rejected (HTTP 422). With `with_audio` off, audio is ignored and the output is silent.
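Since a single silent clip causes a 422, it can help to check for audio tracks locally before uploading. This sketch assumes `ffprobe` (part of FFmpeg) is installed on your machine; it is a local convenience and says nothing about how the service itself inspects clips. Both function names are hypothetical.

```python
import json
import subprocess


def has_audio_stream(probe: dict) -> bool:
    """True if ffprobe's JSON stream list contains at least one audio stream."""
    return any(s.get("codec_type") == "audio" for s in probe.get("streams", []))


def probe_file(path: str) -> dict:
    """Run ffprobe on a local file and return its parsed JSON output."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

A typical use is `has_audio_stream(probe_file("clip.mp4"))` over every training and validation clip before setting `with_audio` to `true`.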

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_frames`

Type: `integer` Default: `8` (range `8`–`121`)

Number of trailing frames kept as the clean suffix the model leads up to. Must be a multiple of 8 (e.g. 8, 16, 24). It must also be short enough to leave a lead-in to generate: the suffix cannot cover the whole `number_of_frames` (or `validation_number_of_frames`) clip.

| Value | Behavior |
| --- | --- |
| 8 | Short suffix: the model generates most of the lead-in (default) |
| 16–24 | Longer closing context before the generated lead-in |
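The constraints on `conditioning_frames` can be checked before submitting a request. The following is a sketch of such a pre-flight check; `check_conditioning_frames` is a hypothetical helper, and the messages it returns are illustrative rather than the service's actual error text.

```python
def check_conditioning_frames(conditioning_frames: int,
                              number_of_frames: int = 89,
                              validation_number_of_frames: int = 89) -> list[str]:
    """Return a list of rule violations (empty when the value looks valid)."""
    problems = []
    if not 8 <= conditioning_frames <= 121:
        problems.append("conditioning_frames must be between 8 and 121")
    if conditioning_frames % 8 != 0:
        problems.append("conditioning_frames must be a multiple of 8")
    # The suffix must leave at least some lead-in frames to generate.
    for name, total in (("number_of_frames", number_of_frames),
                        ("validation_number_of_frames", validation_number_of_frames)):
        if conditioning_frames >= total:
            problems.append(f"suffix covers the whole {name} clip ({total} frames)")
    return problems
```

For example, `check_conditioning_frames(12)` reports the multiple-of-8 rule, while the default of 8 with 89-frame clips passes.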

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1` (other values are snapped down to the nearest valid count) and must be larger than the suffix window.
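Snapping down to the nearest count with `frames % 8 == 1` is a one-line computation. This sketch shows the arithmetic only; `snap_frame_count` is a hypothetical name, and how the service treats values outside the `9`–`121` range is not covered here.

```python
def snap_frame_count(frames: int) -> int:
    """Largest value <= frames that satisfies frames % 8 == 1."""
    return ((frames - 1) // 8) * 8 + 1
```

Under this rule 90 and 96 both become 89, while 89, 97 and 121 are already valid and stay unchanged.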

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
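The `resolution` and `aspect_ratio` settings together select one bucket from the table above. As a small sketch, the table can be expressed as a lookup; `RESOLUTIONS` and `bucket_size` are hypothetical names used only for illustration.

```python
# (width, height) per resolution and aspect ratio, copied from the table above.
RESOLUTIONS = {
    "low":    {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high":   {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> tuple[int, int]:
    """Return the (width, height) bucket; raises KeyError for unknown values."""
    return RESOLUTIONS[resolution][aspect_ratio]
```

With the defaults this returns `(768, 768)`.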

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Audio Configuration

`with_audio`

Type: `boolean` Default: `false`

Set `true` to jointly extend audio and video — the model generates both leading up to the same closing window so they stay in sync. Requires every training clip and every validation clip to contain an audio track. Default (`false`) produces a video-only, silent extension.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness (used when audio extension is enabled).

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — a video to extend backward. Its last `conditioning_frames` frames are used as the suffix.

The validation clip must be at least `conditioning_frames` frames long (at the validation frame rate).
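The shape of the `validation` array can be sketched as follows. `build_validation` and `min_validation_seconds` are hypothetical helpers; they only encode the limits stated above (at most 2 entries, `video_url` required, clip at least `conditioning_frames` frames long).

```python
def build_validation(samples: list[tuple[str, str]]) -> list[dict]:
    """Turn (prompt, video_url) pairs into validation entries."""
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    entries = []
    for prompt, video_url in samples:
        if not video_url:
            raise ValueError("video_url is required for every validation sample")
        entries.append({"prompt": prompt, "video_url": video_url})
    return entries


def min_validation_seconds(conditioning_frames: int = 8,
                           validation_frame_rate: int = 24) -> float:
    """Shortest validation clip duration that still covers the suffix window."""
    return conditioning_frames / validation_frame_rate
```

With the defaults, the validation clip needs at least 8 / 24, about 0.33 s, of footage at the validation frame rate.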

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Must be larger than the suffix window so there is a lead-in to preview.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed as `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
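The billing rule can be written out directly. This sketch computes units only; the price per unit is not stated here, and `billable_units` is a hypothetical name.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run, following the rule above."""
    if not succeeded:
        return 0  # failures before training completes are not billed
    return max(100, number_of_steps)
```

The default 2000-step run is therefore billed 2000 units.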

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted, clips are fitted to the resolution bucket (and optionally scene-split), and audio is prepared when audio extension is enabled.
Training: for each clip, the last `conditioning_frames` frames are held clean as the suffix and the model learns to generate the preceding lead-in. With audio extension on, the audio from that same closing window is used as conditioning so the generated audio stays in sync with the video. Validation previews run at intervals.
Output: the LoRA, config, and validation reel are uploaded.
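The split made in the training step can be illustrated at the frame level. This is a conceptual sketch only: the trainer may well operate on encoded latents rather than raw frames, and `carve_suffix` is a hypothetical name.

```python
def carve_suffix(frames: list, conditioning_frames: int = 8) -> tuple[list, list]:
    """Split a clip into (lead_in_target, clean_suffix)."""
    if not 0 < conditioning_frames < len(frames):
        raise ValueError("suffix must be shorter than the clip")
    lead_in = frames[:-conditioning_frames]   # what the model learns to generate
    suffix = frames[-conditioning_frames:]    # kept clean as conditioning
    return lead_in, suffix
```

For an 89-frame clip with the default suffix of 8, the generation target is the first 81 frames.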

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
Suffix carving: the trailing `conditioning_frames` frames of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
Audio window: when audio extension is on, the audio from the same closing window is kept as conditioning, so the generated audio and video stay aligned.
Captions: the trigger phrase (if set) is prepended.
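The video-fitting step above (resize to fill, then center-crop) follows a common pattern that can be sketched as plain arithmetic. The rounding used here is an assumption, and `fill_and_center_crop` is a hypothetical helper, not the service's implementation.

```python
def fill_and_center_crop(src_w: int, src_h: int,
                         dst_w: int, dst_h: int) -> tuple[int, int, int, int]:
    """Return (scaled_w, scaled_h, crop_x, crop_y) for resize-to-fill plus center crop."""
    # Using the larger of the two ratios makes the scaled frame cover the bucket.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the scaled frame.
    crop_x = (scaled_w - dst_w) // 2
    crop_y = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, crop_x, crop_y
```

A 1920×1080 source fitted to the medium 16:9 bucket (768×448) is scaled to 796×448 and loses 14 pixels from each side.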

Tips for Getting Good Results

Dataset Quality

Use at least 10 clips that contain the kind of backward continuation (lead-in) you want learned.
Clips should be long enough that there is a meaningful lead-in beyond the suffix window.
For audio extension, ensure every clip carries clean, in-sync audio.

Caption Best Practices

Describe the action plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a person walking up to and opening a door`
Weak caption: `door`

Trigger Phrases

Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.

Inference Format Matching

At inference, supply a closing clip at least `conditioning_frames` frames long, use the same caption style and trigger phrase, and match the audio setting used in training.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_frames": 8,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "with_audio": false,
  "validation": [
    { "prompt": "a person approaching a door", "video_url": "https://example.com/closing.mp4" }
  ]
}
```
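A configuration like this could be submitted from Python roughly as follows. This is a sketch under assumptions: it presumes the endpoint is called through the `fal-client` package in the same way as other fal endpoints (with the `FAL_KEY` environment variable set), and `build_arguments` and `submit` are hypothetical helpers. The exact shape of the returned result is not shown here.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/extend-suffix"


def build_arguments(training_data_url: str, **overrides) -> dict:
    """Start from the recommended configuration and apply any overrides."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_frames": 8,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "with_audio": False,
        "validation": [],
    }
    arguments.update(overrides)
    return arguments


def submit(arguments: dict) -> dict:
    """Send the request and block until training finishes (assumed usage)."""
    import fal_client  # pip install fal-client; imported lazily so the builder works without it

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

For example, `submit(build_arguments("https://example.com/clips.zip", with_audio=True))` would start an audio-enabled run with the remaining defaults.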

Diagnosing Issues

Lead-in drifts or does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_frames` for more closing context.
Validation rejected for short clip: provide a validation clip at least `conditioning_frames` frames long, or lower `conditioning_frames`.
Audio extension rejected: when `with_audio` is true, every training and validation clip must have an audio track.
Overfitting: use fewer steps, a lower `rank`, or more clips.

Validation Prompt Tips

Use a fresh closing clip to gauge generalization.
Keep the prompt aligned with the lead-in you expect.

Common Pitfalls

`conditioning_frames` not a multiple of 8 (use 8, 16, 24, ...).
A suffix window so long it covers the whole clip (nothing left to generate).
Enabling audio extension with silent clips.

Endpoint ID: `fal-ai/ltx23-trainer-v2/extend-suffix`
