fal-ai/ltx23-trainer-v2/extend-suffix

Train a LoRA that generates the lead-in to a video, extending a clip backward in time from its ending.

The cost of training depends on the number of steps. The formula is: 0.0061 * steps. With 1000 steps, your request will cost $6.10.

LTX 2.3 Trainer — Backward Video Extension (`/extend-suffix`)

Overview

The `/extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to a video — it extends a clip backward in time. During training, the last N frames of each clip are kept as a clean "suffix" and the model learns to generate the frames that lead up to them. At inference you supply a closing clip and the model produces the preceding section.

Key features:

  • Learns to extend video backward, generating a plausible lead-in to a clean closing (suffix) window.
  • Optional joint audio extension: when enabled, audio is generated in sync with the video from the same closing window.
  • Trains on plain videos (plus optional captions) — the suffix is carved from each clip automatically.
  • Validation previews extend a supplied clip backward.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:

  • Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
  • Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (there is nothing to extend over time on a still). If you enable audio extension, every training clip — and every validation clip — must contain an audio track. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are skipped without an error, and if every clip is too short the request fails (HTTP 422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
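The minimum-length rule can be checked locally before uploading. The following is a minimal sketch, not the service's own logic: the function names are hypothetical, and it assumes that with `auto_scale_input` on every clip is accepted because it is resampled.

```python
def is_trainable(frame_count: int,
                 number_of_frames: int = 89,
                 auto_scale_input: bool = False) -> bool:
    """Return True if a clip would be kept under the minimum-length rule."""
    if auto_scale_input:
        # Assumption: shorter clips are resampled to the target count, not skipped.
        return True
    return frame_count >= number_of_frames


def min_duration_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest clip duration that satisfies the default settings."""
    return number_of_frames / frame_rate
```

With the defaults, `min_duration_seconds()` is 89 / 24, which is the "≈ 3.7 s" figure quoted above.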

With `with_audio` enabled, every training clip and every validation clip must contain an audio track; if any clip lacks one, the request is rejected (HTTP 422). With `with_audio` off, audio is ignored and the output is silent.
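As an illustration of the dataset layout, here is a small sketch that packs a local folder into the expected `.zip`. The function name and folder layout are hypothetical; it only uses the Python standard library and keeps subfolder paths, video files with the listed extensions, and any same-named `.txt` caption.

```python
import zipfile
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def build_dataset_zip(source_dir: str, zip_path: str) -> list[str]:
    """Zip every video under source_dir plus its optional caption file.

    Returns the archive member names that were written, in order.
    """
    source = Path(source_dir)
    added: list[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for video in sorted(source.rglob("*")):
            if not video.is_file() or video.suffix.lower() not in VIDEO_EXTENSIONS:
                continue  # images and other files are left out
            name = video.relative_to(source).as_posix()
            archive.write(video, name)
            added.append(name)
            caption = video.with_suffix(".txt")  # same base name as the video
            if caption.is_file():
                caption_name = caption.relative_to(source).as_posix()
                archive.write(caption, caption_name)
                added.append(caption_name)
    return added
```

The resulting archive still has to be uploaded somewhere reachable and passed as `training_data_url`.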

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_frames`

Type: `integer` Default: `8` (range `8`–`121`)

Number of trailing frames kept as the clean suffix that the model leads up to. Must be a multiple of 8 (e.g. 8, 16, 24). The suffix must be shorter than the clip so that a lead-in remains to generate: it cannot cover the whole `number_of_frames` (or `validation_number_of_frames`) clip.

| Value | Behavior |
| --- | --- |
| 8 | Short suffix: the model generates most of the lead-in (default) |
| 16–24 | Longer closing context before the generated lead-in |

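The constraints on `conditioning_frames` can be checked before submitting a request. This is a sketch of a client-side check under the rules stated above (range 8 to 121, multiple of 8, shorter than the clip); the function name is hypothetical and the service's own validation may differ in detail.

```python
def validate_conditioning_frames(conditioning_frames: int,
                                 number_of_frames: int = 89) -> None:
    """Raise ValueError if the suffix window breaks a documented rule."""
    if not 8 <= conditioning_frames <= 121:
        raise ValueError("conditioning_frames must be between 8 and 121")
    if conditioning_frames % 8 != 0:
        raise ValueError("conditioning_frames must be a multiple of 8")
    if conditioning_frames >= number_of_frames:
        # The suffix would cover the whole clip, leaving no lead-in to generate.
        raise ValueError("conditioning_frames must be smaller than number_of_frames")
```

The same check applies to previews by passing `validation_number_of_frames` as the second argument.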
Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1` (other values are snapped down to the nearest valid count) and must be larger than the suffix window.
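The snapping rule for `number_of_frames` amounts to rounding down to the nearest value of the form 8k + 1. A minimal sketch, with a hypothetical function name, assuming the input is already within the documented 9 to 121 range:

```python
def snap_number_of_frames(frames: int) -> int:
    """Snap down to the nearest count with frames % 8 == 1."""
    if not 9 <= frames <= 121:
        raise ValueError("number_of_frames must be between 9 and 121")
    return ((frames - 1) // 8) * 8 + 1
```

For example, a request for 96 frames would be snapped down to 89, while 97 is already valid and is left unchanged.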

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
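For scripts that need the pixel dimensions of a `resolution` and `aspect_ratio` pair, the bucket sizes listed for this endpoint can be kept in a small lookup. The names below are illustrative only.

```python
# (resolution, aspect_ratio) -> (width, height), copied from the resolution table.
RESOLUTION_BUCKETS = {
    ("low", "16:9"): (512, 288),
    ("low", "1:1"): (512, 512),
    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448),
    ("medium", "1:1"): (768, 768),
    ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),
    ("high", "1:1"): (960, 960),
    ("high", "9:16"): (544, 960),
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> tuple[int, int]:
    """Return (width, height) for a resolution bucket; KeyError if unknown."""
    return RESOLUTION_BUCKETS[(resolution, aspect_ratio)]
```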

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Audio Configuration
`with_audio`

Type: `boolean` Default: `false`

Set `true` to jointly extend audio and video — the model generates both leading up to the same closing window so they stay in sync. Requires every training clip and every validation clip to contain an audio track. Default (`false`) produces a video-only, silent extension.
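Because a single clip without audio causes the whole request to be rejected, it can help to check clips locally first. The sketch below assumes the `ffprobe` command-line tool from FFmpeg is installed; the function names are hypothetical and this is not how the service itself detects audio.

```python
import subprocess


def probe_audio_streams(path: str) -> str:
    """Run ffprobe and return one output line per audio stream (empty if none)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_audio(probe_output: str) -> bool:
    """True if the ffprobe output lists at least one audio stream."""
    return any(line.strip() for line in probe_output.splitlines())
```

A typical use is `has_audio(probe_audio_streams("clip.mp4"))` over every training and validation clip before setting `with_audio` to `true`.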

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness (used when audio extension is enabled).

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `video_url` (`string`, required) — a video to extend backward. Its last `conditioning_frames` frames are used as the suffix.

The validation clip must be at least `conditioning_frames` frames long (at the validation frame rate).
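A small helper can assemble the `validation` array and enforce the documented limits on the client side. This is a sketch with a hypothetical function name; it checks only the two rules stated above (at most 2 entries, `video_url` required).

```python
def build_validation(samples: list[tuple[str, str]]) -> list[dict]:
    """Turn (prompt, video_url) pairs into validation entries."""
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    entries = []
    for prompt, video_url in samples:
        if not video_url:
            raise ValueError("video_url is required for every validation sample")
        entries.append({"prompt": prompt, "video_url": video_url})
    return entries
```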

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Must be larger than the suffix window so there is a lead-in to preview.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
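The billing rule can be turned into a rough cost estimate. The sketch below assumes that one billable unit is priced at the 0.0061 USD per step quoted in the pricing formula for this endpoint; that mapping is an assumption, and the function names are hypothetical.

```python
PRICE_PER_UNIT_USD = 0.0061  # assumed: taken from the "0.0061 * steps" pricing formula


def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps: int, succeeded: bool = True) -> float:
    """Approximate cost in USD, rounded to cents."""
    return round(PRICE_PER_UNIT_USD * billable_units(number_of_steps, succeeded), 2)
```

Under these assumptions a 1000-step run comes to 6.10 USD, matching the pricing example, and a run with fewer than 100 steps is billed as 100 units.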

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, clips fit to the resolution bucket (optionally scene-split), and audio prepared when audio extension is enabled.
  2. Training — for each clip, the last `conditioning_frames` frames are held clean as the suffix and the model learns to generate the preceding lead-in. With audio extension on, the matching closing audio window conditions the audio so it stays in sync. Validation previews run at intervals.
  3. Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
  • Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
  • Suffix carving: the trailing `conditioning_frames` frames of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
  • Audio window: when audio extension is on, the matching closing seconds of audio condition the audio so audio and video stay aligned.
  • Captions: the trigger phrase (if set) is prepended.
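The suffix carving step described above can be pictured as a simple split of the frame sequence. This is a conceptual sketch on a plain Python list, not the trainer's actual implementation, and the function name is hypothetical.

```python
def carve_suffix(frames: list, conditioning_frames: int = 8) -> tuple[list, list]:
    """Split a clip into (lead_in_target, clean_suffix).

    The last conditioning_frames frames are the clean conditioning window;
    everything before them is what the model learns to generate.
    """
    if not 0 < conditioning_frames < len(frames):
        raise ValueError("suffix must be non-empty and shorter than the clip")
    split = len(frames) - conditioning_frames
    return frames[:split], frames[split:]
```

With the defaults (89 frames, 8 conditioning frames), the generation target is the first 81 frames and the suffix is the final 8.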

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 clips that contain the kind of backward continuation (lead-in) you want learned.
  • Clips should be long enough that there is a meaningful lead-in beyond the suffix window.
  • For audio extension, ensure every clip carries clean, in-sync audio.
Caption Best Practices
  • Describe the action plainly, optionally with a trigger phrase.
  • Keep captions consistent across clips.

Good caption: `a person walking up to and opening a door` Weak caption: `door`

Trigger Phrases
  • Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.
Inference Format Matching

At inference, supply a closing clip at least `conditioning_frames` frames long, use the same caption style and trigger phrase, and match the audio setting used in training.

```json
{
  "training_data_url": "https://example.com/clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_frames": 8,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "with_audio": false,
  "validation": [
    { "prompt": "a person approaching a door", "video_url": "https://example.com/closing.mp4" }
  ]
}
```
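A request like the one above could be submitted from Python roughly as follows. This is a sketch under assumptions: it presumes the third-party `fal-client` package is installed, that credentials are configured (for example through a `FAL_KEY` environment variable), and that its `subscribe` call accepts the endpoint id and an `arguments` dict. The helper names `build_request` and `submit` are hypothetical.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/extend-suffix"


def build_request(training_data_url: str, **overrides) -> dict:
    """Build a training request from the documented defaults plus overrides."""
    request = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_frames": 8,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "with_audio": False,
        "validation": [],
    }
    request.update(overrides)
    return request


def submit(request: dict):
    """Send the request and wait for the result (needs network and credentials)."""
    import fal_client  # assumed third-party client; imported lazily

    return fal_client.subscribe(ENDPOINT, arguments=request, with_logs=True)
```

For example, `submit(build_request("https://example.com/clips.zip", number_of_steps=1000))` would start a 1000-step run; the returned object is expected to carry the outputs listed under Outputs, such as `lora_file`.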
Diagnosing Issues
  • Lead-in drifts or does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_frames` for more closing context.
  • Validation rejected for short clip: provide a validation clip at least `conditioning_frames` frames long, or lower `conditioning_frames`.
  • Audio extension rejected: when `with_audio` is true, every training and validation clip must have an audio track.
  • Overfitting: fewer steps, lower `rank`, more clips.
Validation Prompt Tips
  • Use a fresh closing clip to gauge generalization.
  • Keep the prompt aligned with the lead-in you expect.
Common Pitfalls
  • `conditioning_frames` not a multiple of 8 (use 8, 16, 24, ...).
  • A suffix window so long it covers the whole clip (nothing left to generate).
  • Enabling audio extension with silent clips.