fal-ai/ltx23-trainer-v2/extend-suffix
The cost of training depends on the number of steps. The formula is: 0.0061 * steps. With 1000 steps, your request will cost $6.10.
LTX 2.3 Trainer — Backward Video Extension (`/extend-suffix`)
Overview
The `/extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to a video — it extends a clip backward in time. During training, the last N frames of each clip are kept as a clean "suffix" and the model learns to generate the frames that lead up to them. At inference you supply a closing clip and the model produces the preceding section.
Key features:
- Learns to extend video backward, generating a plausible lead-in to a clean closing (suffix) window.
- Optional joint audio extension: when enabled, audio is generated in sync with the video from the same closing window.
- Trains on plain videos (plus optional captions) — the suffix is carved from each clip automatically.
- Validation previews extend a supplied clip backward.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:
- Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
- Captions: a `.txt` with the same base name as each video (optional but recommended).
Images are rejected (there is nothing to extend over time on a still). If you enable audio extension, every training clip — and every validation clip — must contain an audio track. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.
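The layout above can be packed with a few lines of standard-library Python. This is a minimal sketch, not part of the service: the function name `build_dataset_zip` and the folder names are hypothetical, and it only mirrors the rules stated here (video extensions, same-named `.txt` captions, subfolders preserved, images left out).

```python
import zipfile
from pathlib import Path

# Video extensions listed in the dataset format above.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def build_dataset_zip(source_dir: str, zip_path: str) -> list[str]:
    """Pack videos and their same-named .txt captions; return archived names."""
    root = Path(source_dir)
    archived: list[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for video in sorted(root.rglob("*")):
            if not video.is_file() or video.suffix.lower() not in VIDEO_EXTENSIONS:
                continue  # skips folders, images, and stray files
            # Keep the relative path so same-named clips in subfolders stay distinct.
            name = video.relative_to(root).as_posix()
            archive.write(video, name)
            archived.append(name)
            caption = video.with_suffix(".txt")
            if caption.is_file():
                caption_name = caption.relative_to(root).as_posix()
                archive.write(caption, caption_name)
                archived.append(caption_name)
    return archived
```

Calling `build_dataset_zip("clips", "clips.zip")` returns the archived paths, which makes it easy to confirm that at least 10 clips made it in before uploading.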
Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
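A local pre-check can catch the too-short case before a request is submitted. The sketch below assumes the frame counts have already been measured by some other tool; `select_trainable` is a hypothetical helper that only mirrors the rule described above for `auto_scale_input` off, not the service's own code.

```python
def select_trainable(frame_counts: dict[str, int], number_of_frames: int = 89) -> list[str]:
    """Return clips with enough frames; fail if none qualify.

    frame_counts maps a clip name to its measured frame count (assumed input).
    """
    kept = [name for name, count in frame_counts.items() if count >= number_of_frames]
    if not kept:
        # Mirrors the documented 422 message for the all-too-short case.
        raise ValueError("All training videos are too short to be trainable")
    return kept


def clip_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Duration covered by a clip of the given frame count."""
    return number_of_frames / frame_rate
```

With the defaults, `clip_seconds()` gives 89 / 24, which is the roughly 3.7 s quoted above.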
With `with_audio` enabled, every training clip and every validation clip must contain an audio track; if any clip lacks one, the request is rejected (HTTP 422). With `with_audio` off, audio is ignored and the output is silent.
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of videos.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
`conditioning_frames`
Type: `integer`
Default: `8` (range `8`–`121`)
Number of trailing frames kept as the clean suffix the model leads up to. Must be a multiple of 8 (e.g. 8, 16, 24). It must also be shorter than the clip so that a lead-in is left to generate: the suffix cannot cover the whole `number_of_frames` (or `validation_number_of_frames`) clip.
| Value | Behavior |
|---|---|
| 8 | Short suffix — the model generates most of the lead-in (default) |
| 16–24 | Longer closing context before the generated lead-in |
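The constraints on `conditioning_frames` can be checked locally before submitting. This is a sketch of the documented rules only; `check_conditioning_frames` is a hypothetical helper, and the service may apply additional checks.

```python
def check_conditioning_frames(conditioning_frames: int, number_of_frames: int = 89) -> None:
    """Raise ValueError if the suffix window breaks a documented rule."""
    if not 8 <= conditioning_frames <= 121:
        raise ValueError("conditioning_frames must be in the range 8-121")
    if conditioning_frames % 8 != 0:
        raise ValueError("conditioning_frames must be a multiple of 8")
    if conditioning_frames >= number_of_frames:
        # The suffix would cover the whole clip, leaving no lead-in to generate.
        raise ValueError("conditioning_frames must be smaller than number_of_frames")
```

Run the same check a second time with `validation_number_of_frames` in place of `number_of_frames` when validation samples are supplied.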
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1` (other values are snapped down to the nearest valid count) and must be larger than the suffix window.
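One way to read "snapped down to the nearest valid count" is sketched below. The helper name is hypothetical, and the exact rounding the service applies is an assumption based on the sentence above.

```python
def snap_number_of_frames(frames: int) -> int:
    """Snap down to the nearest count that satisfies frames % 8 == 1."""
    if not 9 <= frames <= 121:
        raise ValueError("number_of_frames must be in the range 9-121")
    return ((frames - 1) // 8) * 8 + 1
```

For example, 89 and 97 are already valid and stay unchanged, while any value from 90 to 96 becomes 89.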
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
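For scripts that need the pixel dimensions, the resolution table can be expressed as a lookup. The dictionary below copies the table values; `bucket_size` is a hypothetical helper, and its defaults follow the documented defaults (`medium`, `1:1`).

```python
# (width, height) per resolution and aspect ratio, copied from the table above.
RESOLUTION_BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> tuple[int, int]:
    """Return (width, height) for a resolution / aspect-ratio pair."""
    try:
        return RESOLUTION_BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from None
```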
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `true`
Split long clips into scenes before training.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration above which a clip is eligible for scene splitting.
Audio Configuration
`with_audio`
Type: `boolean`
Default: `false`
Set `true` to jointly extend audio and video — the model generates both leading up to the same closing window so they stay in sync. Requires every training clip and every validation clip to contain an audio track. Default (`false`) produces a video-only, silent extension.
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness (used when audio extension is enabled).
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to video duration.
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `video_url` (`string`, required): a video to extend backward. Its last `conditioning_frames` frames are used as the suffix.

The validation clip must be at least `conditioning_frames` frames long (at the validation frame rate).
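The shape of the `validation` array can be checked before submitting. This sketch covers only the rules stated here (at most 2 entries, `video_url` required, `prompt` a string); `check_validation` is a hypothetical helper and does not inspect the videos themselves.

```python
def check_validation(samples: list[dict]) -> None:
    """Raise ValueError if the validation array breaks a documented rule."""
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for index, sample in enumerate(samples):
        if not sample.get("video_url"):
            raise ValueError(f"validation[{index}] is missing the required video_url")
        if not isinstance(sample.get("prompt", ""), str):
            raise ValueError(f"validation[{index}].prompt must be a string")
```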
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Must be larger than the suffix window so there is a lead-in to preview.
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel (when validation samples were provided).
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
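The billing rule can be turned into a quick estimate. The sketch below assumes that one billable unit costs the per-step price quoted in the cost formula at the top of this page (0.0061 per step); that mapping is an assumption, and the helper names are hypothetical.

```python
# Assumption: each billable unit is priced at the per-step rate from the cost formula.
PRICE_PER_UNIT_USD = 0.0061


def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run; failed requests are billed 0 units."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps: int, succeeded: bool = True) -> float:
    """Estimated cost in USD, rounded to cents."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 2)
```

Under that assumption, `estimated_cost_usd(1000)` reproduces the $6.10 figure from the cost formula, and the default of 2000 steps comes to twice that.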
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, clips fit to the resolution bucket (optionally scene-split), and audio prepared when audio extension is enabled.
- Training: for each clip, the last `conditioning_frames` frames are held clean as the suffix and the model learns to generate the preceding lead-in. With audio extension on, the matching closing audio window conditions the audio so it stays in sync. Validation previews run at intervals.
- Output: the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
- Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
- Suffix carving: the trailing `conditioning_frames` frames of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
- Audio window: when audio extension is on, the matching closing seconds of audio condition the audio so audio and video stay aligned.
- Captions: the trigger phrase (if set) is prepended.
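The suffix-carving step can be pictured as a simple split of the frame sequence. This is an illustrative sketch only: the real pipeline's data representation is not described here, both helper names are hypothetical, and the audio window length of `conditioning_frames / frame_rate` seconds is an assumption derived from "matching closing seconds".

```python
def carve_suffix(frames: list, conditioning_frames: int = 8) -> tuple[list, list]:
    """Split a clip into (lead_in, suffix); the suffix is the trailing window."""
    if conditioning_frames >= len(frames):
        raise ValueError("suffix window covers the whole clip; nothing left to generate")
    split = len(frames) - conditioning_frames
    # Everything before the split is the generation target; the rest stays clean.
    return frames[:split], frames[split:]


def suffix_duration_seconds(conditioning_frames: int = 8, frame_rate: int = 24) -> float:
    """Assumed length of the closing audio window that matches the video suffix."""
    return conditioning_frames / frame_rate
```

With the defaults, an 89-frame clip splits into an 81-frame lead-in and an 8-frame suffix, which corresponds to one third of a second at 24 fps.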
Tips for Getting Good Results
Dataset Quality
- Use at least 10 clips that contain the kind of backward continuation (lead-in) you want learned.
- Clips should be long enough that there is a meaningful lead-in beyond the suffix window.
- For audio extension, ensure every clip carries clean, in-sync audio.
Caption Best Practices
- Describe the action plainly, optionally with a trigger phrase.
- Keep captions consistent across clips.
Good caption: `a person walking up to and opening a door`
Weak caption: `door`
Trigger Phrases
- Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.
Inference Format Matching
At inference, supply a closing clip at least `conditioning_frames` frames long, use the same caption style and trigger phrase as in training, and match the audio setting.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_frames": 8,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "with_audio": false,
  "validation": [
    {
      "prompt": "a person approaching a door",
      "video_url": "https://example.com/closing.mp4"
    }
  ]
}
```
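A configuration like this could be sent as a JSON request body. The sketch below only builds the request with the standard library and does not send it. It assumes fal's queue REST endpoint follows the pattern `https://queue.fal.run/<endpoint id>` with an `Authorization: Key ...` header; neither detail is stated on this page, so check the platform documentation before relying on it. `build_training_request` is a hypothetical helper.

```python
import json
import urllib.request

ENDPOINT_ID = "fal-ai/ltx23-trainer-v2/extend-suffix"
# Assumption: queue base URL and auth scheme, not confirmed by this page.
QUEUE_BASE_URL = "https://queue.fal.run"


def build_training_request(arguments: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the training arguments."""
    if not arguments.get("training_data_url"):
        raise ValueError("training_data_url is required")
    return urllib.request.Request(
        url=f"{QUEUE_BASE_URL}/{ENDPOINT_ID}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; keeping construction separate makes the payload easy to inspect first.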
Diagnosing Issues
- Lead-in drifts or does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_frames` for more closing context.
- Validation rejected for short clip: provide a validation clip at least `conditioning_frames` frames long, or lower `conditioning_frames`.
- Audio extension rejected: when `with_audio` is true, every training and validation clip must have an audio track.
- Overfitting: use fewer steps, a lower `rank`, or more clips.
Validation Prompt Tips
- Use a fresh closing clip to gauge generalization.
- Keep the prompt aligned with the lead-in you expect.
Common Pitfalls
- `conditioning_frames` not a multiple of 8 (use 8, 16, 24, ...).
- A suffix window so long it covers the whole clip (nothing left to generate).
- Enabling audio extension with silent clips.