LTX 2.3 Trainer (V2) - Keyframe Interpolation (Training) API on fal

LTX 2.3 Trainer — Keyframe Interpolation (`/interpolate`)

Overview

The `/interpolate` endpoint trains a LoRA for the LTX 2.3 model that generates the video between keyframes. During training, the first and last frames (and optionally a middle frame) of each clip are kept clean and the model learns to generate the in-between motion. At inference you supply a start image and an end image (and optionally a middle image), and the model produces a smooth video connecting them.

Key features:

Learns keyframe-to-video interpolation from the first and last frames of each clip.
Optional middle keyframe for first + middle + last interpolation.
Trains on plain video clips (plus optional captions) — keyframes are taken from each clip automatically.
Video-only (no audio is learned).

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos:

Videos: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`
Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (there is nothing to interpolate over time on a still). The first and last frames of each clip (and the middle, if enabled) become the kept keyframes; the model learns to generate the rest. Aim for at least 10 clips. File names must be unique across the archive.
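The archive rules above can be checked locally before uploading. The following is a minimal sketch, not the trainer's own validation: the `check_archive` helper, the list of still-image extensions, the case-insensitive extension match, and the reading of "unique file names" as unique base names are all assumptions made for illustration.

```python
import zipfile
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
# Assumed set of still-image formats; the trainer's actual list is not documented here.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".gif"}


def check_archive(zip_path):
    """Return (videos, videos_without_caption, problems) for a dataset .zip."""
    videos, seen_names, caption_stems, problems = [], set(), set(), []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            # Skip directories, macOS metadata, and hidden files.
            if name.endswith("/") or "__MACOSX" in path.parts or path.name.startswith("."):
                continue
            ext = path.suffix.lower()  # assumption: extensions match case-insensitively
            if ext in IMAGE_EXTS:
                problems.append(f"still image: {name}")
                continue
            if ext not in VIDEO_EXTS and ext != ".txt":
                continue
            # Assumption: "unique file names" means unique base names across folders.
            if path.name in seen_names:
                problems.append(f"duplicate file name: {name}")
                continue
            seen_names.add(path.name)
            if ext == ".txt":
                caption_stems.add(path.stem)
            else:
                videos.append(name)
    missing = [v for v in videos if PurePosixPath(v).stem not in caption_stems]
    if len(videos) < 10:
        problems.append(f"fewer than 10 clips ({len(videos)} found)")
    return videos, missing, problems
```

Calling `check_archive("motion_clips.zip")` returns the usable video names, the videos that lack a matching `.txt` caption, and a list of human-readable problems to fix before upload.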

Minimum clip length: with `auto_scale_input` off (the default), each video should have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are dropped during preprocessing; if no clip is long enough, the run fails with no usable training data. Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
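The arithmetic behind the minimum clip length can be sketched as follows. Both helper names are hypothetical, and the check only mirrors the rule as described above (frame count compared against `number_of_frames`); the trainer's actual preprocessing may differ in detail.

```python
def min_clip_seconds(number_of_frames=89, frame_rate=24):
    """Duration covered by `number_of_frames` frames at `frame_rate` fps."""
    return number_of_frames / frame_rate


def is_long_enough(clip_frames, number_of_frames=89, auto_scale_input=False):
    """Mirror the documented rule: short clips are kept only when auto-scaling is on."""
    return auto_scale_input or clip_frames >= number_of_frames


print(round(min_clip_seconds(), 2))              # 89 / 24 fps
print(is_long_enough(60))                        # dropped with the defaults
print(is_long_enough(60, auto_scale_input=True)) # resampled instead of dropped
```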

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

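LoRA rank, which sets the adapter's capacity. Higher values can capture more detail but increase the risk of overfitting on small datasets.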

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`include_middle_keyframe`

Type: `boolean` Default: `false`

When true, a middle keyframe is also kept (first + middle + last → video) instead of just first + last. When enabled, every validation sample must also provide a middle keyframe image.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
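To preview which frame count a given value ends up as, the snapping rule can be written as a one-line calculation. This is a sketch of the rule as stated above; `snap_frames` is a hypothetical helper, and raising an error for out-of-range values is an assumption rather than documented API behavior.

```python
def snap_frames(n, lo=9, hi=121):
    """Snap n down to the nearest count satisfying frames % 8 == 1."""
    if not lo <= n <= hi:
        # Assumption: values outside the documented range are treated as invalid.
        raise ValueError(f"number_of_frames must be in [{lo}, {hi}], got {n}")
    return n - (n - 1) % 8


for requested in (89, 90, 96, 100, 121):
    print(requested, "->", snap_frames(requested))
```

Valid counts in the documented range are therefore 9, 17, 25, and so on up to 121.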

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

Resolution	16:9	1:1	9:16
low	512×288	512×512	288×512
medium	768×448	768×768	448×768
high	960×544	960×960	544×960

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
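The resolution table and `aspect_ratio` together determine the training frame size. A small lookup, transcribed from the table, makes the combination explicit; the `BUCKETS` dictionary and `bucket_size` function are illustrative names, not part of the API.

```python
# (width, height) in pixels, transcribed from the resolution table.
BUCKETS = {
    "low":    {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high":   {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution="medium", aspect_ratio="1:1"):
    """Return the (width, height) bucket for a resolution and aspect ratio."""
    return BUCKETS[resolution][aspect_ratio]


print(bucket_size())                # defaults: medium, 1:1
print(bucket_size("high", "9:16"))
```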

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`start_image_url` (`string`, required) — the first-frame keyframe image.
`end_image_url` (`string`, required) — the last-frame keyframe image.
`middle_image_url` (`string`, optional) — a middle keyframe image; required when `include_middle_keyframe` is true.
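The constraints on validation samples (at most two entries, required start and end images, and a middle image whenever `include_middle_keyframe` is true) can be checked before submitting. This is a client-side sketch with a hypothetical `check_validation` helper; the server's own error messages are not documented here.

```python
def check_validation(samples, include_middle_keyframe=False):
    """Return a list of problems found in the `validation` array."""
    errors = []
    if len(samples) > 2:
        errors.append("at most 2 validation entries are allowed")
    for i, sample in enumerate(samples):
        for key in ("start_image_url", "end_image_url"):
            if not sample.get(key):
                errors.append(f"validation[{i}]: missing {key}")
        if include_middle_keyframe and not sample.get("middle_image_url"):
            errors.append(f"validation[{i}]: missing middle_image_url")
    return errors


samples = [{
    "prompt": "a flower blooming",
    "start_image_url": "https://example.com/bud.png",
    "end_image_url": "https://example.com/bloom.png",
}]
print(check_validation(samples))
print(check_validation(samples, include_middle_keyframe=True))
```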

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
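The billing rule can be expressed as a small function. Note that within the documented `number_of_steps` range (`100`–`20000`), `max(100, number_of_steps)` is simply `number_of_steps`. The `billable_units` helper and its `succeeded` flag are illustrative; the price per unit is not stated here and is deliberately left out.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run, following the rule stated above."""
    if not succeeded:
        # Covers the documented zero-charge cases (HTTP 422, dataset-download failure).
        return 0
    return max(100, number_of_steps)


print(billable_units(2000))
print(billable_units(2000, succeeded=False))
```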

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted, clips are fitted to the resolution bucket (and optionally scene-split), and a temporal pattern is built per clip that keeps the first and last (and optionally middle) frames clean.
Training: the kept keyframes are held fixed and the model learns to generate the in-between frames. Validation previews run at intervals.
Output: the LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Video fitting: clips are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count.
Keyframe selection: the first and last frames (and the middle, if enabled) of each clip are kept clean as the keyframes; the rest is the generation target. This happens automatically.
Captions: the trigger phrase (if set) is prepended.
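The video-fitting step ("resized to fill the resolution bucket and center-cropped") corresponds to a standard fill-then-crop geometry, sketched below. This illustrates the general technique only; the trainer's exact rounding and resampling are not documented, and `fill_and_center_crop` is a hypothetical helper.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return (resized_size, crop_box) so the source fills the target, then is center-cropped.

    crop_box is (left, top, right, bottom) in resized-frame pixels.
    """
    # Scale by the larger ratio so both dimensions cover the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Trim the overflow evenly from both sides.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


# A 1920x1080 clip fitted to the medium 1:1 bucket (768x768).
print(fill_and_center_crop(1920, 1080, 768, 768))
```

For a landscape clip trained at `1:1`, this means the left and right edges are discarded, so keep the subject near the center of the frame.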

How Interpolation Training Works

The model is trained to fill in the motion between fixed keyframes. The trained LoRA specializes the base model in producing smooth, plausible transitions consistent with your footage. At inference you supply the start and end images (and optionally a middle image) plus a prompt.

Tips for Getting Good Results

Dataset Quality

Use at least 10 clips whose start→end motion is representative of the transitions you want learned.
Clips should contain clear, coherent motion between their first and last frames.
Keyframes (first/last frames) should be clean and sharp, since they anchor generation.

Caption Best Practices

Describe the motion/transition plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a flower blooming from bud to full bloom`
Weak caption: `flower`

Trigger Phrases

A distinctive trigger phrase helps invoke a particular interpolation style; include it in every caption and at inference.

Inference Format Matching

At inference, supply start and end images (and a middle image if you trained with `include_middle_keyframe`), and use the same caption style and trigger phrase.
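One way to keep inference prompts consistent with training captions is to prepend the trigger phrase only when it is not already there. This is a sketch under assumptions: `with_trigger` is a hypothetical helper, and joining the phrase and prompt with a single space is a guess, since the separator used during training is not documented here.

```python
def with_trigger(prompt, trigger_phrase=""):
    """Prepend the trigger phrase unless the prompt already starts with it."""
    prompt = prompt.strip()
    trigger = trigger_phrase.strip()
    if not trigger or prompt.lower().startswith(trigger.lower()):
        return prompt
    # Assumption: a single space separates the trigger phrase from the caption.
    return f"{trigger} {prompt}"


print(with_trigger("a flower blooming from bud to full bloom", "BLOOMLAPSE"))
```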

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/motion_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "include_middle_keyframe": false,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a flower blooming",
      "start_image_url": "https://example.com/bud.png",
      "end_image_url": "https://example.com/bloom.png"
    }
  ]
}
```
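A configuration like the one above could be submitted from Python roughly as follows. This is a hedged sketch, not an official snippet: it assumes the third-party `fal-client` package (imported as `fal_client`) with credentials configured in the environment, and `build_arguments` and `submit` are hypothetical helpers. The checks cover only two of the documented constraints.

```python
import json

ENDPOINT = "fal-ai/ltx23-trainer-v2/interpolate"


def build_arguments(config_json):
    """Parse a JSON config and check a couple of documented constraints."""
    args = json.loads(config_json)
    if not args.get("training_data_url"):
        raise ValueError("training_data_url is required")
    steps = args.get("number_of_steps", 2000)
    if not 100 <= steps <= 20000:
        raise ValueError("number_of_steps must be between 100 and 20000")
    return args


def submit(arguments):
    """Send the training request and block until it finishes."""
    # Imported lazily so the helpers above work without the package installed.
    import fal_client  # assumption: installed via `pip install fal-client`

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)


if __name__ == "__main__":
    config = '{"training_data_url": "https://example.com/motion_clips.zip", "number_of_steps": 2000}'
    print(build_arguments(config))
    # result = submit(build_arguments(config))  # uncomment to start a billed run
```

The returned result is expected to carry the fields listed under Outputs (`lora_file`, `config_file`, and so on); their exact shape is not described here, so inspect the response before relying on it.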

Diagnosing Issues

Transitions are jumpy or implausible: add more clips with smooth motion; ensure keyframes are clean.
Middle keyframe missing error: when `include_middle_keyframe` is true, every validation sample needs `middle_image_url`.
Overfitting: use fewer steps, a lower `rank`, or more clips.

Validation Prompt Tips

Use fresh keyframe images to gauge generalization.
Keep the prompt aligned with the transition you expect between the keyframes.

Common Pitfalls

Uploading still images instead of clips (rejected).
Enabling `include_middle_keyframe` but omitting `middle_image_url` in validation.
Keyframes that are blurry or unrepresentative.

Endpoint ID: `fal-ai/ltx23-trainer-v2/interpolate`
