LTX 2.3 Trainer (V2) - Spatial Outpainting (Training) API on fal

LTX 2.3 Trainer — Spatial Outpainting (`/outpaint`)

Overview

The `/outpaint` endpoint trains a LoRA for the LTX 2.3 model that expands the frame outward. During training, an inner rectangle of each video is held pixel-faithful and the model learns to generate everything outside it. At inference, you supply footage whose inner region should be preserved while the surround is filled in.

Outpainting is purely spatial, so training is always video-only: there is no audio modality, and the soundtrack is unchanged by frame expansion.

Key features:

Learns to generate the area surrounding a kept inner rectangle of each frame.
The kept rectangle is defined in training-resolution pixels via `crop_top_left` / `crop_bottom_right`.
Trains on plain full-canvas videos (plus optional captions).
Validation previews keep the supplied inner region and regenerate the surround.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain full-canvas videos:

Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
Captions: a `.txt` with the same base name as each video (optional but recommended).

Images are rejected (outpainting operates over a multi-frame clip). Provide the full canvas you want the model to learn — the kept inner rectangle is specified by the crop parameters, not by cropping your clips. Aim for at least 10 clips. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically.
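As a rough illustration of the layout described above, the following sketch packs a local folder into a `.zip`, keeping each video next to its same-named `.txt` caption and preserving subfolders. The function name `build_training_archive` and the folder paths are hypothetical; uploading the archive to obtain `training_data_url` is not shown.

```python
import zipfile
from pathlib import Path

# Video formats listed in the dataset format section.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def build_training_archive(source_dir: str, archive_path: str) -> list[str]:
    """Zip every video under source_dir plus its same-named .txt caption, if any.

    Returns the archive member names that were written.
    """
    root = Path(source_dir)
    written: list[str] = []
    # ZIP_STORED: video files are already compressed, so deflating gains little.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for video in sorted(root.rglob("*")):
            if not video.is_file() or video.suffix.lower() not in VIDEO_EXTENSIONS:
                continue  # images and other files are left out
            members = [video]
            caption = video.with_suffix(".txt")
            if caption.is_file():
                members.append(caption)
            for member in members:
                # Relative names keep subfolders, so same-named clips stay distinct.
                name = member.relative_to(root).as_posix()
                archive.write(member, name)
                written.append(name)
    return written
```

Calling `build_training_archive("clips", "full_canvas_clips.zip")` would leave images and stray files out of the archive, since the endpoint rejects images.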

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
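A small pre-flight check can catch short clips before you submit. The sketch below assumes you already know each clip's frame count (for example from a probing tool of your choice). The helper name `partition_by_length` and the input dictionary are hypothetical, and the check only mirrors the rule stated above rather than the service's actual implementation.

```python
def partition_by_length(
    frame_counts: dict[str, int],
    number_of_frames: int = 89,
    auto_scale_input: bool = False,
) -> tuple[list[str], list[str]]:
    """Split clip names into (usable, skipped) according to the minimum-length rule."""
    if auto_scale_input:
        # With auto-scaling on, shorter clips are resampled instead of skipped.
        return sorted(frame_counts), []
    usable = sorted(name for name, count in frame_counts.items() if count >= number_of_frames)
    skipped = sorted(name for name, count in frame_counts.items() if count < number_of_frames)
    return usable, skipped


def minimum_duration_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest clip duration that still contains number_of_frames frames."""
    return number_of_frames / frame_rate
```

If `usable` comes back empty, the request would be expected to fail with the 422 error quoted above, so it is worth either enabling `auto_scale_input` or lowering `number_of_frames`.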

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of full-canvas videos.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Outpainting Region

`crop_top_left` (required)

Type: `array` of two integers `[x1, y1]`

Top-left corner of the kept inner rectangle, in training-resolution pixels — i.e. the pixel grid chosen by `resolution` × `aspect_ratio`, not the uploaded video's native size (clips are resized to that grid before the crop is applied). Coordinates snap to a 32-pixel grid. Example: `[0, 0]`.

`crop_bottom_right` (required)

Type: `array` of two integers `[x2, y2]`

Bottom-right corner of the kept inner rectangle, in training-resolution pixels. Must satisfy `x2 > x1` and `y2 > y1` and lie within the training frame. Everything outside this rectangle is what the model generates. Example: `[512, 512]`.

The kept rectangle must not cover the entire frame after snapping to the 32-pixel grid (there must be a margin of at least ~32 pixels on at least one side to leave something to outpaint), and it must span at least ~32 pixels in each dimension.
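The constraints above can be checked locally before a request is sent. This is a minimal sketch, not the service's validator: the function names are hypothetical, and snapping to the nearest multiple of 32 is an assumption, since the documentation only says that coordinates snap to a 32-pixel grid.

```python
GRID = 32


def snap(value: int, grid: int = GRID) -> int:
    """Round to the nearest grid multiple (assumed rule; halves round up)."""
    return (value + grid // 2) // grid * grid


def check_crop(
    top_left: tuple[int, int],
    bottom_right: tuple[int, int],
    frame_size: tuple[int, int],
) -> tuple[tuple[int, int], tuple[int, int]]:
    """Validate a kept rectangle and return its snapped corners."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    width, height = frame_size
    if not (x2 > x1 and y2 > y1):
        raise ValueError("bottom-right must be strictly below and right of top-left")
    if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
        raise ValueError("rectangle must lie within the training frame")
    sx1, sy1, sx2, sy2 = (snap(v) for v in (x1, y1, x2, y2))
    if sx2 - sx1 < GRID or sy2 - sy1 < GRID:
        raise ValueError("rectangle must span at least 32 pixels in each dimension")
    if (sx1, sy1, sx2, sy2) == (0, 0, width, height):
        raise ValueError("rectangle covers the whole frame; nothing left to outpaint")
    return (sx1, sy1), (sx2, sy2)
```

For a `medium`, `1:1` job the frame size is `(768, 768)`, so `check_crop((128, 128), (640, 640), (768, 768))` passes unchanged because both corners already sit on the grid.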

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
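The snapping rule can be reproduced with a little modular arithmetic. The sketch below uses a hypothetical helper name and assumes out-of-range values are rejected rather than clamped, which the documentation does not state.

```python
def snap_frame_count(requested: int, minimum: int = 9, maximum: int = 121) -> int:
    """Snap down to the nearest count n with n % 8 == 1."""
    if not minimum <= requested <= maximum:
        # Assumption: values outside the documented range are treated as invalid.
        raise ValueError(f"number_of_frames must be in {minimum}..{maximum}")
    return requested - (requested - 1) % 8


for requested in (89, 90, 96, 97, 100):
    print(requested, "->", snap_frame_count(requested))
```

Valid counts are therefore 9, 17, 25, and so on up to 121; a request for 100 frames would train on 97.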

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

The training frame size — and therefore the coordinate space for the crop:

Resolution	16:9	1:1	9:16
low	512×288	512×512	288×512
medium	768×448	768×768	448×768
high	960×544	960×960	544×960
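For client-side checks it can help to have the table available as a lookup. The mapping below copies the values from the table; the names `TRAINING_FRAME_SIZES` and `training_frame_size` are hypothetical.

```python
# (resolution, aspect_ratio) -> (width, height), copied from the table above.
TRAINING_FRAME_SIZES: dict[tuple[str, str], tuple[int, int]] = {
    ("low", "16:9"): (512, 288),
    ("low", "1:1"): (512, 512),
    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448),
    ("medium", "1:1"): (768, 768),
    ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),
    ("high", "1:1"): (960, 960),
    ("high", "9:16"): (544, 960),
}


def training_frame_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> tuple[int, int]:
    """Return the (width, height) that crop coordinates refer to."""
    try:
        return TRAINING_FRAME_SIZES[(resolution, aspect_ratio)]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from None
```

Every size in the table is a multiple of 32 in both dimensions, which lines up with the 32-pixel grid used for the crop rectangle.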

`aspect_ratio`

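Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

Aspect ratio of the training frame. Together with `resolution`, it selects the frame size from the table above.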

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — a full-canvas video whose surround should be outpainted. The inner rectangle (scaled to the validation resolution) is what's kept.

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

The crop is scaled from the training resolution to the validation resolution automatically.
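To get a feel for where the kept rectangle ends up in a validation preview, the sketch below scales a training-resolution corner proportionally to the validation frame. Proportional scaling with round-to-nearest is an assumption: the documentation says the crop is scaled automatically but does not describe rounding or re-snapping. The helper name `scale_point` is hypothetical.

```python
def scale_point(
    point: tuple[int, int],
    training_size: tuple[int, int],
    validation_size: tuple[int, int],
) -> tuple[int, int]:
    """Scale a crop corner from the training frame to the validation frame."""
    x, y = point
    train_w, train_h = training_size
    val_w, val_h = validation_size
    # Integer round-to-nearest of x * val_w / train_w (and likewise for y).
    return (
        (x * val_w + train_w // 2) // train_w,
        (y * val_h + train_h // 2) // train_h,
    )


# medium 1:1 (768x768) training crop mapped onto high 1:1 (960x960) validation.
print(scale_point((128, 128), (768, 768), (960, 960)))
print(scale_point((640, 640), (768, 768), (960, 960)))
```

Under these assumptions a `[128, 128]` to `[640, 640]` training crop corresponds to roughly `[160, 160]` to `[800, 800]` at `high` resolution.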

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
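The billing rule is simple enough to express directly. The function below is a hypothetical helper for estimating units; the `succeeded` flag stands in for "the run completed training", and the price per unit is not covered here.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run, following the rule stated above."""
    if not succeeded:
        # Validation errors (HTTP 422) and dataset-download failures bill nothing.
        return 0
    return max(100, number_of_steps)
```

Because `number_of_steps` is already limited to the range 100 to 20000, the billed units for a successful run equal the step count in practice.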

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted and the full-canvas clips are fitted to the resolution bucket (and optionally scene-split).
Training: the kept inner rectangle is held pixel-faithful and excluded from the learning target; the model learns to generate the surrounding region. Validation previews run at intervals.
Output: the LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
Video fitting: clips are resized to fill the resolution bucket and center-cropped to the exact training frame size — this is the coordinate space your crop rectangle refers to.
Kept rectangle: the rectangle from `crop_top_left`..`crop_bottom_right` is snapped to a 32-pixel grid and held fixed; everything outside it is the generation target.
Captions: the trigger phrase (if set) is prepended.
Audio: outpainting is spatial-only; the original soundtrack is unaffected (no audio is generated or trained).
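The video-fitting step determines how a point in your source footage maps into training-resolution coordinates. The sketch below models "resize to fill, then center-crop" with hypothetical helper names; the exact rounding and interpolation used by the service are not documented, so treat the results as approximate.

```python
def fit_geometry(
    src_w: int, src_h: int, dst_w: int, dst_h: int
) -> tuple[float, tuple[int, int]]:
    """Return (scale, (offset_x, offset_y)) for resize-to-fill plus center-crop."""
    # Scale so the resized clip covers the target frame in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    resized_w = max(dst_w, round(src_w * scale))
    resized_h = max(dst_h, round(src_h * scale))
    # Center-crop: equal amounts are trimmed from opposite sides.
    return scale, ((resized_w - dst_w) // 2, (resized_h - dst_h) // 2)


def native_to_training(
    x: int, y: int, src_size: tuple[int, int], dst_size: tuple[int, int]
) -> tuple[int, int]:
    """Map a native-pixel coordinate into the training frame's coordinate space."""
    scale, (offset_x, offset_y) = fit_geometry(*src_size, *dst_size)
    return round(x * scale) - offset_x, round(y * scale) - offset_y
```

For example, a 1536×768 source fitted to a 768×768 frame keeps its height and loses 384 pixels from each side, so native `(384, 0)` becomes training `(0, 0)`. A result outside the frame means the point was cropped away.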

Tips for Getting Good Results

Dataset Quality

Use at least 10 full-canvas clips representative of the surroundings you want the model to learn to generate.
The content inside and outside the kept rectangle should be coherent (the model learns how the surround relates to the kept center).

Choosing the Crop Rectangle

Specify the rectangle in training-resolution pixels (from the resolution/aspect-ratio table above), not your source video's native size.
Leave at least a ~32-pixel margin on at least one side so there is a region to outpaint.
Keep the rectangle at least ~32 pixels wide and tall.

Caption Best Practices

Describe the full scene (kept center and surroundings) plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a chef cooking at a stove in a busy restaurant kitchen`
Weak caption: `cooking`

Trigger Phrases

Use a distinctive trigger phrase to invoke a particular surround style; include it in every caption and at inference.
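When building inference prompts, a small helper can keep the trigger phrase consistent without duplicating it. This is only a sketch: the helper name `with_trigger` is hypothetical, and joining with a single space is an assumption, since the separator used during training is not documented.

```python
def with_trigger(caption: str, trigger_phrase: str = "") -> str:
    """Prepend the trigger phrase unless the caption already starts with it."""
    caption = caption.strip()
    trigger = trigger_phrase.strip()
    if not trigger or caption.lower().startswith(trigger.lower()):
        return caption
    # Assumed separator: one space between trigger phrase and caption.
    return f"{trigger} {caption}"
```

With an empty `trigger_phrase` (the default) the caption is returned unchanged.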

Inference Format Matching

At inference, supply a full-canvas clip and a kept rectangle consistent with how you trained, and use the same caption style and trigger phrase.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/full_canvas_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "crop_top_left": [128, 128],
  "crop_bottom_right": [640, 640],
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "a chef in a busy kitchen", "video_url": "https://example.com/scene.mp4" }
  ]
}
```
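One way to submit a configuration like this is through the `fal-client` Python package. The sketch below assumes that package is installed, that the `FAL_KEY` environment variable is set, and that the endpoint ID is `fal-ai/ltx23-trainer-v2/outpaint`. The helper names are hypothetical, and the required-field check only covers the parameters marked as required in this reference.

```python
import json

ENDPOINT = "fal-ai/ltx23-trainer-v2/outpaint"
REQUIRED_FIELDS = ("training_data_url", "crop_top_left", "crop_bottom_right")


def load_arguments(config_json: str) -> dict:
    """Parse a JSON configuration and check the required parameters are present."""
    arguments = json.loads(config_json)
    missing = [field for field in REQUIRED_FIELDS if field not in arguments]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    return arguments


def submit_training(arguments: dict) -> dict:
    """Run the training job and block until it finishes (needs network access)."""
    # Imported lazily so the helpers above work without the package installed.
    import fal_client  # third-party: pip install fal-client

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

A typical call would be `result = submit_training(load_arguments(open("config.json").read()))`; the returned dictionary is expected to carry the fields listed under Outputs, such as `lora_file`.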

Diagnosing Issues

Surround does not match the kept center: add more coherent full-canvas clips; keep captions descriptive of the whole scene.
Crop rejected: ensure the rectangle is inside the training frame, at least ~32 px in each dimension, and leaves a margin so it does not cover the whole frame.
Overfitting: use fewer steps, a lower `rank`, or more clips.

Validation Prompt Tips

Use a fresh full-canvas clip to gauge generalization.
Keep the prompt describing the whole scene, including the expected surround.

Common Pitfalls

Giving crop coordinates in the source video's native size instead of training-resolution pixels.
A kept rectangle that covers (nearly) the whole frame — nothing left to outpaint.
Cropping your clips before upload (provide the full canvas; let the crop parameters define the kept region).

fal-ai/ltx23-trainer-v2/outpaint
