LTX 2.3 Trainer (V2) - Video Inpainting (Training) API on fal

LTX 2.3 Trainer — Video Inpainting (`/inpaint`)

Overview

The `/inpaint` endpoint trains a LoRA for the LTX 2.3 model that regenerates a masked region of a video while keeping the rest unchanged. Each training clip is paired with a mask marking the area to regenerate; the model learns to fill that region in a way that blends with the kept pixels. At inference you supply a video and a mask, and the model regenerates only the masked area.

Inpainting operates only on the spatial and temporal dimensions of the video, so training is always video-only and audio is unaffected.

Key features:

Learns to regenerate a masked region while preserving the unmasked surroundings.
Masks can be a single image (applied to every frame) or a video mask (per-frame).
Standard mask convention: WHITE = the region to regenerate/edit, BLACK = keep unchanged.
Validation previews inpaint a supplied video using a supplied mask.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) where each example is a clip plus its mask:

`<name>.<ext>` — the source video (`.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`).
`<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
`<name>.txt` — optional caption.

Every clip needs a matching `<name>_mask` file. The mask resolution need not match the video; it is resized automatically. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 examples.
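The video-mask normalization rule above (freeze-hold the last frame of a short mask, trim a long one) can be illustrated with a minimal sketch. The function name `normalize_mask_frames` and the list-of-frames representation are hypothetical and chosen only for illustration; the service's actual preprocessing code is not shown in this document.

```python
from typing import List, TypeVar

T = TypeVar("T")


def normalize_mask_frames(mask_frames: List[T], clip_frame_count: int) -> List[T]:
    """Return exactly clip_frame_count mask frames (hypothetical helper)."""
    if not mask_frames:
        raise ValueError("mask has no frames")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        # Longer (or equal) mask: trim to the clip length.
        return mask_frames[:clip_frame_count]
    # Shorter mask: repeat (freeze-hold) the last frame until lengths match.
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding


print(normalize_mask_frames(["m0", "m1", "m2"], 5))  # short mask, padded
print(normalize_mask_frames(["m0", "m1", "m2"], 2))  # long mask, trimmed
```

Because this happens automatically, you only need to make sure the mask's early frames line up with the clip; any tail mismatch is handled by the hold or trim.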

Minimum clip length: with `auto_scale_input` off (the default), each video should have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are dropped during preprocessing; if no clip is long enough, the run fails with no usable training data. Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
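As a quick check of the arithmetic above, the sketch below computes the minimum clip duration and filters a set of clips by frame count. The helper names are hypothetical, and the sketch assumes each clip's frame count is measured at the target frame rate; how the service counts frames internally is not documented here.

```python
from typing import Dict, List


def min_clip_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Duration a clip needs in order to contain number_of_frames frames."""
    return number_of_frames / frame_rate


def usable_clips(frame_counts: Dict[str, int], number_of_frames: int = 89) -> List[str]:
    """Names of clips long enough to survive preprocessing when
    auto_scale_input is off (assumed rule: frames >= number_of_frames)."""
    return [name for name, frames in frame_counts.items() if frames >= number_of_frames]


print(round(min_clip_seconds(), 2))  # seconds needed at the defaults
print(usable_clips({"clip01.mp4": 120, "clip02.mp4": 60, "clip03.mp4": 89}))
```

If the second call returns an empty list for your dataset, either lower `number_of_frames` or turn on `auto_scale_input`.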

Example layout:


clip01.mp4   clip01_mask.png   clip01.txt
clip02.mp4   clip02_mask.mp4   clip02.txt
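Before uploading, it can help to confirm that every clip has a mask. The following is a minimal local check, not part of the fal API: `find_unpaired_clips` is a hypothetical helper that applies the naming rules listed above to a list of archive member names. It assumes any file whose stem ends in `_mask` is a mask, so a source clip that is itself named `something_mask.mp4` would be misclassified.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
MASK_SUFFIX = "_mask"


def find_unpaired_clips(names: Iterable[str]) -> List[str]:
    """Return clip names (without extension) that have no <name>_mask file."""
    clips, masks = set(), set()
    for name in names:
        path = PurePosixPath(name)
        ext = path.suffix.lower()
        # Skip hidden files and macOS metadata folders.
        if path.name.startswith(".") or "__MACOSX" in path.parts:
            continue
        if path.stem.endswith(MASK_SUFFIX) and ext in VIDEO_EXTS | IMAGE_EXTS:
            masks.add(path.stem[: -len(MASK_SUFFIX)])
        elif ext in VIDEO_EXTS:
            clips.add(path.stem)
    return sorted(clips - masks)


members = [
    "clip01.mp4", "clip01_mask.png", "clip01.txt",
    "clip02.mp4", "clip02_mask.mp4", "clip02.txt",
    "clip03.mov",
]
print(find_unpaired_clips(members))  # clips still missing a mask
```

To run it against a real archive, pass `zipfile.ZipFile(path).namelist()` from the standard library as `names`.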

Video masks require a probeable clip. If a clip paired with a video mask has a frame count that cannot be determined (for example, an unusual or variable-frame-rate encoding), the request is rejected (HTTP 422). Re-encode the clip to a standard format, or supply an image mask instead.

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of clip + mask examples.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to every caption during training; include the same phrase in your prompts at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA rank. Higher values give the adapter more capacity (and more parameters).

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. The value must satisfy `number_of_frames % 8 == 1` (for example `9`, `17`, `89`, or `121`); other values are snapped down to the nearest valid count.
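The snapping rule can be written out as a short sketch. The function name is hypothetical, and raising an error for out-of-range values is an assumption; the document only states the allowed range and the snap-down behavior.

```python
def snap_number_of_frames(frames: int) -> int:
    """Snap down to the nearest count n with n % 8 == 1 (hypothetical helper)."""
    if not 9 <= frames <= 121:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError("number_of_frames must be between 9 and 121")
    return frames - ((frames - 1) % 8)


for requested in (89, 90, 96, 97, 121):
    print(requested, "->", snap_number_of_frames(requested))
```

For example, a request for 96 frames trains on 89, since 97 is the next valid count and values are never rounded up.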

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
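The `resolution` and `aspect_ratio` settings together select one bucket from the table above. The lookup below is a minimal sketch that transcribes the table; the names `RESOLUTION_BUCKETS` and `bucket_size` are hypothetical, and the values are read as width × height in pixels.

```python
from typing import Dict, Tuple

# Transcribed from the resolution table; values are (width, height) in pixels.
RESOLUTION_BUCKETS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Look up the training frame size for a resolution/aspect-ratio pair."""
    try:
        return RESOLUTION_BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}")


print(bucket_size())                # defaults: medium, 1:1
print(bucket_size("high", "16:9"))
```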

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for inpainting: scene splitting would desync a clip from its mask. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — the source video to inpaint.
`mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep. Resolution is resized automatically.

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
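The billing rule reduces to a one-line formula. The sketch below is illustrative only: `billable_units` is a hypothetical helper, and its `succeeded` flag models just the failure cases named above (validation errors and dataset-download failures).

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run under the stated rule (hypothetical helper)."""
    if not succeeded:
        # Failures before training completes are billed 0 units.
        return 0
    return max(100, number_of_steps)


print(billable_units(2000))                   # default-length run
print(billable_units(100))                    # shortest allowed run
print(billable_units(2000, succeeded=False))  # rejected request
```

Since `number_of_steps` has a minimum of 100, a successful run is in practice billed exactly its step count.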

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, clips matched to their masks and captions, masks converted to the internal convention, and a video mask normalized to its clip's frame count.
Training — the model regenerates the masked region while the unmasked pixels are held as conditioning. Validation previews run at intervals.
Output — the LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Clip/mask matching: each `<name>` clip is paired with its `<name>_mask` file and optional `<name>.txt`. A clip without a mask causes a clear error.
Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks are applied to every frame; a video mask is normalized to its clip's frame count (a shorter mask freeze-holds its last frame, a longer mask is trimmed).
Video fitting: clips are resized to fill the resolution bucket and center-cropped.
Captions: the trigger phrase (if set) is prepended.

Tips for Getting Good Results

Dataset Quality

Use at least 10 clip+mask examples representative of the regions you want regenerated.
Make masks cleanly cover the area to edit, with a little margin so edges blend.
Make sure the BLACK (kept) region of each clip already looks exactly as you want it preserved.
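One way to add the small margin mentioned above is to grow the edit box by a few pixels before rasterizing it into a mask. This is a minimal pure-Python sketch with hypothetical helper names; it builds the mask as nested lists of 0/255 values and leaves writing an actual `.png` to whatever image library you already use.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1/y1 exclusive


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow a box by margin pixels on every side, clamped to the frame."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def rectangle_mask(width: int, height: int, box: Box) -> List[List[int]]:
    """WHITE (255) inside the box = regenerate, BLACK (0) elsewhere = keep."""
    x0, y0, x1, y1 = box
    return [[255 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
            for y in range(height)]


box = expand_box((2, 2, 4, 4), margin=1, width=6, height=6)
mask = rectangle_mask(6, 6, box)
for row in mask:
    print("".join("#" if value else "." for value in row))
```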

Mask Best Practices

Remember: WHITE = regenerate, BLACK = keep.
For a static region, a single image mask is simplest. For a moving region, use a per-frame video mask.
Masks are resized to match the video automatically; you do not need to match resolutions.

Caption Best Practices

Describe what should appear in the regenerated region (and the overall scene), optionally with a trigger phrase.
Keep captions consistent across examples.

Good caption: `a person walking, with the logo on their shirt replaced by tronlog0`
Weak caption: `person`

Scene Splitting

Scene splitting is off because it would desync a clip from its mask. If you have long footage, pre-split the clips (and their masks) before uploading.

Inference Format Matching

At inference, supply a video and a mask in the same WHITE=edit convention, and use the same caption style and trigger phrase.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/inpaint_examples.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a person walking",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png"
    }
  ]
}
```
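The configuration above can also be assembled and checked in code before submission. The sketch below is a hedged example, not an official client recipe: `build_inpaint_request` and `submit` are hypothetical helpers, the local checks mirror only the limits stated in this document (at most 2 validation entries, each with `video_url` and `mask_url`), and `submit` assumes the third-party `fal_client` package is installed, that a `FAL_KEY` is configured, and that this endpoint is reachable through the generic `fal_client.subscribe` call.

```python
import json
from typing import Any, Dict, List, Optional

ENDPOINT = "fal-ai/ltx23-trainer-v2/inpaint"


def build_inpaint_request(training_data_url: str,
                          validation: Optional[List[Dict[str, str]]] = None,
                          **overrides: Any) -> Dict[str, Any]:
    """Build a request payload from the recommended starting configuration."""
    samples = list(validation or [])
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for sample in samples:
        missing = {"video_url", "mask_url"} - sample.keys()
        if missing:
            raise ValueError(f"validation entry is missing {sorted(missing)}")
    request: Dict[str, Any] = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "validation": samples,
    }
    request.update(overrides)
    return request


def submit(request: Dict[str, Any]) -> Any:
    """Send the request and wait for the result (needs network access)."""
    import fal_client  # imported lazily so the builder works without it
    return fal_client.subscribe(ENDPOINT, arguments=request, with_logs=True)


request = build_inpaint_request(
    "https://example.com/inpaint_examples.zip",
    validation=[{
        "prompt": "a person walking",
        "video_url": "https://example.com/source.mp4",
        "mask_url": "https://example.com/mask.png",
    }],
)
print(json.dumps(request, indent=2))
# result = submit(request)  # uncomment to start a (billed) training run
```

Keeping the payload builder separate from the network call lets you check a configuration locally without starting a billed run.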

Diagnosing Issues

Regenerated region does not blend: add more examples; make masks cover the edit area cleanly with a small margin; describe the desired content in captions.
Wrong area edited: check mask polarity (WHITE = regenerate). Inverted masks edit the wrong region.
Mask/clip desync: avoid scene splitting; provide matched, pre-split clips and masks.
Overfitting: use fewer steps, lower the `rank`, or add more examples.

Validation Prompt Tips

Use a fresh video + mask to gauge generalization.
Describe the content you expect in the masked region.

Common Pitfalls

Inverted mask polarity (BLACK where you meant WHITE).
Missing `<name>_mask` files.
Relying on scene splitting (disabled here).

fal-ai/ltx23-trainer-v2/inpaint
