fal-ai/ltx23-trainer-v2/inpaint

Train a LoRA that regenerates a masked region of a video while keeping the rest unchanged, blending the new content with its surroundings.

Training cost scales with the number of steps: cost = $0.0024 × steps. For example, 1000 steps cost $2.40.

LTX 2.3 Trainer — Video Inpainting (`/inpaint`)

Overview

The `/inpaint` endpoint trains a LoRA for the LTX 2.3 model that regenerates a masked region of a video while keeping the rest unchanged. Each training clip is paired with a mask marking the area to regenerate; the model learns to fill that region in a way that blends with the kept pixels. At inference you supply a video and a mask, and the model regenerates only the masked area.

Inpainting operates only on the spatial and temporal dimensions of the video, so training is always video-only; audio is unaffected.

Key features:

  • Learns to regenerate a masked region while preserving the unmasked surroundings.
  • Masks can be a single image (applied to every frame) or a video mask (per-frame).
  • Standard mask convention: WHITE = the region to regenerate/edit, BLACK = keep unchanged.
  • Validation previews inpaint a supplied video using a supplied mask.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) where each example is a clip plus its mask:

  • `<name>.<ext>` — the source video (`.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`).
  • `<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
  • `<name>.txt` — optional caption.

Every clip needs a matching `<name>_mask` file. The mask resolution need not match the video; it is resized automatically. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 examples.
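
The video-mask normalization rule above can be sketched as an index mapping. This is a minimal illustration, not the trainer's actual code; the function name `mask_frame_indices` is hypothetical.

```python
def mask_frame_indices(mask_frames: int, clip_frames: int) -> list[int]:
    """For each clip frame, return the index of the mask frame applied to it."""
    if mask_frames < 1 or clip_frames < 1:
        raise ValueError("both frame counts must be at least 1")
    last = mask_frames - 1
    # Clip frame i uses mask frame i while one exists, then holds the last
    # mask frame. Mask frames past the clip's length are never referenced,
    # which corresponds to trimming a longer mask.
    return [min(i, last) for i in range(clip_frames)]


print(mask_frame_indices(3, 5))  # shorter mask: last frame is held
print(mask_frame_indices(5, 3))  # longer mask: extra frames are dropped
```

An image mask behaves like a one-frame video mask under this mapping: every clip frame uses index 0.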

Minimum clip length: with `auto_scale_input` off (the default), each video should have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are dropped during preprocessing; if no clip is long enough, the run fails with no usable training data. Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
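
As a rough sketch of the minimum-length rule, the filter below keeps only clips with enough frames and fails when none remain. The function name and the `frame_counts` mapping are hypothetical, and it assumes that with `auto_scale_input` on no clip is dropped for length.

```python
def usable_clips(frame_counts: dict[str, int],
                 number_of_frames: int = 89,
                 auto_scale_input: bool = False) -> list[str]:
    """Return the clip names that survive the minimum-length check."""
    if auto_scale_input:
        # Assumption: shorter clips are resampled rather than dropped.
        kept = sorted(frame_counts)
    else:
        kept = sorted(name for name, count in frame_counts.items()
                      if count >= number_of_frames)
    if not kept:
        raise ValueError("no usable training data")
    return kept


counts = {"clip01.mp4": 120, "clip02.mp4": 60}
print(usable_clips(counts))                         # only the long clip
print(usable_clips(counts, auto_scale_input=True))  # both clips
print(round(89 / 24, 1))                            # default minimum, in seconds
```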

Example layout:

clip01.mp4   clip01_mask.png   clip01.txt
clip02.mp4   clip02_mask.mp4   clip02.txt

Video masks require a probeable clip. If a clip paired with a video mask has a frame count that cannot be determined (for example, an unusual or variable-frame-rate encoding), the request is rejected (HTTP 422). Re-encode the clip to a standard format, or supply an image mask instead.
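
Before uploading, it can help to check the archive layout locally. The sketch below pairs clips with masks by name and reports missing or duplicate entries. It is an illustration under stated assumptions, not the service's validator: `check_archive` is a hypothetical helper, and it assumes video masks use the same extensions as source videos.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def check_archive(names):
    """Return (pairs, problems) for a list of archive member names."""
    clips, masks, problems = {}, {}, []
    for name in names:
        path = PurePosixPath(name)
        # Skip directories, hidden files, and macOS metadata.
        if (name.endswith("/") or "__MACOSX" in path.parts
                or path.name.startswith(".")):
            continue
        stem, ext = path.stem, path.suffix.lower()
        if stem.endswith("_mask") and ext in VIDEO_EXTS | IMAGE_EXTS:
            target, key = masks, stem[: -len("_mask")]
        elif ext in VIDEO_EXTS:
            target, key = clips, stem
        else:
            continue  # captions and unrelated files need no pairing
        if key in target:
            problems.append(f"{name}: duplicate name for '{key}'")
        target[key] = name
    for key in sorted(clips):
        if key not in masks:
            problems.append(f"{clips[key]}: missing {key}_mask file")
    pairs = {k: (clips[k], masks[k]) for k in sorted(clips) if k in masks}
    return pairs, problems


names = ["clip01.mp4", "clip01_mask.png", "clip01.txt",
         "clip02.mp4", "clip02_mask.mp4", ".DS_Store", "clip03.mov"]
pairs, problems = check_archive(names)
print(pairs)
print(problems)
```

For a real archive, the name list could come from `zipfile.ZipFile(path).namelist()` in the standard library.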

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of clip + mask examples.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA rank, which sets the adapter's capacity. Lower values reduce the risk of overfitting on small datasets.

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
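
The snapping rule can be written as a single expression. The sketch below is illustrative; `snap_frame_count` is a hypothetical name, and rejecting out-of-range values (rather than clamping them) is an assumption.

```python
def snap_frame_count(frames: int, lo: int = 9, hi: int = 121) -> int:
    """Snap down to the nearest count satisfying frames % 8 == 1."""
    if not lo <= frames <= hi:
        raise ValueError(f"number_of_frames must be in [{lo}, {hi}]")
    # (frames - 1) % 8 is the distance to the valid count at or below `frames`.
    return frames - (frames - 1) % 8


for requested in (89, 96, 100, 121):
    print(requested, "->", snap_frame_count(requested))
```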

`frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for inpainting: scene splitting would desync a clip from its mask. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `video_url` (`string`, required) — the source video to inpaint.
  • `mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep. Resolution is resized automatically.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
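
To estimate the charge for a run, the billing rule can be combined with the per-step price of $0.0024 quoted for this endpoint. This sketch assumes one billable unit costs the same as one step; `estimate_cost_usd` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_UNIT = Decimal("0.0024")  # assumption: one unit is priced like one step


def estimate_cost_usd(number_of_steps: int, succeeded: bool = True) -> Decimal:
    """Estimated charge in USD for a training request."""
    if not succeeded:
        return Decimal("0")  # failed requests are billed 0 units
    units = max(100, number_of_steps)  # successful runs bill at least 100 units
    return units * PRICE_PER_UNIT


print(estimate_cost_usd(1000))  # 1000 steps
print(estimate_cost_usd(2000))  # default number_of_steps
```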

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, clips matched to their masks and captions, masks converted to the internal convention, and a video mask normalized to its clip's frame count.
  2. Training — the model regenerates the masked region while the unmasked pixels are held as conditioning. Validation previews run at intervals.
  3. Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • Clip/mask matching: each `<name>` clip is paired with its `<name>_mask` file and optional `<name>.txt`. A clip without a mask causes a clear error.
  • Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks are applied to every frame; a video mask is normalized to its clip's frame count (a shorter mask freeze-holds its last frame, a longer mask is trimmed).
  • Video fitting: clips are resized to fill the resolution bucket and center-cropped.
  • Captions: the trigger phrase (if set) is prepended.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 clip+mask examples representative of the regions you want regenerated.
  • Make masks cleanly cover the area to edit, with a little margin so edges blend.
  • Keep the kept (BLACK) region exactly as you want it preserved.
Mask Best Practices
  • Remember: WHITE = regenerate, BLACK = keep.
  • For a static region, a single image mask is simplest. For a moving region, use a per-frame video mask.
  • Masks are resized to match the video automatically; you do not need to match resolutions.
Caption Best Practices
  • Describe what should appear in the regenerated region (and the overall scene), optionally with a trigger phrase.
  • Keep captions consistent across examples.

  • Good caption: `a person walking, with the logo on their shirt replaced by tronlog0`
  • Weak caption: `person`

Scene Splitting

Scene splitting is off because it would desync a clip from its mask. If you have long footage, pre-split the clips (and their masks) before uploading.

Inference Format Matching

At inference, supply a video and a mask in the same WHITE=edit convention, and use the same caption style and trigger phrase.

Example training request:

```json
{
  "training_data_url": "https://example.com/inpaint_examples.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a person walking",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png"
    }
  ]
}
```
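
A request body like the one above can be assembled and checked locally before submission. The sketch below applies the defaults and ranges from the parameter reference; `build_request` is a hypothetical helper, it covers only a subset of parameters, and the service performs its own validation regardless.

```python
ALLOWED = {
    "rank": {8, 16, 32, 64, 128},
    "resolution": {"low", "medium", "high"},
    "aspect_ratio": {"16:9", "1:1", "9:16"},
}
RANGES = {
    "number_of_steps": (100, 20000),
    "number_of_frames": (9, 121),
    "frame_rate": (8, 60),
}
DEFAULTS = {
    "trigger_phrase": "", "rank": 32, "number_of_steps": 2000,
    "learning_rate": 0.0002, "number_of_frames": 89, "frame_rate": 24,
    "resolution": "medium", "aspect_ratio": "1:1",
}


def build_request(training_data_url, validation=(), **overrides):
    """Merge overrides into the documented defaults and check basic limits."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    request = {"training_data_url": training_data_url, **DEFAULTS, **overrides}
    for key, allowed in ALLOWED.items():
        if request[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    for key, (lo, hi) in RANGES.items():
        if not lo <= request[key] <= hi:
            raise ValueError(f"{key} must be in [{lo}, {hi}]")
    samples = list(validation)
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for sample in samples:
        for field in ("video_url", "mask_url"):
            if not sample.get(field):
                raise ValueError(f"validation sample is missing {field}")
    request["validation"] = samples
    return request


request = build_request(
    "https://example.com/inpaint_examples.zip",
    validation=[{"prompt": "a person walking",
                 "video_url": "https://example.com/source.mp4",
                 "mask_url": "https://example.com/mask.png"}],
)
print(request["number_of_steps"], request["resolution"])
```

The resulting dictionary could then be passed as the request arguments to whichever client you use to call `fal-ai/ltx23-trainer-v2/inpaint`.
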
Diagnosing Issues
  • Regenerated region does not blend: add more examples; make masks cover the edit area cleanly with a small margin; describe the desired content in captions.
  • Wrong area edited: check mask polarity (WHITE = regenerate). Inverted masks edit the wrong region.
  • Mask/clip desync: avoid scene splitting; provide matched, pre-split clips and masks.
  • Overfitting: fewer steps, lower `rank`, more examples.
Validation Prompt Tips
  • Use a fresh video + mask to gauge generalization.
  • Describe the content you expect in the masked region.
Common Pitfalls
  • Inverted mask polarity (BLACK where you meant WHITE).
  • Missing `<name>_mask` files.
  • Relying on scene splitting (disabled here).