fal-ai/ltx23-trainer-v2/inpaint
The cost of training depends on the number of steps. The formula is: 0.0024 * steps. With 1000 steps, your request will cost $2.40.
LTX 2.3 Trainer — Video Inpainting (`/inpaint`)
Overview
The `/inpaint` endpoint trains a LoRA for the LTX 2.3 model that regenerates a masked region of a video while keeping the rest unchanged. Each training clip is paired with a mask marking the area to regenerate; the model learns to fill that region in a way that blends with the kept pixels. At inference you supply a video and a mask, and the model regenerates only the masked area.
Inpainting operates only on the spatial and temporal dimensions of the video, so training is always video-only and audio is unaffected.
Key features:
- Learns to regenerate a masked region while preserving the unmasked surroundings.
- Masks can be a single image (applied to every frame) or a video mask (per-frame).
- Standard mask convention: WHITE = the region to regenerate/edit, BLACK = keep unchanged.
- Validation previews inpaint a supplied video using a supplied mask.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) where each example is a clip plus its mask:
- `<name>.<ext>`: the source video (`.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`).
- `<name>_mask.<ext>`: the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
- `<name>.txt`: optional caption.
Every clip needs a matching `<name>_mask` file. The mask resolution need not match the video; it is resized automatically. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 examples.
Minimum clip length: with `auto_scale_input` off (the default), each video should have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are dropped during preprocessing; if no clip is long enough, the run fails with no usable training data. Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
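The minimum-length rule above can be checked locally before uploading. The following is a minimal sketch, not the service's actual preprocessing code: the function names are hypothetical, and it assumes you already know each clip's frame count (for example from a probe tool).

```python
def min_clip_seconds(number_of_frames=89, frame_rate=24):
    """Shortest clip duration that still provides number_of_frames frames."""
    return number_of_frames / frame_rate


def select_usable_clips(frame_counts, number_of_frames=89, auto_scale_input=False):
    """Return the clip names that would survive the minimum-length rule.

    frame_counts maps a clip name to its frame count (hypothetical input;
    obtain the counts however you normally probe your videos).
    """
    if auto_scale_input:
        # Shorter clips are resampled to the target count, so none are dropped.
        return sorted(frame_counts)
    kept = sorted(name for name, count in frame_counts.items()
                  if count >= number_of_frames)
    if not kept:
        # Mirrors the documented failure: no clip is long enough.
        raise ValueError("no usable training data: every clip is shorter "
                         "than number_of_frames")
    return kept
```

For example, `select_usable_clips({"clip01": 120, "clip02": 48})` keeps only `clip01` with the defaults, while passing `auto_scale_input=True` keeps both.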
Example layout:
```
clip01.mp4
clip01_mask.png
clip01.txt
clip02.mp4
clip02_mask.mp4
clip02.txt
```
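Before uploading, it may help to check that an archive follows this layout. The sketch below is an illustrative local pre-flight check using only the Python standard library. `check_archive` is a hypothetical helper, not part of the trainer, and the way it skips macOS metadata and hidden files is an assumption about what the service ignores.

```python
import zipfile
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def check_archive(zip_path):
    """Return a list of layout problems (an empty list means none were found)."""
    clips, masks, seen, problems = {}, {}, set(), []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            path = PurePosixPath(info.filename)
            # Assumption: directories, __MACOSX entries, and dotfiles are ignored.
            if info.is_dir() or "__MACOSX" in path.parts or path.name.startswith("."):
                continue
            if path.name in seen:
                problems.append(f"duplicate file name: {path.name}")
            seen.add(path.name)
            stem, ext = path.stem, path.suffix.lower()
            if stem.endswith("_mask") and ext in VIDEO_EXTS | IMAGE_EXTS:
                masks[stem[: -len("_mask")]] = path.name
            elif ext in VIDEO_EXTS:
                clips[stem] = path.name
    for name in sorted(clips):
        if name not in masks:
            problems.append(f"{clips[name]}: no matching {name}_mask file")
    for name in sorted(masks):
        if name not in clips:
            problems.append(f"{masks[name]}: mask has no matching clip")
    if len(clips) < 10:
        problems.append(f"only {len(clips)} clips; aim for at least 10")
    return problems
```

Calling `check_archive("inpaint_examples.zip")` and printing the returned list gives a quick summary of missing masks, orphaned masks, and duplicate file names.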
Video masks require a probeable clip. If a clip paired with a video mask has a frame count that cannot be determined (for example, an unusual or variable-frame-rate encoding), the request is rejected (HTTP 422). Re-encode the clip to a standard format, or supply an image mask instead.
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of clip + mask examples.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA rank, which sets the adapter's capacity. Lower values reduce the risk of overfitting.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
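The snapping rule can be written out as a short sketch. `snap_frame_count` is a hypothetical helper; rejecting out-of-range values with an error is an assumption, since the text only states the allowed range.

```python
def snap_frame_count(frames, lowest=9, highest=121):
    """Snap down to the nearest count that satisfies frames % 8 == 1."""
    if not lowest <= frames <= highest:
        # Assumption: values outside the documented range are rejected.
        raise ValueError(f"number_of_frames must be in [{lowest}, {highest}]")
    # (frames - 1) % 8 is the distance above the nearest valid count below.
    return frames - ((frames - 1) % 8)
```

For instance, 96 snaps down to 89, while 97 is already valid and is returned unchanged.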
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
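To see how a source clip maps onto one of the buckets in the table, the sketch below looks up the target size and computes a fill-then-center-crop geometry, which is how the "What Happens to Your Data" section describes video fitting. The helper name and the rounding choices are assumptions; the service's exact arithmetic may differ.

```python
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def fit_to_bucket(src_w, src_h, resolution="medium", aspect_ratio="1:1"):
    """Return (resized_size, crop_box) for a fill-then-center-crop fit."""
    target_w, target_h = BUCKETS[resolution][aspect_ratio]
    # Scale so the frame covers the bucket in both dimensions ("fill").
    scale = max(target_w / src_w, target_h / src_h)
    resized_w = max(target_w, round(src_w * scale))
    resized_h = max(target_h, round(src_h * scale))
    # Center the crop window inside the resized frame.
    left = (resized_w - target_w) // 2
    top = (resized_h - target_h) // 2
    return (resized_w, resized_h), (left, top, left + target_w, top + target_h)
```

A 1920×1080 clip fitted to `medium` / `16:9` is resized to roughly 796×448 and then cropped horizontally to 768×448, so a little content is lost at the left and right edges.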
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `false`
Off for inpainting: scene splitting would desync a clip from its mask. Provide pre-split clips instead.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration threshold for scene splitting (only relevant if splitting were enabled).
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `video_url` (`string`, required): the source video to inpaint.
- `mask_url` (`string`, required): the mask (image or video). WHITE = regenerate, BLACK = keep. The mask is resized to match the video automatically.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel (when validation samples were provided).
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed as `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
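As a rough estimator, the billing rule can be combined with the per-step price quoted at the top of the page (0.0024 per step). Treating one billable unit as one step at that price is an assumption, and both function names are hypothetical.

```python
def billable_units(number_of_steps):
    """Units billed for a successful run, per the rule above."""
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps, usd_per_unit=0.0024):
    """Estimated cost; assumes each billable unit costs usd_per_unit."""
    return round(billable_units(number_of_steps) * usd_per_unit, 4)
```

Under this assumption the default of 2000 steps comes to about $4.80, and 1000 steps to $2.40, matching the example in the pricing note.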
How the Training Works
Pipeline Overview
- Preprocessing — the archive is extracted, clips matched to their masks and captions, masks converted to the internal convention, and a video mask normalized to its clip's frame count.
- Training — the model regenerates the masked region while the unmasked pixels are held as conditioning. Validation previews run at intervals.
- Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- Clip/mask matching: each `<name>` clip is paired with its `<name>_mask` file and optional `<name>.txt`. A clip without a mask causes a clear error.
- Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks are applied to every frame; a video mask is normalized to its clip's frame count (a shorter mask freeze-holds its last frame, a longer mask is trimmed).
- Video fitting: clips are resized to fill the resolution bucket and center-cropped.
- Captions: the trigger phrase (if set) is prepended.
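The video-mask normalization described in the mask-handling step can be illustrated with a small sketch. It treats a mask as a plain list of frames and is not the service's implementation; `normalize_mask_frames` is a hypothetical name.

```python
def normalize_mask_frames(mask_frames, clip_frame_count):
    """Match a video mask to its clip's frame count.

    A longer mask is trimmed; a shorter mask repeats ("freeze-holds")
    its last frame until the counts match.
    """
    if not mask_frames:
        raise ValueError("mask has no frames")
    if len(mask_frames) >= clip_frame_count:
        return list(mask_frames[:clip_frame_count])
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return list(mask_frames) + padding
```

An image mask corresponds to the single-frame case: `normalize_mask_frames([mask], 89)` yields the same mask for all 89 frames.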
Tips for Getting Good Results
Dataset Quality
- Use at least 10 clip+mask examples representative of the regions you want regenerated.
- Make masks cleanly cover the area to edit, with a little margin so edges blend.
- Make sure the BLACK (kept) region looks exactly as you want it preserved.
Mask Best Practices
- Remember: WHITE = regenerate, BLACK = keep.
- For a static region, a single image mask is simplest. For a moving region, use a per-frame video mask.
- Masks are resized to match the video automatically; you do not need to match resolutions.
Caption Best Practices
- Describe what should appear in the regenerated region (and the overall scene), optionally with a trigger phrase.
- Keep captions consistent across examples.
Good caption: `a person walking, with the logo on their shirt replaced by tronlog0`
Weak caption: `person`
Scene Splitting
Scene splitting is off because it would desync a clip from its mask. If you have long footage, pre-split the clips (and their masks) before uploading.
Inference Format Matching
At inference, supply a video and a mask in the same WHITE=edit convention, and use the same caption style and trigger phrase.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/inpaint_examples.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a person walking",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png"
    }
  ]
}
```
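If you build this request in code, a local check against the documented ranges can catch mistakes before a request is rejected. The sketch below is one possible approach: `build_request` is a hypothetical helper, and it checks only the limits stated in the parameter reference, not everything the service validates.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}
ALLOWED_RESOLUTIONS = {"low", "medium", "high"}
ALLOWED_ASPECTS = {"16:9", "1:1", "9:16"}


def build_request(training_data_url, **overrides):
    """Start from the recommended configuration and apply overrides."""
    request = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "validation": [],
    }
    request.update(overrides)

    errors = []
    if request["rank"] not in ALLOWED_RANKS:
        errors.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= request["number_of_steps"] <= 20000:
        errors.append("number_of_steps must be in 100-20000")
    if not 9 <= request["number_of_frames"] <= 121:
        errors.append("number_of_frames must be in 9-121")
    if not 8 <= request["frame_rate"] <= 60:
        errors.append("frame_rate must be in 8-60")
    if request["resolution"] not in ALLOWED_RESOLUTIONS:
        errors.append("resolution must be low, medium, or high")
    if request["aspect_ratio"] not in ALLOWED_ASPECTS:
        errors.append("aspect_ratio must be 16:9, 1:1, or 9:16")
    if len(request["validation"]) > 2:
        errors.append("validation accepts at most 2 entries")
    for index, sample in enumerate(request["validation"]):
        for key in ("video_url", "mask_url"):
            if not sample.get(key):
                errors.append(f"validation[{index}] is missing {key}")
    if errors:
        raise ValueError("; ".join(errors))
    return request
```

The returned dictionary has the same shape as the JSON above. Assuming you use the `fal_client` Python package, it could then be passed as the `arguments` of a call such as `fal_client.subscribe("fal-ai/ltx23-trainer-v2/inpaint", arguments=request)`.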
Diagnosing Issues
- Regenerated region does not blend: add more examples; make masks cover the edit area cleanly with a small margin; describe the desired content in captions.
- Wrong area edited: check mask polarity (WHITE = regenerate). Inverted masks edit the wrong region.
- Mask/clip desync: avoid scene splitting; provide matched, pre-split clips and masks.
- Overfitting: use fewer steps, a lower `rank`, or more examples.
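For the mask-polarity issue in particular, a quick numeric check can help. The sketch below treats a mask as rows of 8-bit grayscale values and measures how much of it is WHITE; the threshold of 128 and both helper names are assumptions made for illustration. If the edit region is meant to be small but most of the mask is WHITE, the mask is probably inverted.

```python
def white_fraction(mask, threshold=128):
    """Fraction of pixels at or above threshold (treated as WHITE = regenerate)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    white = sum(1 for row in mask for pixel in row if pixel >= threshold)
    return white / total


def invert_mask(mask):
    """Swap WHITE and BLACK in an 8-bit grayscale mask."""
    return [[255 - pixel for pixel in row] for row in mask]
```

With a real image you would first load the pixel rows using an imaging library of your choice; the functions here only cover the arithmetic.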
Validation Prompt Tips
- Use a fresh video + mask to gauge generalization.
- Describe the content you expect in the masked region.
Common Pitfalls
- Inverted mask polarity (BLACK where you meant WHITE).
- Missing `<name>_mask` files.
- Relying on scene splitting (disabled here).