fal-ai/ltx23-trainer-v2/v2v-masked

Train a LoRA that regenerates only the masked region of a video, guided by both the kept pixels and a separate reference/control video.
Training
Commercial use

Training cost scales with the number of steps: cost (USD) = 0.0055 × steps. For example, 1000 steps cost $5.50.

LTX 2.3 Trainer — Masked Video-to-Video (`/v2v-masked`)

Overview

The `/v2v-masked` endpoint trains a LoRA for the LTX 2.3 model that regenerates only the masked region of a target video, guided by both the kept (unmasked) pixels and a separate reference/control video. It combines inpainting (regenerate just the masked area) with video-to-video control (a reference clip steers what goes there). You provide triplets — a reference clip, a target clip, and a mask — and the LoRA learns the masked, reference-guided transformation.

This is the right endpoint when you want a localized, control-driven edit: replace a region of footage with content guided by a reference, while everything outside the mask stays untouched.

Key features:

  • Trains a LoRA that regenerates a masked region using both the kept pixels and a reference video.
  • Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
  • Masks can be a single image (every frame) or a video mask (per-frame).
  • Video-only (no audio is learned).

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name:

  • `<name>_start.<ext>` — the reference / control video that guides the regenerated region.
  • `<name>_end.<ext>` — the target video.
  • `<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
  • `<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Every target needs both a matching `_start` reference and a `_mask`. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 triplets.
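
The video-mask normalization described above can be sketched as follows. This is a minimal illustration rather than the trainer's actual code: `normalize_mask_frames` is a hypothetical helper, and frames are represented as plain list items.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Fit a video mask to a clip's frame count (hypothetical helper).

    A longer mask is trimmed; a shorter mask repeats (freeze-holds) its last frame.
    """
    if not mask_frames:
        raise ValueError("mask has no frames")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        return mask_frames[:clip_frame_count]
    # Pad with copies of the final mask frame until the lengths match.
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding
```

For example, a two-frame mask fitted to a four-frame clip yields its first frame once and its last frame three times.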

Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
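
The skip rule above can be expressed as a small filter. This is a sketch under stated assumptions: the `Triplet` class and `select_usable` function are hypothetical names, and frame counts are assumed to be known in advance (for example, read with a video probing tool).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Triplet:
    name: str
    start_frames: int  # frame count of <name>_start (reference)
    end_frames: int    # frame count of <name>_end (target)


def select_usable(
    triplets: List[Triplet],
    number_of_frames: int = 89,
    auto_scale_input: bool = False,
) -> Tuple[List[Triplet], List[Triplet]]:
    """Split triplets into (usable, skipped) following the documented rule."""
    if auto_scale_input:
        # Shorter clips are resampled instead of skipped.
        return list(triplets), []
    usable, skipped = [], []
    for triplet in triplets:
        shortest = min(triplet.start_frames, triplet.end_frames)
        (usable if shortest >= number_of_frames else skipped).append(triplet)
    return usable, skipped
```

If the usable list comes back empty with `auto_scale_input` off, the request would be expected to fail with a 422, so it is worth running a check like this before uploading.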

Example layout:

sample01_start.mp4   sample01_end.mp4   sample01_mask.png   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02_mask.mp4   sample02.txt
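
Before uploading, the naming rules can be checked locally. The sketch below is not part of the trainer: `find_incomplete_triplets` is a hypothetical helper that groups file names by base name and reports which of `_start`, `_end`, or `_mask` is missing. It skips hidden files and macOS metadata, mirroring what the trainer is documented to ignore.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ROLES = {"start", "end", "mask"}


def find_incomplete_triplets(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each base name to the roles it is missing (hypothetical helper)."""
    found = defaultdict(set)
    for file_name in file_names:
        base = os.path.basename(file_name)
        if not base or base.startswith(".") or "__MACOSX" in file_name:
            continue  # hidden files and macOS metadata
        stem, ext = os.path.splitext(base)
        for role in ROLES:
            suffix = "_" + role
            if not stem.endswith(suffix):
                continue
            # Masks may be images or videos; references and targets are videos.
            allowed = (VIDEO_EXTS | IMAGE_EXTS) if role == "mask" else VIDEO_EXTS
            if ext.lower() in allowed:
                found[stem[: -len(suffix)]].add(role)
    return {
        name: sorted(ROLES - roles)
        for name, roles in found.items()
        if roles != ROLES
    }
```

To check an existing archive, pass `zipfile.ZipFile(path).namelist()` as `file_names`. An empty result means every base name has all three pieces.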

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of reference/target/mask triplets.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

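LoRA rank, which sets the adapter's capacity. Higher ranks can fit more complex edits but produce larger files and overfit more easily.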

`number_of_steps`

Type: `integer` Default: `3000` (range `100` to `20000`)

Number of optimization steps. This composite transformation typically benefits from a higher step count, hence the higher default.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
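
The snapping rule can be reproduced locally to see which frame count a request will actually use. This is a sketch: `snap_frame_count` is a hypothetical helper, and raising an error for out-of-range values is an assumption about how such input is handled.

```python
def snap_frame_count(frames: int, minimum: int = 9, maximum: int = 121) -> int:
    """Snap down to the nearest count with frames % 8 == 1 (hypothetical helper)."""
    if not minimum <= frames <= maximum:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(f"frames must be between {minimum} and {maximum}")
    # Remove the remainder relative to the 8k + 1 grid.
    return frames - ((frames - 1) % 8)
```

Valid counts are 9, 17, 25, and so on up to 121, so a request for 96 frames would train on 89.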

`frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
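
The resolution table can be encoded as a lookup so that clips are prepared at the size the trainer will use. This is a sketch: `BUCKETS` and `bucket_size` are hypothetical names, and the values are read as width×height from the table above.

```python
from typing import Dict, Tuple

# (width, height) for each resolution and aspect ratio, from the table above.
BUCKETS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Return the (width, height) bucket for the given settings."""
    try:
        return BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}")
```

Note that the widescreen buckets are close to, but not exactly, 16:9 (768×448 is about 1.71:1), so some cropping should be expected for true 16:9 footage.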

`auto_scale_input`

Type: `boolean` Default: `false`

Resample input videos to match `number_of_frames` and `frame_rate`, instead of skipping clips that are too short.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Scene splitting is disabled for this mode because it would desynchronize the reference/target/mask triplet. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration threshold for scene splitting. It has no effect in this mode because scene splitting is disabled.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `video_url` (`string`, required) — the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
  • `mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep.
  • `reference_video_url` (`string`, required) — the reference (control) video guiding the regenerated region.
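
A quick local check of the `validation` array can catch missing fields before a request is submitted. This is a sketch: `check_validation` is a hypothetical helper, and it only checks the entry limit and required keys listed above.

```python
from typing import Dict, List

REQUIRED_KEYS = ("video_url", "mask_url", "reference_video_url")
MAX_VALIDATION_ENTRIES = 2


def check_validation(entries: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found in the validation entries."""
    problems = []
    if len(entries) > MAX_VALIDATION_ENTRIES:
        problems.append(
            f"{len(entries)} entries given, at most {MAX_VALIDATION_ENTRIES} allowed"
        )
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"entry {index}: missing {key}")
    return problems
```

An empty list means the entries satisfy the documented shape; it does not confirm that the URLs are reachable.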
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
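
The billing rule can be turned into a rough estimate. This is a sketch: the function names are hypothetical, and pricing each billable unit at $0.0055 (the per-step price quoted on the endpoint page) is an assumption, not something the billing rule itself states.

```python
def billable_units(number_of_steps: int) -> int:
    """Units billed for a successful run: max(100, number_of_steps)."""
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps: int, price_per_unit: float = 0.0055) -> float:
    """Estimated cost in USD. Assumption: one billable unit costs price_per_unit."""
    return round(billable_units(number_of_steps) * price_per_unit, 4)
```

Under that assumption, the default 3000 steps would come to about $16.50, and any run shorter than 100 steps would be billed as 100 units.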

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket.
  2. Training — the model regenerates only the masked region of the target, guided by both the kept pixels and the reference video; the unmasked region is held as conditioning. Validation previews run at intervals.
  3. Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. A missing reference or mask causes an explicit error; the caption is optional.
  • Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
  • Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
  • Captions: the trigger phrase (if set) is prepended.
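
The video fitting step (resize to fill, then center-crop) can be sketched as plain geometry. This is an illustration, not the trainer's implementation: `fill_and_center_crop` is a hypothetical helper, and the rounding behaviour is an assumption.

```python
from typing import Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]


def fill_and_center_crop(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[Size, Box]:
    """Return the resized size and the (left, top, right, bottom) crop box."""
    # Scale so the frame covers the bucket in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    resized_w = max(dst_w, round(src_w * scale))
    resized_h = max(dst_h, round(src_h * scale))
    # Crop the overflow equally from both sides.
    left = (resized_w - dst_w) // 2
    top = (resized_h - dst_h) // 2
    return (resized_w, resized_h), (left, top, left + dst_w, top + dst_h)
```

Because the crop depends only on the source and bucket dimensions, reference and target clips with the same source size get the same crop, which is what keeps them aligned. Content near the frame edges may be cropped away, so keep the edit region away from the borders when the source aspect ratio differs from the bucket.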
How It Works

This is a LoRA trained to perform a reference-conditioned transformation: instead of generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result. The trained file specializes the base model on your reference→target mapping. Here it also respects a mask: only the masked region is regenerated, guided by both the surrounding kept pixels and the reference clip. At inference you supply a source video, a mask, and a reference video plus a prompt.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 well-aligned triplets representative of the masked, reference-guided edit you want.
  • Make masks cleanly cover the edit area with a small margin so edges blend.
  • Reference and target should be aligned so the reference clearly drives the masked region.
Mask Best Practices
  • Remember: WHITE = regenerate, BLACK = keep.
  • Use an image mask for a static region; a video mask for a moving region.
  • Masks are resized automatically to match the target clip, so they do not need to be authored at the training resolution.
Caption Best Practices
  • Describe what should appear in the regenerated region and the overall scene, optionally with a trigger phrase.
  • Keep captions consistent across triplets.

Good caption: `repl4ce the billboard content with a sunset landscape`
Weak caption: `billboard`

Inference Format Matching

At inference, supply a source video, a mask (WHITE=edit), and a reference video, and use the same caption style and trigger phrase.

Example training request:

```json
{
  "training_data_url": "https://example.com/v2v_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the billboard with a sunset",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
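
A request like the one above can be checked against the documented ranges before it is sent, which avoids a round trip that ends in a 422. This is a sketch: `check_request` is a hypothetical helper that covers only the ranges and choices listed in the parameter reference.

```python
from typing import Any, Dict, List

RANGES = {
    "number_of_steps": (100, 20000),
    "number_of_frames": (9, 121),
    "frame_rate": (8, 60),
    "validation_number_of_frames": (9, 121),
    "validation_frame_rate": (8, 60),
    "stg_scale": (0.0, 3.0),
}
CHOICES = {
    "rank": [8, 16, 32, 64, 128],
    "resolution": ["low", "medium", "high"],
    "aspect_ratio": ["16:9", "1:1", "9:16"],
}


def check_request(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means no documented rule is broken."""
    problems: List[str] = []
    if not payload.get("training_data_url"):
        problems.append("training_data_url is required")
    for key, (low, high) in RANGES.items():
        if key in payload and not low <= payload[key] <= high:
            problems.append(f"{key}={payload[key]} is outside {low}..{high}")
    for key, allowed in CHOICES.items():
        if key in payload and payload[key] not in allowed:
            problems.append(f"{key}={payload[key]!r} is not one of {allowed}")
    if len(payload.get("validation", [])) > 2:
        problems.append("validation accepts at most 2 entries")
    return problems
```

Parameters that are omitted fall back to their defaults, so only keys present in the payload are checked.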
Diagnosing Issues
  • Regenerated region ignores the reference: ensure triplets are aligned and the reference clearly relates to the masked content; add more examples.
  • Wrong area edited: check mask polarity (WHITE = regenerate).
  • Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
  • Overfitting: fewer steps, lower `rank`, more triplets.
Validation Prompt Tips
  • Use a fresh source/mask/reference set to gauge generalization.
  • Describe the content you expect in the masked region.
Common Pitfalls
  • Inverted mask polarity.
  • Missing a `_start`, `_end`, or `_mask` for a triplet.
  • Relying on scene splitting (disabled here).