LTX 2.3 Trainer (V2) - Masked Video-to-Video (Training) API on fal

LTX 2.3 Trainer — Masked Video-to-Video (`/v2v-masked`)

Overview

The `/v2v-masked` endpoint trains a LoRA for the LTX 2.3 model that regenerates only the masked region of a target video, guided by both the kept (unmasked) pixels and a separate reference/control video. It combines inpainting (regenerate just the masked area) with video-to-video control (a reference clip steers what goes there). You provide triplets — a reference clip, a target clip, and a mask — and the LoRA learns the masked, reference-guided transformation.

This is the right endpoint when you want a localized, control-driven edit: replace a region of footage with content guided by a reference, while everything outside the mask stays untouched.

Key features:

Trains a LoRA that regenerates a masked region using both the kept pixels and a reference video.
Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
Masks can be a single image (every frame) or a video mask (per-frame).
Video-only (no audio is learned).

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name:

`<name>_start.<ext>` — the reference / control video that guides the regenerated region.
`<name>_end.<ext>` — the target video.
`<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
`<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Every target needs both a matching `_start` reference and a `_mask`. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 triplets.
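The mask normalization rule above can be illustrated with a small sketch. It works on any list of per-frame objects and assumes nothing about how the service decodes video; the function name `normalize_mask_frames` is hypothetical and used only for illustration.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Fit a per-frame mask to the clip length.

    A longer mask is trimmed; a shorter mask repeats (freeze-holds) its last frame.
    Illustrative only: the service's real implementation is not documented here.
    """
    if not mask_frames:
        raise ValueError("mask has no frames")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        return mask_frames[:clip_frame_count]
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding
```

For example, a two-frame mask stretched over a four-frame clip holds its second frame for the remaining two frames.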

Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
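A pre-flight check for the minimum clip length might look like the sketch below. It assumes you have already measured the frame count of each `_start` and `_end` clip yourself (for example with a video probing tool); the function name and the input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def select_usable_triplets(
    frame_counts: Dict[str, Tuple[int, int]],
    number_of_frames: int = 89,
) -> List[str]:
    """Return the triplet names whose clips are long enough.

    frame_counts maps a triplet base name to (start_frames, end_frames).
    Mirrors the documented rule for auto_scale_input = false: a triplet with
    either clip too short is skipped, and an empty result is an error.
    """
    usable = [
        name
        for name, (start_frames, end_frames) in frame_counts.items()
        if start_frames >= number_of_frames and end_frames >= number_of_frames
    ]
    if not usable:
        # The service reports this case as an HTTP 422 failure.
        raise ValueError("no triplet has clips of at least number_of_frames frames")
    return usable
```

Running this locally before uploading shows which triplets would be skipped, and whether enabling `auto_scale_input` is the better choice.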

Example layout:


```
sample01_start.mp4   sample01_end.mp4   sample01_mask.png   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02_mask.mp4   sample02.txt
```
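Before uploading, it can help to check that every base name in the archive forms a complete triplet. The sketch below groups file names by the naming convention described above and reports missing pieces; `group_triplets` and `find_incomplete` are hypothetical helper names, and the skipping of hidden and `__MACOSX` entries is an approximation of the service's behavior.

```python
import os
from typing import Dict, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
SUFFIX_ROLES = (("_start", "start"), ("_end", "end"), ("_mask", "mask"))


def group_triplets(file_names: List[str]) -> Dict[str, Dict[str, str]]:
    """Group archive entries by base name into start/end/mask/caption roles."""
    triplets: Dict[str, Dict[str, str]] = {}
    for file_name in file_names:
        base_name = os.path.basename(file_name)
        # Skip directories, hidden files, and macOS metadata entries.
        if not base_name or base_name.startswith(".") or "__MACOSX" in file_name:
            continue
        stem, ext = os.path.splitext(base_name)
        ext = ext.lower()
        if ext == ".txt":
            triplets.setdefault(stem, {})["caption"] = file_name
            continue
        for suffix, role in SUFFIX_ROLES:
            allowed = VIDEO_EXTS | IMAGE_EXTS if role == "mask" else VIDEO_EXTS
            if stem.endswith(suffix) and ext in allowed:
                triplets.setdefault(stem[: -len(suffix)], {})[role] = file_name
                break
    return triplets


def find_incomplete(triplets: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each base name to the required roles it lacks (captions are optional)."""
    problems: Dict[str, List[str]] = {}
    for base, roles in triplets.items():
        if not any(role in roles for role in ("start", "end", "mask")):
            continue  # a stray caption file on its own is not a triplet
        missing = [role for role in ("start", "end", "mask") if role not in roles]
        if missing:
            problems[base] = missing
    return problems
```

The file list can come from `zipfile.ZipFile(path).namelist()` in the standard library, so the check runs on the exact archive you intend to upload.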

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of reference/target/mask triplets.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `3000` (range `100`–`20000`)

Number of optimization steps. This composite transformation typically benefits from a higher step count, hence the higher default.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `number_of_frames % 8 == 1` (for example 9, 17, 25, ..., 89, ..., 121); other values are snapped down to the nearest valid count.
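The snapping rule can be written as a one-line formula. This is a minimal sketch of the arithmetic only; how the service treats values outside the documented range is not stated, so the sketch simply rejects them, and the function name is hypothetical.

```python
def snap_number_of_frames(requested: int, minimum: int = 9, maximum: int = 121) -> int:
    """Snap a frame count down to the nearest value n with n % 8 == 1."""
    if not minimum <= requested <= maximum:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(f"number_of_frames must be in [{minimum}, {maximum}]")
    return ((requested - 1) // 8) * 8 + 1
```

A request for 96 frames would therefore train on 89 frames, while 97 is already valid and stays unchanged.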

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
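The `resolution` and `aspect_ratio` settings together select one bucket size from the table. As a sketch, the lookup can be expressed as a nested dictionary; the values are copied from the table and read as width × height, and the names `RESOLUTION_BUCKETS` and `bucket_size` are hypothetical.

```python
from typing import Dict, Tuple

# (width, height) in pixels, copied from the resolution table.
RESOLUTION_BUCKETS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Return the (width, height) bucket for a resolution / aspect-ratio pair."""
    try:
        return RESOLUTION_BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from None
```

With the defaults (`medium`, `1:1`) the bucket is 768×768.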

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Disabled in this mode: scene splitting would desync the reference/target/mask triplet. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
`mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep.
`reference_video_url` (`string`, required) — the reference (control) video guiding the regenerated region.
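A client-side check of the `validation` array could look like the following sketch. It encodes only the constraints listed above (at most 2 entries, three required URL fields); the function name is hypothetical and the service may apply further checks.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("video_url", "mask_url", "reference_video_url")
MAX_VALIDATION_ENTRIES = 2


def check_validation_entries(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the input looks valid."""
    problems: List[str] = []
    if len(entries) > MAX_VALIDATION_ENTRIES:
        problems.append(f"at most {MAX_VALIDATION_ENTRIES} validation entries are allowed")
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"entry {index}: missing required field {key!r}")
    return problems
```

An entry that omits `mask_url`, for instance, yields one problem string naming that field.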

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed as `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
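The billing rule is simple enough to express directly. The sketch below returns billable units only, since no unit price is given here; the function name is hypothetical.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run: max(100, number_of_steps) on success, 0 on failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)
```

Because `number_of_steps` has a documented minimum of 100, a successful run is in practice billed exactly its step count.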

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket.
Training — the model regenerates only the masked region of the target, guided by both the kept pixels and the reference video; the unmasked region is held as conditioning. Validation previews run at intervals.
Output — the LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. A missing reference or mask causes a clear error.
Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
Captions: the trigger phrase (if set) is prepended.
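The video fitting step (resize to fill, then center-crop) comes down to a little geometry. The sketch below computes the scaled size and crop box for one clip; the rounding choices are assumptions, and the service's exact interpolation and rounding are not documented here.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def fill_and_center_crop(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[Tuple[int, int], Box]:
    """Return ((scaled_w, scaled_h), (left, top, right, bottom)) for a fill + center crop."""
    # Scale by the larger ratio so the scaled frame covers the whole bucket.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Crop the overflow equally from both sides.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)
```

Applying the same scale and crop box to the reference, target, and mask of a triplet keeps them spatially aligned, which is why all three should share the same source dimensions.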

How It Works

This is a LoRA trained to perform a reference-conditioned transformation: instead of generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result. The trained file specializes the base model on your reference→target mapping. Here it also respects a mask: only the masked region is regenerated, guided by both the surrounding kept pixels and the reference clip. At inference you supply a source video, a mask, and a reference video plus a prompt.

Tips for Getting Good Results

Dataset Quality

Use at least 10 well-aligned triplets representative of the masked, reference-guided edit you want.
Make masks cleanly cover the edit area with a small margin so edges blend.
Reference and target should be aligned so the reference clearly drives the masked region.

Mask Best Practices

Remember: WHITE = regenerate, BLACK = keep.
Use an image mask for a static region; a video mask for a moving region.
Masks are resized automatically to match their clips.
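A quick way to catch inverted polarity is to measure how much of a mask is white. The sketch below works on a grayscale mask given as rows of 0 to 255 values; the 128 threshold and both function names are assumptions for illustration, not part of the service.

```python
from typing import List

Mask = List[List[int]]


def white_fraction(mask: Mask, threshold: int = 128) -> float:
    """Fraction of pixels treated as WHITE (regenerate) at the given threshold."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    white = sum(1 for row in mask for value in row if value >= threshold)
    return white / total


def invert_mask(mask: Mask) -> Mask:
    """Swap polarity so WHITE becomes BLACK and vice versa."""
    return [[255 - value for value in row] for row in mask]
```

If a mask meant to cover a small region reports a white fraction close to 1.0, it was probably authored with the opposite convention and should be inverted before upload.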

Caption Best Practices

Describe what should appear in the regenerated region and the overall scene, optionally with a trigger phrase.
Keep captions consistent across triplets.

Good caption: `repl4ce the billboard content with a sunset landscape`
Weak caption: `billboard`

Inference Format Matching

At inference, supply a source video, a mask (WHITE = edit), and a reference video, and use the same caption style and trigger phrase as in training.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/v2v_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the billboard with a sunset",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
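One way to send this configuration is through the `fal-client` Python package. The sketch below is a minimal example under several assumptions: the endpoint id `fal-ai/ltx23-trainer-v2/v2v-masked` is the identifier shown on this page, the `FAL_KEY` environment variable holds your credentials, and `build_training_request` and `submit_training` are hypothetical helper names. The shape of the returned result is not assumed beyond the output names listed under Outputs.

```python
from typing import Any, Dict

# Defaults copied from the recommended starting configuration.
DEFAULTS: Dict[str, Any] = {
    "rank": 32,
    "number_of_steps": 3000,
    "learning_rate": 0.0002,
    "number_of_frames": 89,
    "frame_rate": 24,
    "resolution": "medium",
    "aspect_ratio": "1:1",
}


def build_training_request(training_data_url: str, trigger_phrase: str = "", **overrides: Any) -> Dict[str, Any]:
    """Assemble the request body, letting callers override individual defaults."""
    unknown = set(overrides) - set(DEFAULTS) - {"validation"}
    if unknown:
        raise KeyError(f"unexpected parameters: {sorted(unknown)}")
    request: Dict[str, Any] = {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
    }
    request.update(DEFAULTS)
    request.update(overrides)
    return request


def submit_training(arguments: Dict[str, Any]) -> Any:
    """Submit the job and block until it finishes (requires network and FAL_KEY)."""
    import fal_client  # third-party: pip install fal-client

    return fal_client.subscribe(
        "fal-ai/ltx23-trainer-v2/v2v-masked",
        arguments=arguments,
        with_logs=True,
    )
```

A typical call would be `submit_training(build_training_request("https://example.com/v2v_masked_triplets.zip", trigger_phrase="repl4ce"))`. Keeping request assembly separate from submission lets you inspect or log the payload before starting a billable run.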

Diagnosing Issues

Regenerated region ignores the reference: ensure triplets are aligned and the reference clearly relates to the masked content; add more examples.
Wrong area edited: check mask polarity (WHITE = regenerate).
Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
Overfitting: fewer steps, lower `rank`, more triplets.

Validation Prompt Tips

Use a fresh source/mask/reference set to gauge generalization.
Describe the content you expect in the masked region.

Common Pitfalls

Inverted mask polarity.
Missing a `_start`, `_end`, or `_mask` for a triplet.
Relying on scene splitting (disabled here).

Endpoint ID: `fal-ai/ltx23-trainer-v2/v2v-masked`
