LTX 2.3 Trainer (V2) - Masked Video-to-Video IC-LoRA (Training) API on fal

LTX 2.3 Trainer — Masked Video-to-Video IC-LoRA (`/ic-lora/v2v-masked`)

Overview

The `/ic-lora/v2v-masked` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 model that regenerates only the masked region of a target video, guided by both the kept (unmasked) pixels and a separate reference/control video. An IC-LoRA is a small adapter that conditions on references supplied at inference rather than generating from text alone; this variant adds a mask so the transformation stays localized. It combines inpainting (regenerate just the masked area) with video-to-video control (a reference clip steers what goes there). You teach it that mapping by providing triplets — a reference clip, a target clip, and a mask — and the LoRA learns the masked, reference-guided transformation.

This is the right endpoint when you want a localized, control-driven edit, for example:

Replace a region of footage with content guided by a reference, while everything outside the mask stays untouched.
Swap or restyle a specific object/area driven by a control clip.
Any masked, reference-guided video edit you can demonstrate with paired examples and masks.

Key features:

Trains an IC-LoRA that regenerates a masked region using both the kept pixels and a reference video.
Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
Masks can be a single image (every frame) or a video mask (per-frame).
Video-only (no audio is learned).

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name:

`<name>_start.<ext>` — the reference / control video that guides the regenerated region.
`<name>_end.<ext>` — the target video.
`<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
`<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Every target needs both a matching `_start` reference and a `_mask`. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 triplets.
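The video-mask normalization described above can be sketched on a plain list of frames. This is an illustration only: `normalize_mask_frames` is a hypothetical helper, not part of the trainer, and it only mirrors the documented behavior (freeze-hold the last frame of a shorter mask, trim a longer one).

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Return a mask sequence with exactly clip_frame_count frames (hypothetical helper)."""
    if not mask_frames:
        raise ValueError("a video mask needs at least one frame")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        # Longer (or equal-length) mask: trim to the clip length.
        return mask_frames[:clip_frame_count]
    # Shorter mask: freeze-hold the last frame until the clip ends.
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding
```

An image mask behaves like a one-frame mask under this sketch: the single frame is held for the whole clip.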

Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
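To check a dataset against the minimum clip length before uploading, a local pre-check along these lines may help. It is a sketch under assumptions: `select_trainable_triplets` is a hypothetical helper, the frame counts are supplied by you (for example from your video tooling), and the `ValueError` only stands in for the 422 the service returns when no triplet qualifies.

```python
from typing import Dict, List, Tuple


def select_trainable_triplets(
    frame_counts: Dict[str, Tuple[int, int]],
    number_of_frames: int = 89,
) -> List[str]:
    """Return base names whose _start and _end clips are both long enough.

    frame_counts maps base name -> (start_frames, end_frames).
    Models the auto_scale_input=False case only.
    """
    kept = [
        name
        for name, (start_frames, end_frames) in sorted(frame_counts.items())
        if start_frames >= number_of_frames and end_frames >= number_of_frames
    ]
    if not kept:
        # Stand-in for the documented failure when no triplet qualifies.
        raise ValueError("no triplet has enough frames; enable auto_scale_input or use longer clips")
    return kept
```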

Example layout:


sample01_start.mp4   sample01_end.mp4   sample01_mask.png   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02_mask.mp4   sample02.txt
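Before zipping, it can be useful to confirm that every base name has all three pieces. The following is a minimal sketch that works on a list of file names; `group_triplets` and `find_incomplete` are hypothetical helpers, and the matching rules are inferred from the naming scheme above rather than taken from the trainer's implementation.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ROLES = ("start", "end", "mask")


def group_triplets(file_names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group archive members into {base_name: {role: file_name}}."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for file_name in file_names:
        path = PurePosixPath(file_name)
        ext = path.suffix.lower()
        if ext == ".txt":
            groups[path.stem]["caption"] = file_name
            continue
        base, _, role = path.stem.rpartition("_")
        if not base or role not in ROLES:
            continue  # does not follow the <name>_<role> scheme
        # Masks may be images or videos; references and targets are videos.
        allowed = (VIDEO_EXTS | IMAGE_EXTS) if role == "mask" else VIDEO_EXTS
        if ext in allowed:
            groups[base][role] = file_name
    return dict(groups)


def find_incomplete(groups: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {base_name: [missing roles]} for partially supplied triplets."""
    missing: Dict[str, List[str]] = {}
    for base, members in groups.items():
        absent = [role for role in ROLES if role not in members]
        # Skip groups that contain only a caption and no media at all.
        if absent and len(absent) < len(ROLES):
            missing[base] = absent
    return missing
```

An empty result from `find_incomplete` means every base name that has any media file has its `_start`, `_end`, and `_mask`.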

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of reference/target/mask triplets.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

IC-LoRA capacity.

`number_of_steps`

Type: `integer` Default: `3000` (range `100`–`20000`)

Number of optimization steps. This composite transformation typically benefits from a higher step count, hence the higher default.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
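The snapping rule can be written as a one-line calculation. This is a sketch: `snap_number_of_frames` is a hypothetical helper, and raising an error for values outside the documented range is an assumption about how you might want to handle them locally.

```python
def snap_number_of_frames(requested: int) -> int:
    """Snap down to the nearest count that satisfies frames % 8 == 1."""
    if requested < 9 or requested > 121:
        # Assumption: treat out-of-range values as an error locally.
        raise ValueError("number_of_frames must be within 9-121")
    return ((requested - 1) // 8) * 8 + 1
```

For example, a request for 100 frames snaps down to 97, while 89 and 121 are already valid and stay unchanged.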

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

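Bucket sizes in pixels (width×height):

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |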
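The table can be expressed as a lookup, which is handy when preparing clips locally. `RESOLUTION_BUCKETS` and `bucket_size` are hypothetical names; the values are copied from the table.

```python
from typing import Dict, Tuple

# (width, height) in pixels, keyed by resolution and then aspect ratio.
RESOLUTION_BUCKETS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Return the (width, height) bucket for a resolution/aspect-ratio pair."""
    try:
        return RESOLUTION_BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from None
```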

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for this mode: scene splitting would desync the reference/target/mask triplet. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
`mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep.
`reference_video_url` (`string`, required) — the reference (control) video guiding the regenerated region.
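A quick local check of the `validation` array might look like the following. It is a sketch: `check_validation` is a hypothetical helper that encodes only the constraints listed above (at most 2 entries, three required URLs per entry).

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("video_url", "mask_url", "reference_video_url")
MAX_VALIDATION_ENTRIES = 2


def check_validation(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if len(entries) > MAX_VALIDATION_ENTRIES:
        problems.append(f"validation accepts at most {MAX_VALIDATION_ENTRIES} entries")
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"validation[{index}] is missing {key}")
    return problems
```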

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained IC-LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel (when validation samples were provided).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
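The billing rule is small enough to express directly. `billable_units` and its `failed_before_completion` flag are hypothetical names used only to illustrate the formula above.

```python
def billable_units(number_of_steps: int, failed_before_completion: bool = False) -> int:
    """Units billed for a run, following the documented max(100, steps) rule."""
    if failed_before_completion:
        # Validation errors (HTTP 422) and dataset-download failures cost nothing.
        return 0
    return max(100, number_of_steps)
```

With the default of 3000 steps, a successful run is billed 3000 units.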

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket.
Training — the model regenerates only the masked region of the target, guided by both the kept pixels and the reference video; the unmasked region is held as conditioning. Validation previews run at intervals.
Output — the IC-LoRA, config, and validation reel are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. Any missing piece causes a clear error.
Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
Captions: the trigger phrase (if set) is prepended.
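The video-fitting step (resize to fill, then center-crop) can be sketched as integer arithmetic. This is an illustration under assumptions: `fill_and_center_crop` is a hypothetical helper, and rounding the scaled size up so the frame always covers the bucket is a choice made here, not a documented detail of the service.

```python
from typing import Tuple


def fill_and_center_crop(
    src_w: int, src_h: int, dst_w: int, dst_h: int
) -> Tuple[int, int, int, int]:
    """Return (scaled_w, scaled_h, crop_x, crop_y) for a fill-then-crop fit."""
    if min(src_w, src_h, dst_w, dst_h) < 1:
        raise ValueError("all dimensions must be positive")
    if src_w * dst_h >= src_h * dst_w:
        # Source is relatively wider: match heights, let the width overflow.
        scaled_h = dst_h
        scaled_w = -(-src_w * dst_h // src_h)  # integer ceiling division
    else:
        # Source is relatively taller: match widths, let the height overflow.
        scaled_w = dst_w
        scaled_h = -(-src_h * dst_w // src_w)
    # Center the crop window inside the scaled frame.
    crop_x = (scaled_w - dst_w) // 2
    crop_y = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, crop_x, crop_y
```

Applying the same scale and crop offsets to the reference, target, and mask of a triplet is what keeps them aligned.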

What an IC-LoRA Is

An IC-LoRA performs an in-context transformation using a reference supplied at inference. Here it also respects a mask: only the masked region is regenerated, guided by both the surrounding kept pixels and the reference clip. The trained file specializes the base model for your masked, reference-guided edit. At inference you supply a source video, a mask, a reference video, and a prompt, and the LoRA regenerates only the masked region.

Tips for Getting Good Results

Dataset Quality

Use at least 10 well-aligned triplets representative of the masked, reference-guided edit you want.
Make masks cleanly cover the edit area with a small margin so edges blend.
Reference and target should be aligned so the reference clearly drives the masked region.

Mask Best Practices

Remember: WHITE = regenerate, BLACK = keep.
Use an image mask for a static region; a video mask for a moving region.
Masks are resized automatically to match the target clip.

Caption Best Practices

Describe what should appear in the regenerated region and the overall scene, optionally with a trigger phrase.
Keep captions consistent across triplets.

Good caption: `repl4ce the billboard content with a sunset landscape`
Weak caption: `billboard`

Inference Format Matching

At inference, supply a source video, a mask (WHITE = regenerate), and a reference video, and use the same caption style and trigger phrase as in training.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/v2v_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the billboard with a sunset",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
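A configuration like the one above could be checked locally and then submitted from Python. The sketch below makes several assumptions: `check_config` and `submit_training` are hypothetical helpers, the checks only restate the ranges from the parameter reference, the endpoint ID is the one listed on this page, and submission assumes the third-party `fal_client` package is installed with credentials configured in the environment.

```python
from typing import Any, Dict, List

# Endpoint ID as listed on this page.
ENDPOINT_ID = "fal-ai/ltx23-trainer-v2/ic-lora/v2v-masked"

ALLOWED_RANKS = {8, 16, 32, 64, 128}
ALLOWED_RESOLUTIONS = {"low", "medium", "high"}
ALLOWED_ASPECT_RATIOS = {"16:9", "1:1", "9:16"}


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems: List[str] = []
    if not config.get("training_data_url"):
        problems.append("training_data_url is required")
    if config.get("rank", 32) not in ALLOWED_RANKS:
        problems.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= config.get("number_of_steps", 3000) <= 20000:
        problems.append("number_of_steps must be within 100-20000")
    if not 9 <= config.get("number_of_frames", 89) <= 121:
        problems.append("number_of_frames must be within 9-121")
    if not 8 <= config.get("frame_rate", 24) <= 60:
        problems.append("frame_rate must be within 8-60")
    if config.get("resolution", "medium") not in ALLOWED_RESOLUTIONS:
        problems.append("resolution must be low, medium, or high")
    if config.get("aspect_ratio", "1:1") not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect_ratio must be 16:9, 1:1, or 9:16")
    if len(config.get("validation", [])) > 2:
        problems.append("validation accepts at most 2 entries")
    return problems


def submit_training(config: Dict[str, Any]) -> Any:
    """Validate locally, then submit and wait for the result."""
    problems = check_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    import fal_client  # third-party; imported lazily so the checks run without it

    return fal_client.subscribe(ENDPOINT_ID, arguments=config)
```

Running `check_config` first catches range errors locally, before the request reaches the service and is rejected with a 422.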

Diagnosing Issues

Regenerated region ignores the reference: ensure triplets are aligned and the reference clearly relates to the masked content; add more examples.
Wrong area edited: check mask polarity (WHITE = regenerate).
Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
Overfitting: use fewer steps, a lower `rank`, or more triplets.

Validation Prompt Tips

Use a fresh source/mask/reference set to gauge generalization.
Describe the content you expect in the masked region.

Common Pitfalls

Inverted mask polarity.
Missing a `_start`, `_end`, or `_mask` for a triplet.
Relying on scene splitting (disabled here).

fal-ai/ltx23-trainer-v2/ic-lora/v2v-masked
