fal-ai/ltx23-trainer-v2/ic-lora/v2v-masked
The cost of training depends on the number of steps. The formula is: 0.0055 * steps. With 1000 steps, your request will cost $5.50.
LTX 2.3 Trainer — Masked Video-to-Video IC-LoRA (`/ic-lora/v2v-masked`)
Overview
The `/ic-lora/v2v-masked` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 model that regenerates only the masked region of a target video, guided by both the kept (unmasked) pixels and a separate reference/control video. An IC-LoRA is a small adapter that conditions on references supplied at inference rather than generating from text alone; this variant adds a mask so the transformation stays localized. It combines inpainting (regenerate just the masked area) with video-to-video control (a reference clip steers what goes there). You teach it that mapping by providing triplets — a reference clip, a target clip, and a mask — and the LoRA learns the masked, reference-guided transformation.
This is the right endpoint when you want a localized, control-driven edit, for example:
- Replace a region of footage with content guided by a reference, while everything outside the mask stays untouched.
- Swap or restyle a specific object/area driven by a control clip.
- Any masked, reference-guided video edit you can demonstrate with paired examples and masks.
Key features:
- Trains an IC-LoRA that regenerates a masked region using both the kept pixels and a reference video.
- Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
- Masks can be a single image (every frame) or a video mask (per-frame).
- Video-only (no audio is learned).
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name:
- `<name>_start.<ext>`: the reference / control video that guides the regenerated region.
- `<name>_end.<ext>`: the target video.
- `<name>_mask.<ext>`: the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
- `<name>.txt`: optional caption.
Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Every target needs both a matching `_start` reference and a `_mask`. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 triplets.
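The frame-count normalization described above can be pictured with a minimal sketch. It assumes the mask is already decoded into a list of frames; the function name and the list representation are illustrative and not part of the trainer's API.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Return a mask with exactly clip_frame_count frames.

    A longer mask is trimmed; a shorter mask repeats (freeze-holds) its last frame.
    """
    if not mask_frames:
        raise ValueError("mask has no frames")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        # Longer (or equal) mask: keep only the leading frames.
        return list(mask_frames[:clip_frame_count])
    # Shorter mask: pad with copies of the final frame.
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return list(mask_frames) + padding
```

For example, a 2-frame mask paired with a 4-frame clip would come out as frames 1, 2, 2, 2 under this sketch.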
Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
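To pre-check a dataset against this rule before uploading, something like the following sketch may help. It assumes you have already measured each clip's frame count (for instance with a video probing tool, not shown); the `frame_counts` mapping and the function name are hypothetical, and the error raised here only mimics the documented failure case locally.

```python
from typing import Dict, List, Tuple


def usable_triplets(
    frame_counts: Dict[str, Tuple[int, int]],
    number_of_frames: int = 89,
) -> List[str]:
    """Return base names whose reference and target clips are both long enough.

    frame_counts maps a base name to (start_clip_frames, end_clip_frames).
    Mirrors the documented behaviour with auto_scale_input off.
    """
    kept = [
        name
        for name, (start_frames, end_frames) in sorted(frame_counts.items())
        if start_frames >= number_of_frames and end_frames >= number_of_frames
    ]
    if not kept:
        # The service is documented to reject such a request with HTTP 422.
        raise ValueError("no triplet has enough frames; shorten number_of_frames "
                         "or enable auto_scale_input")
    return kept
```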
Example layout:
```
sample01_start.mp4
sample01_end.mp4
sample01_mask.png
sample01.txt
sample02_start.mp4
sample02_end.mp4
sample02_mask.mp4
sample02.txt
```
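A quick local check of the naming scheme can catch incomplete triplets before upload. The sketch below works on a plain list of file names (for example from `zipfile.ZipFile(...).namelist()`); the helper name is hypothetical and the check is only an approximation of whatever matching the service performs.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ROLES = ("start", "end", "mask")


def find_incomplete_triplets(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each base name to the roles (start/end/mask) it is missing."""
    found = defaultdict(set)
    for raw in file_names:
        path = PurePosixPath(raw)
        stem, ext = path.stem, path.suffix.lower()
        for role in ROLES:
            suffix = "_" + role
            if not stem.endswith(suffix):
                continue
            # Masks may be images or videos; start/end must be videos.
            allowed = (VIDEO_EXTS | IMAGE_EXTS) if role == "mask" else VIDEO_EXTS
            if ext in allowed:
                found[stem[: -len(suffix)]].add(role)
    return {
        name: sorted(set(ROLES) - roles)
        for name, roles in sorted(found.items())
        if roles != set(ROLES)
    }
```

An empty result means every base name seen has a reference, a target, and a mask; captions (`.txt`) are optional and therefore ignored here.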
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of reference/target/mask triplets.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
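Rank of the IC-LoRA adapter, which sets its capacity. Higher ranks can capture more detail but produce larger files and are more prone to overfitting.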
`number_of_steps`
Type: `integer`
Default: `3000` (range `100`–`20000`)
Number of optimization steps. This composite transformation typically benefits from a higher step count, hence the higher default.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
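The snapping rule can be worked out by hand; the sketch below shows one way to compute it, assuming "snapped down" means the largest valid count not exceeding the requested value. The function name is illustrative only.

```python
def snap_frame_count(requested: int, minimum: int = 9, maximum: int = 121) -> int:
    """Snap a frame count down to the nearest value with frames % 8 == 1."""
    if not minimum <= requested <= maximum:
        raise ValueError(f"number_of_frames must be in [{minimum}, {maximum}]")
    # Drop the remainder above the last multiple of 8, then add the extra frame.
    return ((requested - 1) // 8) * 8 + 1
```

Under this reading, a request for 96 frames would train on 89, and a request for 100 would train on 97.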
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
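For scripts that need the pixel dimensions implied by `resolution` and `aspect_ratio`, the table above can be transcribed into a lookup. This is only a local convenience sketch; the constant and function names are not part of the API.

```python
from typing import Dict, Tuple

# (width, height) per resolution tier and aspect ratio, copied from the table.
RESOLUTION_BUCKETS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Return (width, height) for a resolution tier and aspect ratio."""
    try:
        return RESOLUTION_BUCKETS[resolution][aspect_ratio]
    except KeyError as exc:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}") from exc
```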
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `false`
Off for this mode: scene splitting would desync the reference/target/mask triplet. Provide pre-split clips instead.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration threshold for scene splitting (only relevant if splitting were enabled).
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `video_url` (`string`, required): the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
- `mask_url` (`string`, required): the mask (image or video). WHITE = regenerate, BLACK = keep.
- `reference_video_url` (`string`, required): the reference (control) video guiding the regenerated region.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained IC-LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel (when validation samples were provided).
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
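As a rough budgeting aid, the billing rule can be combined with the per-step price quoted in the pricing note (0.0055 per step). The sketch below assumes one billable unit is priced like one step, which the page implies but does not state outright; the names are illustrative.

```python
from decimal import Decimal

PRICE_PER_UNIT_USD = Decimal("0.0055")  # per-step price from the pricing note
MIN_BILLABLE_UNITS = 100


def billable_units(number_of_steps: int) -> int:
    """Units billed for a successful run: max(100, number_of_steps)."""
    return max(MIN_BILLABLE_UNITS, number_of_steps)


def estimated_cost_usd(number_of_steps: int) -> Decimal:
    """Estimated charge, assuming each billable unit costs PRICE_PER_UNIT_USD."""
    return PRICE_PER_UNIT_USD * billable_units(number_of_steps)
```

`Decimal` is used so the estimate is not affected by binary floating-point rounding.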
How the Training Works
Pipeline Overview
- Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket.
- Training — the model regenerates only the masked region of the target, guided by both the kept pixels and the reference video; the unmasked region is held as conditioning. Validation previews run at intervals.
- Output — the IC-LoRA, config, and validation reel are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. A missing reference or mask causes a clear error.
- Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
- Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
- Captions: the trigger phrase (if set) is prepended.
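The "resize to fill, then center-crop" step can be illustrated with plain arithmetic. The sketch below is one common way to compute such a fit and is an assumption about the geometry, not the trainer's actual code; rounding details in the real pipeline may differ.

```python
from typing import Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]  # left, top, right, bottom


def fill_and_center_crop(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[Size, Box]:
    """Scale the source so it covers the target, then crop the centered window."""
    # Use the larger ratio so both dimensions are at least the target size.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)
```

Applying the same scale and crop box to the reference and the target is what keeps the two clips aligned.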
What an IC-LoRA Is
An IC-LoRA performs an in-context transformation using a reference supplied at inference. Here it also respects a mask: only the masked region is regenerated, guided by both the surrounding kept pixels and the reference clip. The trained file specializes the base model for your masked, reference-guided edit. At inference you supply a source video, a mask, a reference video, and a prompt, and the LoRA regenerates only the masked region.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 well-aligned triplets representative of the masked, reference-guided edit you want.
- Make masks cleanly cover the edit area with a small margin so edges blend.
- Reference and target should be aligned so the reference clearly drives the masked region.
Mask Best Practices
- Remember: WHITE = regenerate, BLACK = keep.
- Use an image mask for a static region; a video mask for a moving region.
- Masks are resized to match their clips automatically.
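A cheap way to catch inverted polarity is to measure how much of a mask is white. The sketch below assumes the mask has already been decoded into rows of 0-255 grayscale values; the threshold of 128 and both helper names are arbitrary choices for illustration.

```python
from typing import List

Mask = List[List[int]]  # rows of grayscale pixel values in 0..255


def white_fraction(mask: Mask, threshold: int = 128) -> float:
    """Fraction of pixels at or above threshold, i.e. marked for regeneration."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    white = sum(1 for row in mask for pixel in row if pixel >= threshold)
    return white / total


def invert(mask: Mask) -> Mask:
    """Swap WHITE and BLACK, for masks authored with the opposite convention."""
    return [[255 - pixel for pixel in row] for row in mask]
```

If a mask meant to cover a small object reports a white fraction close to 1.0, it was probably drawn with BLACK as the edit region and needs inverting.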
Caption Best Practices
- Describe what should appear in the regenerated region and the overall scene, optionally with a trigger phrase.
- Keep captions consistent across triplets.
Good caption: `repl4ce the billboard content with a sunset landscape`
Weak caption: `billboard`
Inference Format Matching
At inference, supply a source video, a mask (WHITE=edit), and a reference video, and use the same caption style and trigger phrase.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/v2v_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the billboard with a sunset",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
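Submitting this configuration from Python might look like the sketch below. It assumes the third-party `fal-client` package is installed and a `FAL_KEY` credential is set in the environment; `build_arguments` and `submit` are hypothetical helpers, and the local validation checks only mirror the limits documented on this page.

```python
from typing import Any, Dict, List, Optional

ENDPOINT = "fal-ai/ltx23-trainer-v2/ic-lora/v2v-masked"
REQUIRED_VALIDATION_KEYS = ("video_url", "mask_url", "reference_video_url")


def build_arguments(
    training_data_url: str,
    trigger_phrase: str = "",
    validation: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Assemble the request body using the recommended starting values."""
    validation = list(validation or [])
    if len(validation) > 2:
        raise ValueError("at most 2 validation entries are allowed")
    for entry in validation:
        missing = [key for key in REQUIRED_VALIDATION_KEYS if key not in entry]
        if missing:
            raise ValueError(f"validation entry is missing: {missing}")
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 3000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "validation": validation,
    }


def submit(arguments: Dict[str, Any]) -> Any:
    """Send the training request and block until it finishes."""
    import fal_client  # third-party; imported lazily so the builder works without it

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

Calling `submit(build_arguments("https://example.com/v2v_masked_triplets.zip", "repl4ce"))` would start a billable training run, so it is worth validating the archive first.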
Diagnosing Issues
- Regenerated region ignores the reference: ensure triplets are aligned and the reference clearly relates to the masked content; add more examples.
- Wrong area edited: check mask polarity (WHITE = regenerate).
- Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
- Overfitting: fewer steps, lower `rank`, more triplets.
Validation Prompt Tips
- Use a fresh source/mask/reference set to gauge generalization.
- Describe the content you expect in the masked region.
Common Pitfalls
- Inverted mask polarity.
- Missing a `_start`, `_end`, or `_mask` for a triplet.
- Relying on scene splitting (disabled here).