fal-ai/ltx23-trainer-v2/av2av-masked
The cost of training depends on the number of steps. The formula is: 0.0068 * steps. With 1000 steps, your request will cost $6.80.
LTX 2.3 Trainer — Masked Audio+Video Transformation (`/av2av-masked`)
Overview
The `/av2av-masked` endpoint trains a LoRA for the LTX 2.3 model that regenerates the masked region of a target video (guided by the kept pixels and a video reference) while jointly generating audio from an audio reference. It combines masked video-to-video editing with audio generation. You provide triplets — a reference clip (video + audio), a target clip (video + audio), and a mask — and the LoRA learns the masked, reference-guided audiovisual transformation.
Key features:
- Trains a LoRA that regenerates a masked video region (guided by kept pixels + a video reference) and jointly generates audio from an audio reference.
- Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
- Masks can be a single image (every frame) or a video mask (per-frame).
- Validation previews produce a combined video that carries the generated audio.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name, where the reference and target both carry audio:
- `<name>_start.<ext>`: the reference clip (video + audio). Its video guides the regenerated region; its audio track is the audio reference.
- `<name>_end.<ext>`: the target clip (video + audio). Its audio is the audio generation target.
- `<name>_mask.<ext>`: the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
- `<name>.txt`: optional caption.
Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Both the reference and target of every triplet must contain an audio track — silent clips are rejected. Every target needs a matching `_start` and `_mask`. A video mask is normalized to its clip's frame count automatically — a shorter mask freeze-holds its last frame, and a longer mask is trimmed. (An image mask is applied to every frame.) File names must be unique across the archive. Aim for at least 10 triplets.
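The video-mask normalization described above (freeze-hold the last frame when the mask is shorter, trim when it is longer) can be pictured with a minimal sketch. The function name `normalize_mask_frames` and the list-of-frames representation are assumptions made for illustration; the trainer's actual implementation is not documented here.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Fit a per-frame mask to a clip length (hypothetical helper).

    A longer mask is trimmed; a shorter mask repeats its last frame.
    A single-frame list behaves like an image mask applied to every frame.
    """
    if not mask_frames:
        raise ValueError("mask must contain at least one frame")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        return mask_frames[:clip_frame_count]
    # Freeze-hold: pad with copies of the final mask frame.
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding
```

Frames are treated as opaque objects here, so the same logic applies whether they are file paths, arrays, or decoded images.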
Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
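As a rough local pre-check for the minimum-length rule, the sketch below filters triplets by frame count before upload. It assumes you have already measured the frame counts yourself (for example with a video tool of your choice); `select_long_enough` and its input shape are hypothetical, and it only mirrors the behavior described for `auto_scale_input` off.

```python
from typing import Dict, List, Tuple


def select_long_enough(
    frame_counts: Dict[str, Tuple[int, int]],
    number_of_frames: int = 89,
) -> List[str]:
    """Return base names whose _start and _end clips both have enough frames.

    frame_counts maps a triplet base name to (start_frames, end_frames).
    Raises ValueError when no triplet qualifies, mirroring the documented failure.
    """
    kept = [
        name
        for name, (start_frames, end_frames) in frame_counts.items()
        if start_frames >= number_of_frames and end_frames >= number_of_frames
    ]
    if not kept:
        raise ValueError("no triplet has enough frames; consider auto_scale_input")
    return kept


def min_duration_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Clip duration implied by a frame count at a given frame rate."""
    return number_of_frames / frame_rate
```

With the defaults, `min_duration_seconds()` is 89 / 24, which is where the roughly 3.7 s figure comes from.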
Example layout:
- `sample01_start.mp4`
- `sample01_end.mp4`
- `sample01_mask.png`
- `sample01.txt`
- `sample02_start.mp4`
- `sample02_end.mp4`
- `sample02_mask.mp4`
- `sample02.txt`
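Before uploading, it can help to confirm that every base name has all three files. The following is a minimal sketch of such a check working on file names only; `check_triplets` is a hypothetical helper, and it cannot verify that clips actually contain an audio track.

```python
import os
from collections import defaultdict

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ROLES = ("_start", "_end", "_mask")


def check_triplets(names):
    """Group archive member names by base name; return (complete, problems)."""
    roles = defaultdict(dict)
    for path in names:
        filename = os.path.basename(path)
        # Skip directories, hidden files, and macOS metadata folders.
        if not filename or filename.startswith(".") or "__MACOSX" in path:
            continue
        stem, ext = os.path.splitext(filename)
        for role in ROLES:
            if stem.endswith(role):
                roles[stem[: -len(role)]][role] = ext.lower()
                break

    complete, problems = [], []
    for base in sorted(roles):
        found = roles[base]
        missing = [role for role in ROLES if role not in found]
        if missing:
            problems.append(f"{base}: missing {', '.join(missing)}")
        elif found["_start"] not in VIDEO_EXTS or found["_end"] not in VIDEO_EXTS:
            problems.append(f"{base}: _start and _end must be video files")
        elif found["_mask"] not in VIDEO_EXTS | IMAGE_EXTS:
            problems.append(f"{base}: unsupported mask extension {found['_mask']}")
        else:
            complete.append(base)
    return complete, problems
```

The member names can be read from a local archive with the standard library, for example `zipfile.ZipFile("triplets.zip").namelist()`.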
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of reference/target/mask triplets (with audio).
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
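The snapping rule can be expressed in a couple of lines. This is a sketch of the arithmetic only: the helper name is hypothetical, and rejecting out-of-range values with an exception is an assumption about how you might want to handle them locally.

```python
def snap_number_of_frames(requested: int, low: int = 9, high: int = 121) -> int:
    """Snap a frame count down to the nearest value with frames % 8 == 1."""
    if not low <= requested <= high:
        raise ValueError(f"number_of_frames must be in [{low}, {high}]")
    return ((requested - 1) // 8) * 8 + 1
```

For example, a request for 96 frames would snap down to 89, while 97 is already valid and stays unchanged.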
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `false`
Off for this mode: scene splitting would desync the reference/target/mask triplet. Provide pre-split clips instead.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration threshold for scene splitting (only relevant if splitting were enabled).
Audio Configuration
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness across the dataset.
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to video duration.
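For intuition about `audio_normalize`, peak normalization in general scales a signal so that its largest absolute sample reaches a chosen level. The sketch below illustrates the idea on a plain list of float samples; the target peak of 1.0 and the function itself are assumptions, not the trainer's documented behavior.

```python
from typing import List, Sequence


def peak_normalize(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silence (or empty input) has no peak to scale; return it unchanged.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Because every clip ends up with the same peak level, quiet and loud recordings in a dataset become more comparable.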
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `video_url` (`string`, required): the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
- `mask_url` (`string`, required): the mask (image or video). WHITE = regenerate, BLACK = keep.
- `reference_video_url` (`string`, required): the reference video (its audio track is also used as the audio reference) guiding the regenerated audio+video.
- `reference_audio_url` (`string`, optional): a separate reference audio. If omitted, the audio is taken from the reference video's own track, in which case that reference video must contain an audio track, or the validation sample is rejected with a 422.
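A small local check of the `validation` array can catch missing fields before a request is sent. This is a minimal sketch based only on the field list above; `check_validation_entries` is a hypothetical helper and does not verify that the URLs resolve or that the reference video has audio.

```python
from typing import Dict, List

REQUIRED_KEYS = ("video_url", "mask_url", "reference_video_url")
MAX_ENTRIES = 2


def check_validation_entries(entries: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    if len(entries) > MAX_ENTRIES:
        errors.append(f"at most {MAX_ENTRIES} validation entries are allowed")
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                errors.append(f"validation[{index}]: missing {key}")
    return errors
```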
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel; the preview video carries the generated audio track.
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed for `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
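The billing rule can be turned into a quick estimate. The sketch below assumes that each billable unit is priced at the 0.0068 per-step rate quoted in the pricing note at the top of this page; that link between units and dollars is an assumption, and both function names are hypothetical.

```python
def billable_units(number_of_steps: int) -> int:
    """Units billed for a successful run: max(100, number_of_steps)."""
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps: int, usd_per_unit: float = 0.0068) -> float:
    """Estimated cost in USD, assuming one unit costs usd_per_unit."""
    return round(billable_units(number_of_steps) * usd_per_unit, 2)
```

Since the documented minimum for `number_of_steps` is already 100, the floor in `max(100, ...)` coincides with the smallest allowed run.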
How the Training Works
Pipeline Overview
- Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, both clips verified to have audio, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket. The target's audio is extracted from its own track; the reference's audio from the reference clip.
- Training — the model regenerates the masked video region (guided by kept pixels + the video reference) and jointly generates audio (conditioned on the audio reference). Validation previews run at intervals.
- Output — the LoRA, config, and validation reel (with generated audio) are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. Any missing piece causes a clear error.
- Audio requirement: both clips of every triplet must carry audio; silent triplets are rejected.
- Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
- Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
- Captions: the trigger phrase (if set) is prepended.
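The video-fitting step above (resize to fill the bucket, then center-crop) comes down to simple geometry. The sketch below uses the bucket sizes from the resolution table; the rounding and integer-division choices are assumptions, and `fill_and_center_crop` is a hypothetical helper rather than the trainer's code.

```python
BUCKETS = {
    ("low", "16:9"): (512, 288),
    ("low", "1:1"): (512, 512),
    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448),
    ("medium", "1:1"): (768, 768),
    ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),
    ("high", "1:1"): (960, 960),
    ("high", "9:16"): (544, 960),
}


def fill_and_center_crop(src_w, src_h, resolution="medium", aspect_ratio="1:1"):
    """Return ((scaled_w, scaled_h), (left, top, right, bottom)) for a source size."""
    dst_w, dst_h = BUCKETS[(resolution, aspect_ratio)]
    # "Fill" means scaling by the larger ratio so both dimensions cover the bucket.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the scaled frame.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)
```

Applying the same scale and crop box to the reference, target, and mask is what keeps the three aligned.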
How It Works
This LoRA is trained to perform a reference-conditioned transformation: instead of generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result. The trained file specializes the base model on your reference→target mapping. Here it respects a mask (only the masked video region is regenerated) and jointly generates audio from an audio reference. At inference you supply a source video, a mask, and a reference video (and optionally a separate reference audio) plus a prompt.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 well-aligned triplets, each with clean, in-sync audio.
- Make masks cleanly cover the edit area with a small margin so edges blend.
- Keep each reference and target aligned so that the reference clearly drives the regenerated content.
Mask Best Practices
- Remember: WHITE = regenerate, BLACK = keep.
- Use an image mask for a static region; a video mask for a moving region.
- Masks are resized automatically to match the video.
Caption Best Practices
- Describe the regenerated content, the overall scene, and the desired sound plainly, optionally with a trigger phrase.
- Keep captions consistent across triplets.
Good caption: `repl4ce the screen content with rolling waves and ocean sounds`
Weak caption: `screen`
Inference Format Matching
At inference, supply a source video, a mask (WHITE = regenerate), and a reference video (optionally a separate reference audio), and use the same caption style and trigger phrase as in training.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/av2av_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the screen with ocean waves",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
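One way to send this configuration from Python is sketched below. It assumes the `fal_client` package is installed and a `FAL_KEY` credential is set in the environment; `build_arguments` and `submit` are hypothetical helper names, and the shape of the returned result is not shown because it is not specified here beyond the Outputs section.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/av2av-masked"


def build_arguments(training_data_url, trigger_phrase="", **overrides):
    """Build a request payload from the documented starting configuration."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
    }
    # Any documented parameter (for example "validation") can be overridden here.
    arguments.update(overrides)
    return arguments


def submit(arguments):
    """Submit the training request and wait for the result.

    Assumption: fal_client is installed; it is imported lazily so the payload
    helpers above stay usable without it.
    """
    import fal_client

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

A call such as `submit(build_arguments("https://example.com/av2av_masked_triplets.zip", trigger_phrase="repl4ce"))` would start a run with the defaults above.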
Diagnosing Issues
- Regenerated region or audio ignores the references: ensure triplets are aligned and the references clearly relate to the regenerated content; add more examples.
- Wrong area edited: check mask polarity (WHITE = regenerate).
- Dataset rejected for silence: both clips of every triplet must contain audio.
- Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
- Overfitting: fewer steps, lower `rank`, more triplets.
Validation Prompt Tips
- Use a fresh source/mask/reference set to gauge generalization.
- Describe the content and sound you expect in the regenerated region.
Common Pitfalls
- Silent reference or target clips (rejected).
- Inverted mask polarity.
- Missing a `_start`, `_end`, or `_mask` for a triplet.
- Relying on scene splitting (disabled here).