LTX 2.3 Trainer (V2) - Masked Audio+Video Transformation (Training) API on fal

LTX 2.3 Trainer — Masked Audio+Video Transformation (`/av2av-masked`)

Overview

The `/av2av-masked` endpoint trains a LoRA for the LTX 2.3 model that regenerates the masked region of a target video (guided by the kept pixels and a video reference) while jointly generating audio from an audio reference. It combines masked video-to-video editing with audio generation. You provide triplets — a reference clip (video + audio), a target clip (video + audio), and a mask — and the LoRA learns the masked, reference-guided audiovisual transformation.

Key features:

Trains a LoRA that regenerates a masked video region (guided by kept pixels + a video reference) and jointly generates audio from an audio reference.
Standard mask convention: WHITE = the region to regenerate, BLACK = keep unchanged.
Masks can be a single image (every frame) or a video mask (per-frame).
Validation previews produce a combined video that carries the generated audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of triplets sharing a base name, where the reference and target both carry audio:

`<name>_start.<ext>` — the reference clip (video + audio). Its video guides the regenerated region; its audio track is the audio reference.
`<name>_end.<ext>` — the target clip (video + audio). Its audio is the audio generation target.
`<name>_mask.<ext>` — the mask. Either an image (`.png`, `.jpg`, `.jpeg`, `.bmp`, `.webp`) applied to all frames, or a video mask (per-frame). WHITE marks the region to regenerate; BLACK is kept.
`<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`.
Both the reference and the target of every triplet must contain an audio track; silent clips are rejected.
Every target needs a matching `_start` and `_mask`.
A video mask is normalized to its clip's frame count automatically: a shorter mask freeze-holds its last frame, and a longer mask is trimmed. An image mask is applied to every frame.
File names must be unique across the archive.
Aim for at least 10 triplets.
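The video-mask normalization rule above (freeze-hold a short mask, trim a long one) can be sketched as follows. This is only an illustration, not the trainer's code: `normalize_mask_frames` is a hypothetical helper, frames are represented as plain list items, and trimming from the end of the mask is an assumption.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def normalize_mask_frames(mask_frames: List[Frame], clip_frame_count: int) -> List[Frame]:
    """Fit a per-frame mask to a clip's frame count (hypothetical helper)."""
    if not mask_frames:
        raise ValueError("a video mask needs at least one frame")
    if clip_frame_count < 1:
        raise ValueError("clip_frame_count must be positive")
    if len(mask_frames) >= clip_frame_count:
        # Longer (or equal) mask: drop the extra frames (assumed: from the end).
        return mask_frames[:clip_frame_count]
    # Shorter mask: repeat the last frame until the clip ends (freeze-hold).
    padding = [mask_frames[-1]] * (clip_frame_count - len(mask_frames))
    return mask_frames + padding
```

For example, a 2-frame mask fitted to a 4-frame clip repeats its second frame twice.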

Minimum clip length: with `auto_scale_input` off (the default), the target (`_end`) and reference (`_start`) clips of each triplet must each have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A triplet with either clip too short is skipped, and if none qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
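A local pre-check for the minimum-length rule might look like the sketch below. It assumes you have already measured each clip's frame count with a media probe of your choice; `select_trainable_triplets` is a hypothetical name, and the `ValueError` merely stands in for the 422 the endpoint returns.

```python
from typing import Dict, List, Tuple


def select_trainable_triplets(
    frame_counts: Dict[str, Tuple[int, int]],
    number_of_frames: int = 89,
) -> List[str]:
    """Return triplet names whose reference and target are both long enough.

    frame_counts maps a triplet name to (start_frames, end_frames).
    """
    kept = [
        name
        for name, (start_frames, end_frames) in sorted(frame_counts.items())
        if min(start_frames, end_frames) >= number_of_frames
    ]
    if not kept:
        # Stand-in for the request failure when no triplet qualifies.
        raise ValueError("no triplet has enough frames")
    return kept


# The default of 89 frames at 24 fps is roughly 3.7 seconds.
min_seconds = 89 / 24
```

This mirrors the behavior with `auto_scale_input` off; with it on, shorter clips are resampled rather than skipped.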

Example layout:


sample01_start.mp4   sample01_end.mp4   sample01_mask.png   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02_mask.mp4   sample02.txt
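Before zipping, the naming rules can be checked locally. The sketch below is an assumption-laden helper (`check_triplets` is not part of any fal tooling): it groups file names by base name and reports missing pieces or unsupported extensions. It does not inspect audio tracks, which would need a media probe.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}
ROLES = ("start", "end", "mask")


def check_triplets(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each triplet base name to the problems found (empty list if OK)."""
    found: Dict[str, Dict[str, str]] = {}
    for file_name in file_names:
        path = PurePosixPath(file_name)
        stem, ext = path.stem, path.suffix.lower()
        for role in ROLES:
            if stem.endswith("_" + role):
                base = stem[: -len(role) - 1]
                found.setdefault(base, {})[role] = ext
                break  # captions (<name>.txt) match no role and are skipped

    problems: Dict[str, List[str]] = {}
    for base, roles in sorted(found.items()):
        issues = [f"missing _{role}" for role in ROLES if role not in roles]
        for role in ("start", "end"):
            if role in roles and roles[role] not in VIDEO_EXTS:
                issues.append(f"_{role} is not a supported video format")
        if "mask" in roles and roles["mask"] not in VIDEO_EXTS | IMAGE_EXTS:
            issues.append("_mask is not a supported image or video format")
        problems[base] = issues
    return problems
```

Running it on the layout above returns an empty problem list for both `sample01` and `sample02`.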

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of reference/target/mask triplets (with audio).

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.
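To preview what a caption might look like once the trigger phrase is prepended, a sketch such as the following can help. The single-space separator and the whitespace trimming are assumptions; the page does not state how the trainer joins the two strings, and `apply_trigger_phrase` is a hypothetical name.

```python
def apply_trigger_phrase(caption: str, trigger_phrase: str = "") -> str:
    """Prepend the trigger phrase to a caption (separator is an assumption)."""
    caption = caption.strip()
    trigger_phrase = trigger_phrase.strip()
    if not trigger_phrase:
        return caption  # default "" leaves the caption unchanged
    if not caption:
        return trigger_phrase
    return f"{trigger_phrase} {caption}"
```

Using the same phrase in inference prompts keeps them in the format the LoRA saw during training.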

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA rank. Higher values give the adapter more capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
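The snapping rule can be reproduced with integer arithmetic. In this sketch, `snap_number_of_frames` is a hypothetical helper, and clamping out-of-range values to 9 and 121 is an assumption (the API may reject them instead).

```python
def snap_number_of_frames(requested: int, minimum: int = 9, maximum: int = 121) -> int:
    """Snap a frame count down to the nearest value with frames % 8 == 1."""
    # Assumption: values outside the documented range are clamped first.
    clamped = max(minimum, min(maximum, requested))
    # Drop the remainder above the last multiple of 8, then add 1 back.
    return ((clamped - 1) // 8) * 8 + 1
```

Valid counts in the documented range are therefore 9, 17, 25, and so on up to 121; a request for 96 frames would train on 89.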

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| `low` | 512×288 | 512×512 | 288×512 |
| `medium` | 768×448 | 768×768 | 448×768 |
| `high` | 960×544 | 960×960 | 544×960 |
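For scripts that need the pixel dimensions, the table above can be transcribed into a lookup. The values are copied from the table; `BUCKETS` and `bucket_size` are illustrative names, not part of the API.

```python
from typing import Dict, Tuple

# (width, height) in pixels for each (resolution, aspect_ratio) pair.
BUCKETS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("low", "16:9"): (512, 288),
    ("low", "1:1"): (512, 512),
    ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448),
    ("medium", "1:1"): (768, 768),
    ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544),
    ("high", "1:1"): (960, 960),
    ("high", "9:16"): (544, 960),
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1") -> Tuple[int, int]:
    """Look up the training dimensions for a resolution and aspect ratio."""
    try:
        return BUCKETS[(resolution, aspect_ratio)]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution}, {aspect_ratio}") from None
```

With the defaults (`medium`, `1:1`) this returns 768 by 768.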

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for this mode: scene splitting would desync the reference/target/mask triplet. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Audio Configuration

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
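Peak normalization in general scales a signal so that its largest absolute sample reaches a chosen level. The sketch below shows the idea on a plain list of float samples; the target peak of 1.0 is an assumption, since the level the trainer uses is not documented here.

```python
from typing import List, Sequence


def peak_normalize(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Pure silence (or no samples) has no peak to scale; return a copy.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Because every clip ends up with the same peak level, quiet and loud recordings contribute comparably during training.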

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — the source video; its unmasked region is kept pixel-faithful and the masked region is regenerated.
`mask_url` (`string`, required) — the mask (image or video). WHITE = regenerate, BLACK = keep.
`reference_video_url` (`string`, required) — the reference video guiding the regenerated audio+video. Its audio track also serves as the audio reference unless `reference_audio_url` is set.
`reference_audio_url` (`string`, optional) — a separate reference audio. If omitted, the audio reference is taken from the reference video's own track, so that video must contain an audio track; otherwise the validation sample is rejected with a 422.
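A small builder can catch malformed validation entries before a request is sent. The sketch below checks only the rules listed above (required fields, at most 2 entries); the helper names are hypothetical, and it cannot verify that a reference video actually carries audio.

```python
from typing import Dict, List, Optional

REQUIRED_KEYS = ("video_url", "mask_url", "reference_video_url")


def make_validation_sample(
    prompt: str,
    video_url: str,
    mask_url: str,
    reference_video_url: str,
    reference_audio_url: Optional[str] = None,
) -> Dict[str, str]:
    """Build one entry for the `validation` array."""
    sample = {
        "prompt": prompt,
        "video_url": video_url,
        "mask_url": mask_url,
        "reference_video_url": reference_video_url,
    }
    if reference_audio_url:
        # Only include the key when a separate audio reference is supplied.
        sample["reference_audio_url"] = reference_audio_url
    return sample


def check_validation(samples: List[Dict[str, str]]) -> None:
    """Raise ValueError if the list breaks a documented constraint."""
    if len(samples) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for index, sample in enumerate(samples):
        for key in REQUIRED_KEYS:
            if not sample.get(key):
                raise ValueError(f"validation[{index}] is missing {key}")
```

When `reference_audio_url` is left out, remember that the reference video itself must contain an audio track.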

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel; the preview video carries the generated audio track.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
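The billing rule is simple enough to express directly. This sketch only restates the formula above; `billable_units` is an illustrative name, and the price per unit is not covered here.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run, following the documented formula."""
    if not succeeded:
        # Validation errors (HTTP 422) and dataset-download failures cost nothing.
        return 0
    return max(100, number_of_steps)
```

Since the documented minimum for `number_of_steps` is already 100, the `max` acts only as a floor; a default 2000-step run is billed 2000 units.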

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, reference/target/mask triplets and captions matched, both clips verified to have audio, masks converted to the internal convention (a video mask normalized to the target's frame count), and clips fit to the resolution bucket. The target's audio is extracted from its own track; the reference's audio from the reference clip.
Training — the model regenerates the masked video region (guided by kept pixels + the video reference) and jointly generates audio (conditioned on the audio reference). Validation previews run at intervals.
Output — the LoRA, config, and validation reel (with generated audio) are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Triplet matching: each `<name>_end` target is paired with its `<name>_start` reference, `<name>_mask`, and optional `<name>.txt`. A missing `_start` or `_mask` causes a clear error; a missing caption does not.
Audio requirement: both clips of every triplet must carry audio; silent triplets are rejected.
Mask handling: your WHITE = edit / BLACK = keep mask is converted to the internal convention automatically. Image masks apply to every frame; video masks are aligned to the target's frame count.
Video fitting: reference and target are resized to fill the resolution bucket and center-cropped, staying aligned.
Captions: the trigger phrase (if set) is prepended.
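The video-fitting step (resize to fill, then center-crop) follows a common pattern that can be sketched with plain arithmetic. This is a generic illustration rather than the trainer's implementation; the rounding behavior and the helper name `fill_and_center_crop` are assumptions.

```python
from typing import Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]


def fill_and_center_crop(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[Size, Box]:
    """Return the resized size and the crop box (left, top, right, bottom)."""
    # Scale by the larger ratio so the frame covers the whole bucket.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the resized frame.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)
```

Applying the same resize and crop box to the reference and target of a triplet keeps them spatially aligned, provided both clips share the same source dimensions.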

How It Works

This LoRA is trained to perform a reference-conditioned transformation: instead of generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result. The trained file specializes the base model on your reference→target mapping. Here it respects a mask (only the masked video region is regenerated) and jointly generates audio from an audio reference. At inference you supply a source video, a mask, and a reference video (and optionally a separate reference audio) plus a prompt.

Tips for Getting Good Results

Dataset Quality

Use at least 10 well-aligned triplets, each with clean, in-sync audio.
Make masks cleanly cover the edit area with a small margin so edges blend.
Keep each reference and target aligned so that the video and audio references clearly drive the regenerated content.

Mask Best Practices

Remember: WHITE = regenerate, BLACK = keep.
Use an image mask for a static region; a video mask for a moving region.
Masks are resized automatically to match the target clip.

Caption Best Practices

Describe the regenerated content, the overall scene, and the desired sound plainly, optionally with a trigger phrase.
Keep captions consistent across triplets.

Good caption: `repl4ce the screen content with rolling waves and ocean sounds`
Weak caption: `screen`

Inference Format Matching

At inference, supply a source video, a mask (WHITE=edit), and a reference video (optionally a separate reference audio), and use the same caption style and trigger phrase.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/av2av_masked_triplets.zip",
  "trigger_phrase": "repl4ce",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "repl4ce the screen with ocean waves",
      "video_url": "https://example.com/source.mp4",
      "mask_url": "https://example.com/mask.png",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
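One way to send a configuration like this from Python is sketched below. It assumes the `fal-client` package is installed and a `FAL_KEY` credential is set, and it uses the endpoint ID shown on this page; `build_request` and `submit` are hypothetical helpers, and the range checks cover only two of the documented parameters.

```python
from typing import Any, Dict

ENDPOINT = "fal-ai/ltx23-trainer-v2/av2av-masked"


def build_request(training_data_url: str, **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the starting configuration and sanity-check it."""
    request: Dict[str, Any] = {
        "training_data_url": training_data_url,
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
    }
    request.update(overrides)
    if request["rank"] not in (8, 16, 32, 64, 128):
        raise ValueError("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= request["number_of_steps"] <= 20000:
        raise ValueError("number_of_steps must be within 100-20000")
    return request


def submit(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request and wait for the result (needs network access)."""
    import fal_client  # third-party package: pip install fal-client

    return fal_client.subscribe(ENDPOINT, arguments=request, with_logs=True)
```

A call such as `submit(build_request("https://example.com/av2av_masked_triplets.zip", trigger_phrase="repl4ce"))` would be expected to return the fields listed under Outputs; the exact shape of each entry is not specified on this page.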

Diagnosing Issues

Regenerated region or audio ignores the references: ensure triplets are aligned and the references clearly relate to the regenerated content; add more examples.
Wrong area edited: check mask polarity (WHITE = regenerate).
Dataset rejected for silence: both clips of every triplet must contain audio.
Mask/clip desync: avoid scene splitting; provide matched, pre-split triplets.
Overfitting: use fewer steps, a lower `rank`, or more triplets.

Validation Prompt Tips

Use a fresh source/mask/reference set to gauge generalization.
Describe the content and sound you expect in the regenerated region.

Common Pitfalls

Silent reference or target clips (rejected).
Inverted mask polarity.
Missing a `_start`, `_end`, or `_mask` for a triplet.
Relying on scene splitting (disabled here).

fal-ai/ltx23-trainer-v2/av2av-masked
