LTX 2.3 Trainer (V2) - Audio+Video Reference Transformation (Training) API on fal

LTX 2.3 Trainer — Audio+Video Reference Transformation (`/av2av`)

Overview

The `/av2av` endpoint trains a LoRA for the LTX 2.3 model that performs a combined audio+video → audio+video transformation. It is conditioned on a reference clip (both its video and its audio track) and learns to produce a corresponding target clip (with both generated video and generated audio). You provide pairs of clips — a reference and a target, each carrying audio — and the LoRA learns the joint mapping.

Use this endpoint when you want a control-driven transformation across both modalities at once: restyling a video and its sound together, applying a recurring audiovisual edit, and similar reference-to-(audio+video) tasks.

Key features:

Trains a LoRA from paired reference→target clips, each with audio.
Both video and audio are generated, each conditioned on its own reference.
Optional reference video downscaling / temporal downsampling so the LoRA can be driven by a coarse / low-FPS video control proxy.
Validation previews transform a supplied reference clip (video + audio) into a combined video preview that carries the generated audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of paired clips, each with an audio track:

`<name>_start.<ext>` — the reference clip (video + audio).
`<name>_end.<ext>` — the target clip (video + audio).
`<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Both the reference and the target of every pair must contain an audio track — silent clips are rejected with a clear error. Each `_start` must have a matching `_end`. File names must be unique across the archive. Aim for at least 10 pairs.

Minimum clip length: with `auto_scale_input` off (the default), both clips in each pair (`_start` and `_end`) must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A pair with either side too short is skipped, and if no pair qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
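The skip rule above can be sketched as follows. This is only an illustration, assuming you already know each clip's frame count (for example from a probing tool); `usable_pairs` and its input shape are hypothetical and not part of the trainer.

```python
def usable_pairs(frame_counts, number_of_frames=89):
    """Return the names of pairs whose reference AND target are long enough.

    frame_counts maps a pair name to (start_frames, end_frames).
    Models the documented behaviour for auto_scale_input=False only.
    """
    kept = [
        name
        for name, (start_frames, end_frames) in frame_counts.items()
        if min(start_frames, end_frames) >= number_of_frames
    ]
    if not kept:
        # The documentation says the request fails (422) in this case.
        raise ValueError("no pair has enough frames on both sides")
    return kept


counts = {"sample01": (120, 96), "sample02": (120, 60)}
print(usable_pairs(counts))   # ['sample01']
print(round(89 / 24, 2))      # 3.71 seconds needed at 24 fps
```

A pair is dropped as soon as either side is short, so one long reference cannot compensate for a short target.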

Example layout:


sample01_start.mp4   sample01_end.mp4   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02.txt
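Before uploading, the naming rules can be checked locally. The sketch below is an assumption-laden illustration, not the trainer's own matcher: `match_pairs` is a hypothetical helper that works on a list of archive member names (for instance from `zipfile.ZipFile.namelist()`), and it does not verify that clips contain audio.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}


def match_pairs(names):
    """Map each pair name to (start_clip, end_clip, caption_or_None)."""
    starts, ends, captions = {}, {}, {}
    for raw in names:
        path = PurePosixPath(raw)
        # Skip macOS metadata folders and hidden files.
        if "__MACOSX" in path.parts or path.name.startswith("."):
            continue
        stem, ext = path.stem, path.suffix.lower()
        if ext in VIDEO_EXTS and stem.endswith("_start"):
            starts[stem[: -len("_start")]] = raw
        elif ext in VIDEO_EXTS and stem.endswith("_end"):
            ends[stem[: -len("_end")]] = raw
        elif ext == ".txt":
            captions[stem] = raw
    # Names present on only one side are unmatched.
    unmatched = sorted(set(starts) ^ set(ends))
    if unmatched:
        raise ValueError(f"missing a _start or _end clip for: {unmatched}")
    return {n: (starts[n], ends[n], captions.get(n)) for n in sorted(starts)}


members = ["sample01_start.mp4", "sample01_end.mp4", "sample01.txt",
           "sample02_start.mp4", "sample02_end.mp4"]
print(match_pairs(members)["sample02"])
# ('sample02_start.mp4', 'sample02_end.mp4', None)
```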

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of `_start`/`_end` clip pairs (with audio).

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`reference_downscale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Spatially downscale the reference video by this factor before encoding (coarse/low-resolution control proxy → full-resolution output). `1` means no downscaling. Both width and height must be divisible by the factor, and width ÷ factor and height ÷ factor must each be divisible by 32 (checked against both the training and validation resolutions); an incompatible value fails the request with a 422. (Applies to the video reference only; audio has no spatial knob.)
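The divisibility rule can be worked through numerically. The helper below is a hypothetical sketch of the stated constraint (not the service's validator); the sizes come from the resolution table in this document, and the 768×768 training / 960×960 validation pairing corresponds to the default `medium` and `high` settings at `1:1`.

```python
def valid_downscale_factors(width, height, max_factor=8):
    """List factors in 1..max_factor that satisfy the documented rule."""
    return [
        f
        for f in range(1, max_factor + 1)
        if width % f == 0
        and height % f == 0
        and (width // f) % 32 == 0
        and (height // f) % 32 == 0
    ]


train = valid_downscale_factors(768, 768)       # medium, 1:1
validation = valid_downscale_factors(960, 960)  # high, 1:1
print(train)                                          # [1, 2, 3, 4, 6, 8]
print(validation)                                     # [1, 2, 3, 5, 6]
# The factor must pass for both resolutions, so intersect the two lists.
print([f for f in train if f in validation])          # [1, 2, 3, 6]
print(valid_downscale_factors(768, 448))              # [1, 2]
```

Under this reading, non-square buckets are considerably more restrictive than square ones.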

`reference_temporal_scale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Temporally downsample the reference video (lower FPS) by this factor before encoding. `1` means no change. `(number_of_frames − 1)` must be divisible by the factor, and after subsampling `(frames − 1)` must remain a multiple of 8 — checked against both `number_of_frames` and `validation_number_of_frames`. An incompatible factor/frame-count combination fails the request with a 422. (Example: with the default 89 frames, a factor of 2 is invalid because `(89 − 1) ÷ 2 = 44`, which is not a multiple of 8.)
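The same arithmetic can be enumerated for any frame count. This is a sketch of the rule as stated, assuming the frame count after subsampling is `(number_of_frames - 1) / factor + 1`, which matches the 89-frame example above; `valid_temporal_factors` is a hypothetical helper.

```python
def valid_temporal_factors(number_of_frames, max_factor=8):
    """List factors in 1..max_factor allowed for a given frame count."""
    intervals = number_of_frames - 1
    return [
        f
        for f in range(1, max_factor + 1)
        # (frames - 1) must divide evenly, and the quotient must be a multiple of 8.
        if intervals % f == 0 and (intervals // f) % 8 == 0
    ]


print(valid_temporal_factors(89))   # [1]
print(valid_temporal_factors(97))   # [1, 2, 3, 4, 6]
print(valid_temporal_factors(65))   # [1, 2, 4, 8]
```

Under this reading, the default 89 frames admits no factor other than 1, so temporal downsampling requires a different `number_of_frames` (and a compatible `validation_number_of_frames`).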

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
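A small sketch of the snapping rule, assuming "snapped down" means the largest value not above the request that satisfies `frames % 8 == 1`. Rejecting out-of-range values is a choice made for this illustration; `snap_frame_count` is a hypothetical name.

```python
def snap_frame_count(frames):
    """Snap down to the nearest count with frames % 8 == 1."""
    if not 9 <= frames <= 121:
        raise ValueError("number_of_frames must be within 9..121")
    return frames - (frames - 1) % 8


for requested in (89, 90, 96, 100, 121):
    print(requested, "->", snap_frame_count(requested))
# 89 -> 89, 90 -> 89, 96 -> 89, 100 -> 97, 121 -> 121
```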

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Resample clips to fit the target frame count and frame rate. When off, pairs with a clip shorter than `number_of_frames` are skipped instead.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Audio Configuration

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`reference_video_url` (`string`, required) — the reference video that conditions the generated audio+video. Its audio track also serves as the audio reference unless `reference_audio_url` is supplied.
`reference_audio_url` (`string`, optional) — a separate reference audio. If omitted, the audio is taken from the reference video's own track, so that video must contain an audio track; otherwise the validation sample is rejected with a 422.
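The structural constraints on `validation` can be checked before submitting. The sketch below covers only what is stated above (at most 2 entries, `reference_video_url` required); `check_validation` is a hypothetical helper, the URLs are placeholders, and it cannot tell whether a video actually carries an audio track.

```python
def check_validation(entries):
    """Return a list of problems; an empty list means the array looks valid."""
    problems = []
    if len(entries) > 2:
        problems.append("validation accepts at most 2 entries")
    for index, entry in enumerate(entries):
        if not entry.get("reference_video_url"):
            problems.append(f"entry {index}: reference_video_url is required")
    return problems


samples = [
    # Audio reference taken from the video's own track.
    {"prompt": "av_st9le a neon-lit street",
     "reference_video_url": "https://example.com/ref.mp4"},
    # Separate audio reference supplied explicitly.
    {"prompt": "av_st9le a rainy alley",
     "reference_video_url": "https://example.com/ref2.mp4",
     "reference_audio_url": "https://example.com/ref2.wav"},
]
print(check_validation(samples))  # []
```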

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel; the preview video carries the generated audio track.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed as `max(100, number_of_steps)` units; within the documented range of `100`–`20000` steps, this equals `number_of_steps`. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
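As a quick worked example of the billing rule, the following sketch computes the units for a run. `billable_units` is a hypothetical helper; the price per unit is not stated here and is not modelled.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run, following the documented formula."""
    if not succeeded:
        # Validation errors (422) and dataset-download failures bill nothing.
        return 0
    return max(100, number_of_steps)


print(billable_units(2000))                   # 2000
print(billable_units(100))                    # 100
print(billable_units(2000, succeeded=False))  # 0
```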

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, `_start`/`_end` pairs and captions matched, both clips verified to have audio, clips fit to the resolution bucket, and the video reference optionally downscaled/temporally downsampled. The target's audio is extracted from its own track; the reference's audio is extracted from the reference clip.
Training — the LoRA trains for `number_of_steps`. Both modalities are generated: the video conditioned on the reference video, the audio conditioned on the reference audio. Validation previews run at intervals.
Output — the LoRA, config, and validation reel (with generated audio) are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. Targets missing a reference cause a clear error.
Audio requirement: both clips of every pair must carry audio; silent pairs are rejected.
Video fitting: both clips are resized to fill the resolution bucket and center-cropped, staying aligned.
Reference scaling: when set above 1, the reference video is downscaled/temporally downsampled before encoding so the LoRA learns to drive full-resolution output from a coarse reference.
Captions: the trigger phrase (if set) is prepended.
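The video-fitting step (resize to fill, then center-crop) can be illustrated with plain arithmetic. This is a generic sketch of that technique; the trainer's exact rounding is not documented, so the rounding here is an assumption and `fill_and_center_crop` is a hypothetical helper.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return ((scaled_w, scaled_h), (left, top, right, bottom))."""
    # Scale so the clip covers the bucket in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Trim the overflow equally from both sides.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)


# A 1920x1080 clip fitted to the medium 1:1 bucket (768x768).
print(fill_and_center_crop(1920, 1080, 768, 768))
# ((1365, 768), (298, 0, 1066, 768))
```

Applying the same scale and crop box to the reference and the target of a pair is what keeps the two clips spatially aligned.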

How It Works

The LoRA is trained to perform a reference-conditioned transformation: instead of generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result — here, both video and audio together. The trained file specializes the base model on your reference→target mapping. At inference you supply a reference video (and optionally a separate reference audio) plus a prompt.

Tips for Getting Good Results

Dataset Quality

Use at least 10 well-aligned reference→target pairs, each with clean, in-sync audio.
Reference and target should describe the same scene/sound differing only by the transformation you want learned.

Caption Best Practices

Describe the target's content and sound plainly, optionally with a trigger phrase.
Keep captions consistent in style across pairs.

Good caption: `av_st9le a neon-lit street with a synthwave soundtrack`
Weak caption: `street`

Trigger Phrases

A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.

Reference Scaling

To drive generation from a coarse video control, set `reference_downscale_factor` above 1 and supply matching coarse references at inference. Validation previews apply the training-time scaling for you.

Inference Format Matching

At inference, supply a reference video (and optionally a separate reference audio), use the same trigger phrase, and apply the same reference scaling you trained with.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/av2av_pairs.zip",
  "trigger_phrase": "av_st9le",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "reference_downscale_factor": 1,
  "reference_temporal_scale_factor": 1,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "av_st9le a neon-lit street", "reference_video_url": "https://example.com/ref.mp4" }
  ]
}
```
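A configuration like the one above could be checked against the documented ranges and then submitted from Python. The sketch below is hedged: `preflight` is a hypothetical helper covering only a few documented limits, the endpoint ID is the one shown on this page, and `submit` assumes the third-party `fal-client` package is installed with a `FAL_KEY` set in the environment.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/av2av"  # endpoint ID as shown on this page

ALLOWED = {
    "rank": {8, 16, 32, 64, 128},
    "resolution": {"low", "medium", "high"},
    "aspect_ratio": {"16:9", "1:1", "9:16"},
}
RANGES = {
    "number_of_steps": (100, 20000),
    "number_of_frames": (9, 121),
    "frame_rate": (8, 60),
    "reference_downscale_factor": (1, 8),
    "reference_temporal_scale_factor": (1, 8),
}


def preflight(config):
    """Return a list of problems found against the documented limits."""
    problems = []
    if not config.get("training_data_url"):
        problems.append("training_data_url is required")
    for key, allowed in ALLOWED.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    for key, (low, high) in RANGES.items():
        if key in config and not low <= config[key] <= high:
            problems.append(f"{key} must be within {low}..{high}")
    if len(config.get("validation", [])) > 2:
        problems.append("validation accepts at most 2 entries")
    return problems


def submit(config):
    """Submit a training run and wait for the result (needs network access)."""
    import fal_client  # third-party; imported lazily so preflight works without it

    return fal_client.subscribe(ENDPOINT, arguments=config)


config = {"training_data_url": "https://example.com/av2av_pairs.zip",
          "rank": 32, "number_of_steps": 2000}
print(preflight(config))  # []
```

Running `preflight` first is cheap insurance against avoidable 422 responses; call `submit(config)` only once it returns an empty list.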

Diagnosing Issues

Transformation too weak: more steps, higher `rank`, or more consistent pairs.
Audio or video ignores the reference: ensure pairs are well aligned and differ only by the intended transformation; check reference scaling matches at inference.
Dataset rejected for silence: both clips of every pair must contain audio.
Overfitting: fewer steps, lower `rank`, more pairs.

Validation Prompt Tips

Use a fresh reference clip to gauge generalization.
Match the caption style and trigger phrase you trained with.

Common Pitfalls

Silent reference or target clips (rejected).
Missing or mismatched `_start`/`_end` pairs.
Using different reference scaling at inference than at training.

Endpoint ID: `fal-ai/ltx23-trainer-v2/av2av`
