LTX 2.3 Trainer (V2) - Audio+Video Reference IC-LoRA (Training) API on fal

LTX 2.3 Trainer — Audio+Video Reference IC-LoRA (`/ic-lora/av2av`)

Overview

The `/ic-lora/av2av` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 model that performs a combined audio+video → audio+video transformation. An IC-LoRA is a small adapter that does not generate from text alone — instead it conditions on a reference clip supplied at inference (both its video and its audio track) and learns to produce a corresponding target clip with both generated video and generated audio. You teach it that joint mapping by providing pairs of clips — a reference and a target, each carrying audio — and the LoRA learns to go from one to the other across both modalities at once.

Use this endpoint when you want a control-driven transformation across both modalities together, for example:

Restyling a video and its sound at the same time.
Applying a recurring audiovisual edit that touches both picture and audio.
Any reference-to-(audio+video) mapping you can demonstrate with paired examples.

Key features:

Trains an IC-LoRA from paired reference→target clips, each with audio.
Both video and audio are generated, each conditioned on its own reference.
Optional reference video downscaling / temporal downsampling so the LoRA can be driven by a coarse / low-FPS video control proxy.
Validation previews transform a supplied reference clip (video + audio) into a combined video preview that carries the generated audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of paired clips, each with an audio track:

`<name>_start.<ext>` — the reference clip (video + audio).
`<name>_end.<ext>` — the target clip (video + audio).
`<name>.txt` — optional caption.

Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Both the reference and the target of every pair must contain an audio track — silent clips are rejected with a clear error. Each `_start` must have a matching `_end`. File names must be unique across the archive. Aim for at least 10 pairs.

Minimum clip length: with `auto_scale_input` off (the default), both clips in each pair (`_start` and `_end`) must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A pair with either side too short is skipped, and if no pair qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
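The length rule above can be sketched as a small pre-flight filter. This is only an illustration: the function names are hypothetical, the frame counts are assumed to have been measured beforehand (for example with a probing tool), and the service's own check may differ in detail.

```python
def min_duration_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest clip duration that still holds number_of_frames frames."""
    return number_of_frames / frame_rate


def usable_pairs(frame_counts: dict[str, tuple[int, int]],
                 number_of_frames: int = 89) -> list[str]:
    """Return the names of pairs whose _start and _end both have enough frames.

    frame_counts maps a pair name to (start_frames, end_frames).
    """
    return [
        name
        for name, (start, end) in frame_counts.items()
        if start >= number_of_frames and end >= number_of_frames
    ]


counts = {"sample01": (120, 96), "sample02": (89, 60)}
print(round(min_duration_seconds(), 2))  # 3.71
print(usable_pairs(counts))              # ['sample01']
```

If `usable_pairs` returns an empty list with `auto_scale_input` off, the request would be expected to fail with a 422 according to the rule above.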

Example layout:


sample01_start.mp4   sample01_end.mp4   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02.txt
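Before uploading, the naming rules can be checked locally. The sketch below is a hypothetical helper, not part of the service: it groups entry names into pairs, skips hidden files and macOS metadata, and reports clips that lack a partner. It does not check for audio tracks or unique names.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}


def match_pairs(names: list[str]) -> tuple[dict[str, dict[str, str]], list[str]]:
    """Group archive entries into complete pairs and list unmatched clips."""
    pairs: dict[str, dict[str, str]] = {}
    for entry in names:
        path = PurePosixPath(entry)
        # Skip hidden files and macOS metadata folders.
        if any(part.startswith(".") or part == "__MACOSX" for part in path.parts):
            continue
        stem, ext = path.stem, path.suffix.lower()
        if ext in VIDEO_EXTS and stem.endswith("_start"):
            pairs.setdefault(stem[: -len("_start")], {})["start"] = entry
        elif ext in VIDEO_EXTS and stem.endswith("_end"):
            pairs.setdefault(stem[: -len("_end")], {})["end"] = entry
        elif ext == ".txt":
            pairs.setdefault(stem, {})["caption"] = entry

    problems = []
    for name, files in sorted(pairs.items()):
        if "start" in files and "end" not in files:
            problems.append(f"{name}: _start has no matching _end")
        elif "end" in files and "start" not in files:
            problems.append(f"{name}: _end has no matching _start")
    complete = {n: f for n, f in pairs.items() if "start" in f and "end" in f}
    return complete, problems
```

To run it against a real archive, pass `zipfile.ZipFile("av2av_pairs.zip").namelist()` as `names` (the file name here is a placeholder).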

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of `_start`/`_end` clip pairs (with audio).

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

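LoRA rank, which sets the adapter's capacity. Higher values can capture a stronger transformation but overfit more easily on small datasets.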

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`reference_downscale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Spatially downscale the reference video by this factor before encoding (coarse / low-resolution control proxy → full-resolution output). `1` means no downscaling. Both width and height must be divisible by the factor, and width ÷ factor and height ÷ factor must each be divisible by 32 (checked against both the training and validation resolutions); an incompatible value fails the request with a 422. (Applies to the video reference only; audio has no spatial knob.)
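The divisibility rule can be checked ahead of time. The sketch below is a minimal local check, assuming the bucket sizes listed under `resolution` in this document; the function name is hypothetical and the service's validation remains authoritative.

```python
# Resolution buckets as (width, height), copied from this document's resolution table.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def downscale_factor_ok(resolution: str, aspect_ratio: str, factor: int) -> bool:
    """True if both sides divide by factor and the quotients divide by 32."""
    width, height = BUCKETS[resolution][aspect_ratio]
    for side in (width, height):
        if side % factor != 0 or (side // factor) % 32 != 0:
            return False
    return True


print(downscale_factor_ok("medium", "1:1", 2))  # True: 768 / 2 = 384 = 12 * 32
print(downscale_factor_ok("high", "16:9", 2))   # False: 544 / 2 = 272 is not a multiple of 32
```

Because the factor is checked against both the training and validation resolutions, call the function once for each and require both results to be true.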

`reference_temporal_scale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Temporally downsample the reference video (lower FPS) by this factor before encoding. `1` means no change. `(number_of_frames − 1)` must be divisible by the factor, and after subsampling `(frames − 1)` must remain a multiple of 8 — checked against both `number_of_frames` and `validation_number_of_frames`. An incompatible factor/frame-count combination fails the request with a 422. (Example: with the default 89 frames, a factor of 2 is invalid because `(89 − 1) ÷ 2 = 44`, which is not a multiple of 8.)
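The frame-count rule is easy to get wrong, so here is a small sketch of it. It reads "after subsampling `(frames − 1)`" as `(number_of_frames − 1) ÷ factor`, which matches the example above; the function names are hypothetical.

```python
def temporal_factor_ok(frames: int, factor: int) -> bool:
    """True if (frames - 1) divides by factor and the quotient is a multiple of 8."""
    steps = frames - 1
    return steps % factor == 0 and (steps // factor) % 8 == 0


def valid_frame_counts(factor: int, low: int = 9, high: int = 121) -> list[int]:
    """Frame counts in [low, high] with frames % 8 == 1 that work with the factor."""
    return [
        n for n in range(low, high + 1)
        if n % 8 == 1 and temporal_factor_ok(n, factor)
    ]


print(temporal_factor_ok(89, 2))  # False: 88 / 2 = 44 is not a multiple of 8
print(valid_frame_counts(2))      # [17, 33, 49, 65, 81, 97, 113]
```

Under this reading, a factor of 2 requires `(number_of_frames − 1)` to be a multiple of 16, so 81 or 97 frames would be the nearest workable alternatives to the default 89. The same check applies to `validation_number_of_frames`.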

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
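The snapping rule can be illustrated as follows. This is a sketch of the arithmetic only; rejecting out-of-range values with an error is an assumption, and `snap_frames` is a hypothetical name.

```python
def snap_frames(requested: int) -> int:
    """Snap down to the nearest count with frames % 8 == 1."""
    if not 9 <= requested <= 121:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError("number_of_frames must be within 9-121")
    return requested - (requested - 1) % 8


print(snap_frames(89))   # 89 (already valid)
print(snap_frames(100))  # 97
print(snap_frames(16))   # 9
```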

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Leave this off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Audio Configuration

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
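For readers unfamiliar with the term, peak normalization scales a signal so that its loudest sample reaches a chosen level. The sketch below shows the general technique on a plain list of samples; the target level of 1.0 and the function name are assumptions, and the service's actual implementation is not documented here.

```python
def peak_normalize(samples: list[float], target_peak: float = 1.0) -> list[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


print(peak_normalize([0.1, -0.5, 0.25]))  # [0.2, -1.0, 0.5]
```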

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`reference_video_url` (`string`, required) — the reference video that conditions the generated audio+video. Its audio track also serves as the audio reference unless `reference_audio_url` is set.
`reference_audio_url` (`string`, optional) — a separate reference audio. If omitted, the audio is taken from the reference video's own track, so that video must contain an audio track or the validation sample is rejected with a 422.

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained IC-LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`video` — combined validation reel; the preview video carries the generated audio track.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
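As a quick illustration of the billing rule, the hypothetical helper below computes the unit count. The price per unit is not stated in this document, so it is left out.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)


print(billable_units(2000))                   # 2000
print(billable_units(100))                    # 100
print(billable_units(2000, succeeded=False))  # 0
```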

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, `_start`/`_end` pairs and captions matched, both clips verified to have audio, clips fit to the resolution bucket, and the video reference optionally downscaled / temporally downsampled. The target's audio is extracted from its own track; the reference's audio is extracted from the reference clip.
Training — the IC-LoRA trains for `number_of_steps`. Both modalities are generated: the video conditioned on the reference video, the audio conditioned on the reference audio. Validation previews run at intervals.
Output — the IC-LoRA, config, and validation reel (with generated audio) are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. Targets missing a reference cause a clear error.
Audio requirement: both clips of every pair must carry audio; silent pairs are rejected.
Video fitting: both clips are resized to fill the resolution bucket and center-cropped, staying aligned.
Reference scaling: when set above 1, the reference video is downscaled / temporally downsampled before encoding so the LoRA learns to drive full-resolution output from a coarse reference.
Captions: the trigger phrase (if set) is prepended.
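The video-fitting step (resize to fill, then center-crop) can be sketched geometrically. This is one plausible way to compute it, not the service's confirmed implementation; the function name and the rounding choices are assumptions.

```python
def fill_and_center_crop(src_w: int, src_h: int,
                         dst_w: int, dst_h: int) -> tuple[int, int, int, int]:
    """Return (scaled_w, scaled_h, crop_x, crop_y) for resize-to-fill plus center crop."""
    if dst_w * src_h >= dst_h * src_w:
        # Source is relatively narrower: match the width, let the height overflow.
        scaled_w = dst_w
        scaled_h = -(-src_h * dst_w // src_w)  # integer ceiling division
    else:
        # Source is relatively wider: match the height, let the width overflow.
        scaled_h = dst_h
        scaled_w = -(-src_w * dst_h // src_h)
    crop_x = (scaled_w - dst_w) // 2
    crop_y = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, crop_x, crop_y


# A 1920x1080 clip fitted to the medium 1:1 bucket (768x768).
print(fill_and_center_crop(1920, 1080, 768, 768))  # (1366, 768, 299, 0)
```

Applying the same geometry to the `_start` and `_end` clips of a pair is what keeps them spatially aligned after cropping.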

What an IC-LoRA Is

An IC-LoRA performs an in-context transformation: it conditions on a reference clip supplied at inference and produces a transformed result — here, both video and audio together. The trained file specializes the base model at your reference→target audiovisual mapping. At inference you supply a reference video (and optionally a separate reference audio) plus a prompt, and the LoRA produces the corresponding target clip with both generated video and audio.

Tips for Getting Good Results

Dataset Quality

Use at least 10 well-aligned reference→target pairs, each with clean, in-sync audio.
Reference and target should describe the same scene/sound differing only by the transformation you want learned.

Caption Best Practices

Describe the target's content and sound plainly, optionally with a trigger phrase.
Keep captions consistent in style across pairs.

Good caption: `av_st9le a neon-lit street with a synthwave soundtrack`
Weak caption: `street`

Trigger Phrases

A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.

Reference Scaling

To drive generation from a coarse video control map, set `reference_downscale_factor` above 1 and supply matching coarse references at inference; the validation localizer mirrors training-time scaling for you.

Inference Format Matching

At inference, supply a reference video (and optionally a separate reference audio), use the same trigger phrase, and apply the same reference scaling you trained with.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/av2av_pairs.zip",
  "trigger_phrase": "av_st9le",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "reference_downscale_factor": 1,
  "reference_temporal_scale_factor": 1,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "av_st9le a neon-lit street", "reference_video_url": "https://example.com/ref.mp4" }
  ]
}
```
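A configuration like the one above could be submitted from Python roughly as follows. This is a sketch under assumptions: it relies on the `fal-client` package and a `FAL_KEY` environment variable, `build_arguments` and `submit` are hypothetical helpers, and the example URLs are placeholders. The shape of the returned result is not shown because it is not specified here.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/ic-lora/av2av"


def build_arguments(training_data_url: str, trigger_phrase: str,
                    reference_video_url: str, prompt: str,
                    number_of_steps: int = 2000) -> dict:
    """Assemble a request payload, prefixing the validation prompt with the trigger."""
    if not 100 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be within 100-20000")
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "number_of_steps": number_of_steps,
        "validation": [
            {
                "prompt": f"{trigger_phrase} {prompt}".strip(),
                "reference_video_url": reference_video_url,
            }
        ],
    }


def submit(arguments: dict):
    """Send the training request and block until it finishes (needs network access)."""
    import fal_client  # third-party: pip install fal-client

    return fal_client.subscribe(ENDPOINT, arguments=arguments)


args = build_arguments(
    "https://example.com/av2av_pairs.zip",
    "av_st9le",
    "https://example.com/ref.mp4",
    "a neon-lit street",
)
print(args["validation"][0]["prompt"])  # av_st9le a neon-lit street
```

Parameters left out of the payload fall back to the defaults listed in the parameter reference.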

Diagnosing Issues

Transformation too weak: more steps, higher `rank`, or more consistent pairs.
Audio or video ignores the reference: ensure pairs are well aligned and differ only by the intended transformation; check reference scaling matches at inference.
Dataset rejected for silence: both clips of every pair must contain audio.
Overfitting: fewer steps, lower `rank`, more pairs.

Validation Prompt Tips

Use a fresh reference clip to gauge generalization.
Match the caption style and trigger phrase you trained with.

Common Pitfalls

Silent reference or target clips (rejected).
Missing or mismatched `_start`/`_end` pairs.
Using different reference scaling at inference than at training.

fal-ai/ltx23-trainer-v2/ic-lora/av2av
