LTX 2.3 Trainer (V2) - Video-to-Audio (Training) API on fal

LTX 2.3 Trainer — Video-to-Audio / Foley (`/v2a`)

Overview

The `/v2a` endpoint trains a LoRA for the LTX 2.3 model that generates audio for a silent video (foley / sound design). The video is frozen as conditioning and the model learns to produce a matching soundtrack — footsteps, ambient sound, sound effects, or speech-like audio that fits the on-screen action.

Key features:

Learns video → audio generation (the video is held fixed; only audio is generated).
Trained on videos that already contain the audio you want the model to learn to recreate.
Validation previews generate audio for a supplied silent video; the output preview is an audio file.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain videos that contain audio:

Videos: `.mp4`, `.mov`, `.avi`, `.mkv` — every clip must carry an audio track (the model learns to recreate it).
Captions: a `.txt` file with the same base name as each video (optional but recommended).

Image datasets are rejected (foley requires motion over time). Aim for at least 10 clips. File names must be unique across the archive.
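The archive rules above can be checked locally before uploading. The following is a minimal sketch, not part of the trainer: `pack_dataset` is a hypothetical helper, the flat archive layout is an assumption, and uniqueness is enforced on base names (stricter than the stated rule) so that each caption maps to exactly one video.

```python
import zipfile
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}  # extensions listed in Dataset Format


def pack_dataset(src_dir, zip_path):
    """Zip videos plus same-named .txt captions; return videos that have no caption."""
    videos = sorted(
        p for p in Path(src_dir).rglob("*")
        if p.is_file()
        and p.suffix.lower() in VIDEO_EXTS
        and not p.name.startswith(".")  # skip hidden files such as macOS "._" entries
    )
    if not videos:
        raise ValueError("no video files found")

    # Assumption: treat a shared base name as a collision, since captions pair by base name.
    stems = [v.stem for v in videos]
    dupes = sorted({s for s in stems if stems.count(s) > 1})
    if dupes:
        raise ValueError(f"duplicate base names: {dupes}")

    uncaptioned = []
    with zipfile.ZipFile(zip_path, "w") as zf:
        for video in videos:
            zf.write(video, arcname=video.name)  # flat layout (assumption)
            caption = video.with_suffix(".txt")
            if caption.is_file():
                zf.write(caption, arcname=caption.name)
            else:
                uncaptioned.append(video.name)  # captions are optional, so only report
    return uncaptioned
```

Calling `pack_dataset("clips", "foley_clips.zip")` writes the archive and returns the names of videos that would train without a caption.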

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (HTTP 422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead.
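The length rule can be approximated offline to see which clips would be skipped. This is an illustrative sketch only: `min_clip_seconds` and `split_by_length` are hypothetical helpers, and the frame counts are assumed to have been measured beforehand.

```python
def min_clip_seconds(number_of_frames=89, frame_rate=24):
    """Shortest usable clip duration when auto_scale_input is off."""
    return number_of_frames / frame_rate


def split_by_length(clip_frames, number_of_frames=89, auto_scale_input=False):
    """clip_frames maps clip name -> frame count. Returns (kept, skipped)."""
    if auto_scale_input:
        # Shorter clips are resampled to the target frame count, so nothing is skipped.
        return sorted(clip_frames), []
    kept = sorted(n for n, f in clip_frames.items() if f >= number_of_frames)
    skipped = sorted(n for n, f in clip_frames.items() if f < number_of_frames)
    if not kept:
        # Mirrors the documented 422 message when every clip is too short.
        raise ValueError("All training videos are too short to be trainable")
    return kept, skipped
```

With the defaults, `min_clip_seconds()` is 89 / 24, roughly 3.71 seconds.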

Audio is required on every clip. Video-to-audio training learns to generate the soundtrack for each clip, so every training video must contain an audio track; any silent clip causes the request to be rejected (HTTP 422).
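Since one silent clip fails the whole request, it can help to scan the dataset first. The sketch below assumes `ffprobe` (from FFmpeg) is installed locally; `silent_clips` is a hypothetical helper, and the check only confirms that an audio stream exists, not that it contains audible sound.

```python
import subprocess


def ffprobe_audio_streams(path):
    """Return ffprobe's listing of audio streams: one 'audio' line per stream."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "csv=p=0", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def silent_clips(paths, probe=ffprobe_audio_streams):
    """Return the clips for which the probe reports no audio stream."""
    return [str(p) for p in paths if "audio" not in probe(p)]
```

The `probe` argument is injectable so the logic can be exercised without FFmpeg; in normal use, `silent_clips(Path("clips").glob("*.mp4"))` lists the files to fix or drop.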

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of videos with audio.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration

`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count. This also sets how long a window of conditioning video is used.
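The `frames % 8 == 1` rule and the snap-down behaviour can be written out as a small calculation. This is a sketch under assumptions: `snap_frame_count` is a hypothetical name, and rejecting out-of-range values (rather than clamping them) is a guess about how the range is enforced.

```python
def snap_frame_count(requested, lo=9, hi=121):
    """Largest count <= requested that satisfies count % 8 == 1."""
    if not lo <= requested <= hi:
        raise ValueError(f"number_of_frames must be in [{lo}, {hi}]")
    return ((requested - 1) // 8) * 8 + 1


def window_seconds(number_of_frames, frame_rate=24):
    """Length of the conditioning window implied by the frame count."""
    return snap_frame_count(number_of_frames) / frame_rate
```

For example, a request for 96 frames snaps down to 89, while 97 is already valid and is kept.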

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
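For scripting, the table can be expressed as a lookup. This is a convenience sketch; `BUCKETS` and `bucket_size` are hypothetical names, and the sizes are read as width × height.

```python
# (width, height) per resolution and aspect ratio, copied from the table above.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution="medium", aspect_ratio="1:1"):
    """Return the target (width, height); defaults match the documented defaults."""
    try:
        return BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}")
```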

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

Off by default for this mode: every clip must keep an audio track, and scene splitting can produce silent trailing segments that would fail the per-clip audio check. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting (only relevant if you enable splitting).

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`video_url` (`string`, required) — a silent video the model should generate audio for.
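A small builder can enforce the two constraints stated here (at most 2 entries, `video_url` required) before a request is sent. `make_validation` is a hypothetical helper, and the URL is a placeholder.

```python
def make_validation(samples, max_entries=2):
    """samples: iterable of (prompt, video_url) pairs -> list for the `validation` field."""
    entries = []
    for prompt, video_url in samples:
        if not video_url:
            raise ValueError("video_url is required for every validation sample")
        entries.append({"prompt": prompt, "video_url": video_url})
    if len(entries) > max_entries:
        raise ValueError(f"at most {max_entries} validation entries are allowed")
    return entries


validation = make_validation(
    [("footsteps on gravel", "https://example.com/silent_walk.mp4")]
)
```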

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`audio` — a combined preview of the generated validation audio (this mode's preview is audio, not video).
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
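The billing rule translates directly into a one-line function. This sketch only counts units; `billable_units` is a hypothetical name, and no price per unit is assumed.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0
```

With the default `number_of_steps` of 2000, a successful run is billed 2000 units; the same request rejected with HTTP 422 is billed 0.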

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, each video is verified to contain an audio track, clips are fit to the resolution bucket, and the audio is prepared.
Training — the LoRA trains for `number_of_steps` with the video frozen as conditioning and the audio as the generation target. Validation previews run at intervals.
Output — the LoRA, config, and combined audio preview are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
Audio requirement: every clip is checked for an audio track; a single silent clip fails the request with an explicit error, so training never starts on a mismatched dataset.
Video fitting: clips are resized to fill the resolution bucket and center-cropped; the audio is the learning target.
Captions: the trigger phrase (if set) is prepended.
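The "resize to fill, then center-crop" step determines how much of each frame survives. The geometry can be sketched as follows; `fill_and_center_crop` is a hypothetical helper, and the rounding convention (round the scaled side up, floor the crop offset) is an assumption rather than the trainer's documented behaviour.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return ((resized_w, resized_h), (left, top, right, bottom)) for a fill + center crop."""
    if src_w * dst_h >= dst_w * src_h:
        # Source is at least as wide as the target shape: match heights, crop the sides.
        new_h = dst_h
        new_w = -(-src_w * dst_h // src_h)  # integer ceiling division
    else:
        # Source is taller than the target shape: match widths, crop top and bottom.
        new_w = dst_w
        new_h = -(-src_h * dst_w // src_w)
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)
```

For a 1920×1080 source and the 768×768 bucket, the frame is resized to 1366×768 and roughly 299 pixels are trimmed from each side, so action near the left and right edges is not seen during training.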

How Foley Training Works

The video frames are held fixed during training and the model only generates audio, learning the relationship between what is on screen and how it should sound. At inference you supply a silent video and a prompt, and the LoRA produces a soundtrack for it.

Tips for Getting Good Results

Dataset Quality

Use at least 10 clips whose audio genuinely matches the action (clean, well-synced sound).
Avoid clips with background music or unrelated noise if you want a clean sound effect — the model learns whatever is in the track.
Provide pre-split clips at the right length rather than relying on scene splitting.

Caption Best Practices

Describe the sound and the on-screen action plainly, optionally with a trigger phrase.
Keep captions consistent in style across clips.

Good caption: `footsteps on gravel as a person walks toward camera`
Weak caption: `walking`

Trigger Phrases

A distinctive trigger phrase helps invoke a specific sound character cleanly; include it in every caption and at inference.
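To keep prompts consistent, a helper can make sure the phrase leads each prompt exactly once. This is a sketch under assumptions: `with_trigger` is a hypothetical helper, `FOLEYX` is a made-up trigger phrase, and joining with a single space is a guess about the expected format.

```python
def with_trigger(text, trigger_phrase=""):
    """Return text with the trigger phrase at the front, without duplicating it."""
    text = text.strip()
    trigger = trigger_phrase.strip()
    if not trigger or text.lower().startswith(trigger.lower()):
        return text
    return f"{trigger} {text}"


prompt = with_trigger("footsteps on gravel", "FOLEYX")  # "FOLEYX footsteps on gravel"
```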

Inference Format Matching

At inference, supply a silent video and use the same caption style and trigger phrase you trained with.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/foley_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "footsteps on gravel", "video_url": "https://example.com/silent_walk.mp4" }
  ]
}
```
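Because requests that fail input validation return HTTP 422, it can save a round trip to check a configuration against the documented ranges first. The sketch below is a local pre-flight check only; `check_config` is a hypothetical helper and the service's own validation may differ.

```python
# Ranges and allowed values copied from the Input Parameters Reference.
RANGES = {
    "number_of_steps": (100, 20000),
    "number_of_frames": (9, 121),
    "frame_rate": (8, 60),
    "split_input_duration_threshold": (1.0, 60.0),
    "validation_number_of_frames": (9, 121),
    "validation_frame_rate": (8, 60),
    "stg_scale": (0.0, 3.0),
}
CHOICES = {
    "rank": {8, 16, 32, 64, 128},
    "resolution": {"low", "medium", "high"},
    "aspect_ratio": {"16:9", "1:1", "9:16"},
}


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if not cfg.get("training_data_url"):
        problems.append("training_data_url is required")
    for key, (lo, hi) in RANGES.items():
        if key in cfg and not lo <= cfg[key] <= hi:
            problems.append(f"{key} must be in [{lo}, {hi}]")
    for key, allowed in CHOICES.items():
        if key in cfg and cfg[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    validation = cfg.get("validation", [])
    if len(validation) > 2:
        problems.append("validation accepts at most 2 entries")
    if any(not entry.get("video_url") for entry in validation):
        problems.append("every validation entry needs a video_url")
    return problems
```

An empty result for the configuration above means it is within the documented limits; dataset-level checks (audio tracks, clip length) still happen on the service side.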

Diagnosing Issues

Audio is generic or unrelated to the video: add more clips where sound clearly matches action; keep captions specific.
Audio quality is poor: ensure source tracks are clean; remember the preview is a single-stage approximation and final inference quality differs.
Overfitting: fewer steps, lower `rank`, more clips.
Dataset errors: ensure all clips are videos (not images) and every clip contains audio.

Validation Prompt Tips

Use a fresh silent video to gauge generalization.
Keep the prompt focused on the sound you expect.

Common Pitfalls

Including silent clips (rejected).
Enabling scene splitting and producing silent trailing segments.
Background music polluting the learned sound.

Endpoint ID: `fal-ai/ltx23-trainer-v2/v2a`
