fal-ai/ltx23-trainer-v2/a2v

Train a LoRA that generates video from a start image plus a conditioning audio track, producing motion that matches the sound.

Training cost scales linearly with the number of steps: 0.006 * steps. For example, a 1000-step request costs $6.00.


LTX 2.3 Trainer — Audio-to-Video (`/a2v`)

Overview

The `/a2v` endpoint trains a LoRA for the LTX 2.3 video model that generates video from a start image plus a conditioning audio track. The model conditions on the first frame (the start image) and on the audio, and learns to produce video that matches the sound, for example talking-head or lip-sync-style motion, or audio-reactive animation.

Key features:

  • Learns image + audio → video generation.
  • The start image is held as the first frame; the audio is frozen as conditioning (not generated, only used to drive the video).
  • Optional scene splitting after the start-image/audio assembly.
  • Validation previews require both a start image and a conditioning audio track.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) where each training example is a group of files sharing a base name:

  • `<name>_start.png` (or `.jpg`/`.jpeg`) — the start image (becomes the first frame). Alternatively a `<name>_start.mp4` (or `.mov`/`.avi`/`.mkv`) whose first frame is used.
  • `<name>_audio.wav` (or `.mp3`, `.ogg`, `.m4a`, `.aac`, `.flac`) — the conditioning audio. (If a `<name>_start.mp4` already contains an audio track, a separate audio file is not required.)
  • `<name>_end.mp4` (or `.mov`/`.avi`/`.mkv`) — the target video the model learns to produce.
  • `<name>.txt` — caption (optional only if a `trigger_phrase` is set; otherwise required).

Each group needs a start image/video, an audio source, and an `_end` target. File names must be unique across the archive. Source clips may have different native sizes; preprocessing resizes and center-crops each sample to the chosen training resolution.

Minimum clip length: with `auto_scale_input` off (the default), each target (`_end`) video must have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps), because the synthesized training clip inherits the target's length. Shorter targets are silently skipped, and if every target is too short the request fails (HTTP 422). Turn on `auto_scale_input` to resample clips to the target frame count and frame rate instead.
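The skip rule can be checked locally before uploading. The following is a minimal sketch, assuming you already know the frame count of each `_end` video (for example from a media probe tool). The function name and the `frame_counts` mapping are hypothetical and not part of the service.

```python
def usable_targets(frame_counts, number_of_frames=89):
    """Return base names whose `_end` clip is long enough to train on.

    frame_counts: hypothetical mapping of base name -> frames in the `_end` video.
    Mirrors the documented rule with `auto_scale_input` off: clips shorter than
    `number_of_frames` are skipped, and a dataset with no usable clip is an error.
    """
    kept = [name for name, frames in frame_counts.items() if frames >= number_of_frames]
    if not kept:
        # The service is documented to reject this case with HTTP 422.
        raise ValueError("every target is shorter than number_of_frames")
    return kept
```

Running this over your own clip list shows which groups would be dropped silently, so you can either lengthen them or enable `auto_scale_input`.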

Example layout:

clip01_start.png   clip01_audio.wav   clip01_end.mp4   clip01.txt
clip02_start.png   clip02_audio.mp3   clip02_end.mp4   clip02.txt

Audio is required on every clip. Each group must provide an audio source (a separate `_audio` file, or a `_start` video that carries an audio track). If any group has no usable audio, the request is rejected (HTTP 422).
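The grouping rules above can be approximated locally to catch incomplete groups before a request is rejected. This is a sketch under stated assumptions: the helper names are hypothetical, the real preprocessing may differ in detail, and the check cannot tell whether a `_start` video actually contains an audio track.

```python
from collections import defaultdict
from pathlib import PurePosixPath

IMAGE_EXT = {".png", ".jpg", ".jpeg"}
VIDEO_EXT = {".mp4", ".mov", ".avi", ".mkv"}
AUDIO_EXT = {".wav", ".mp3", ".ogg", ".m4a", ".aac", ".flac"}


def group_files(names):
    """Group archive member names by base name into start/audio/end/caption roles."""
    groups = defaultdict(dict)
    for name in names:
        path = PurePosixPath(name)
        if "__MACOSX" in path.parts:
            continue  # macOS metadata folders are ignored
        stem, ext = path.stem, path.suffix.lower()
        if stem.endswith("_start") and ext in IMAGE_EXT | VIDEO_EXT:
            groups[stem[: -len("_start")]]["start"] = name
        elif stem.endswith("_audio") and ext in AUDIO_EXT:
            groups[stem[: -len("_audio")]]["audio"] = name
        elif stem.endswith("_end") and ext in VIDEO_EXT:
            groups[stem[: -len("_end")]]["end"] = name
        elif ext == ".txt":
            groups[stem]["caption"] = name
    return dict(groups)


def find_problems(groups, has_trigger_phrase=False):
    """Return readable problems; an empty list means every group looks complete."""
    problems = []
    for base, files in sorted(groups.items()):
        start = files.get("start", "")
        start_is_video = PurePosixPath(start).suffix.lower() in VIDEO_EXT
        if "start" not in files:
            problems.append(f"{base}: missing _start image/video")
        if "end" not in files:
            problems.append(f"{base}: missing _end target video")
        # A start video may carry its own audio track; this sketch cannot verify that.
        if "audio" not in files and not start_is_video:
            problems.append(f"{base}: missing audio source")
        if "caption" not in files and not has_trigger_phrase:
            problems.append(f"{base}: missing caption .txt")
    return problems
```

The member names can come from the standard library, for example `zipfile.ZipFile("a2v_groups.zip").namelist()`, where the archive path is a placeholder.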

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio-to-video groups.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

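LoRA rank, which sets the adapter's capacity. Higher ranks can capture more detail but overfit more easily on small datasets.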

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
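The snapping rule can be written as a one-line calculation. This is a minimal sketch. Clamping out-of-range values to the documented range is an assumption made here for illustration, since the service may reject them instead, and the function name is hypothetical.

```python
def snap_frame_count(frames, minimum=9, maximum=121):
    """Snap down to the nearest count satisfying frames % 8 == 1.

    Clamping to [minimum, maximum] is an assumption of this sketch.
    """
    clamped = max(minimum, min(maximum, frames))
    return ((clamped - 1) // 8) * 8 + 1


print(snap_frame_count(89))  # 89 (already valid)
print(snap_frame_count(96))  # 89 (snapped down)
print(snap_frame_count(97))  # 97
```

Valid counts in the documented range are therefore 9, 17, 25, and so on up to 121.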

`frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Target frames per second.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

Pixel dimensions (width×height) by resolution and aspect ratio:

Resolution   16:9      1:1       9:16
low          512×288   512×512   288×512
medium       768×448   768×768   448×768
high         960×544   960×960   544×960
`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
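The `resolution` and `aspect_ratio` settings together select one size from the table above. A small lookup sketch makes the mapping explicit. The dictionary and function names are hypothetical, and the values are copied from the table.

```python
# (width, height) in pixels, keyed by resolution then aspect ratio.
DIMENSIONS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def training_size(resolution="medium", aspect_ratio="1:1"):
    """Return the (width, height) bucket for the given settings."""
    return DIMENSIONS[resolution][aspect_ratio]


print(training_size())                # (768, 768), the documented defaults
print(training_size("high", "9:16"))  # (544, 960)
```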

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `false`

When true, synthesized training videos above the duration threshold are split into scenes after the start-image/audio assembly. Off by default for this mode.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration above which a synthesized clip is eligible for scene splitting.

Audio Configuration
`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize the conditioning audio for consistent loudness across the dataset.
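For readers unfamiliar with the term, peak normalization scales a signal so that its loudest sample reaches a chosen level. The sketch below illustrates the general technique on a list of float samples. It is not the service's implementation, and the `target_peak` value is an assumption, since the page does not state the level used.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Because every clip ends up with the same peak level, quiet and loud recordings present a more uniform conditioning signal across the dataset.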

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the video duration (instead of trimming/padding).

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `image_url` (`string`, required) — the start image used as the first frame.
  • `audio_url` (`string`, required) — the conditioning audio track that drives the video.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
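The billing rule can be combined with the per-step price quoted at the top of this page (0.006 per step) to estimate a charge. This sketch assumes one billable unit corresponds to one step at that price, which the page implies but does not state outright. The function and constant names are hypothetical.

```python
PRICE_PER_UNIT = 0.006  # dollars; assumes one billable unit is priced like one step


def billable_units(number_of_steps, succeeded=True):
    """Units billed: max(100, steps) on success, 0 for runs that fail early."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost(number_of_steps, succeeded=True):
    """Estimated charge in dollars, rounded to cents."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT, 2)


print(estimated_cost(1000))                   # 6.0
print(estimated_cost(2000))                   # 12.0
print(estimated_cost(2000, succeeded=False))  # 0.0
```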

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted and grouped by base name; each group's start image is placed as the first frame, the conditioning audio is attached, and the target video is fit to the resolution bucket. Groups above the threshold are optionally scene-split afterward.
  2. Training — the LoRA trains for `number_of_steps`, conditioning on the first frame and on the (frozen) audio, learning to generate the target video. Validation previews run at intervals.
  3. Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
  • Grouping: files are grouped by base name into (start image/video, audio, target video, caption). If a start video with an audio track is provided, its first frame becomes the start image and its track becomes the conditioning audio.
  • Assembly: the start image is written into the first frame of each target clip, and the conditioning audio is attached as the clip's audio track. (Audio is normalized and fit to the clip during preprocessing, per `audio_normalize` / `audio_preserve_pitch`.)
  • Video fitting: target clips are resized to fill the resolution bucket and center-cropped.
  • Captions: the trigger phrase (if set) is prepended.
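The video-fitting step (resize to fill, then center-crop) can be illustrated with integer arithmetic. This is a sketch of the general technique, not the service's code. The rounding choice (rounding the resized side up so the frame always covers the target) is an assumption, and the function name is hypothetical.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return (resized_w, resized_h, crop_left, crop_top) for a fill-then-crop fit."""
    if src_w * dst_h >= src_h * dst_w:
        # Source is at least as wide as the target shape: match heights, crop width.
        resized_h = dst_h
        resized_w = -((-src_w * dst_h) // src_h)  # integer ceiling division
    else:
        # Source is taller than the target shape: match widths, crop height.
        resized_w = dst_w
        resized_h = -((-src_h * dst_w) // src_w)
    crop_left = (resized_w - dst_w) // 2
    crop_top = (resized_h - dst_h) // 2
    return resized_w, resized_h, crop_left, crop_top


# A 1920x1080 clip fit to the medium 1:1 bucket (768x768):
print(fill_and_center_crop(1920, 1080, 768, 768))  # (1366, 768, 299, 0)
```

In this example roughly 299 pixels are cropped from each side of the resized frame, so subjects near the left or right edge of a landscape clip can be lost in a square bucket.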
How Conditioning Works

The first frame (your start image) is held fixed and the audio is used only as a driving signal — it is not regenerated. The LoRA learns to produce video that begins from the supplied image and moves in a way consistent with the audio.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 groups; more variety in speakers/sounds/motions helps.
  • Start images should be clean, sharp, and representative of the first frame you want at inference.
  • The audio track and target motion should genuinely correspond (e.g. matching speech and lip movement).
Caption Best Practices
  • Describe the scene and motion plainly, optionally with a trigger phrase.
  • Keep captions consistent in style across groups.

Good caption: `a person speaking to camera in a bright office`
Weak caption: `talking`

Trigger Phrases
  • Use a distinctive trigger phrase to invoke the behavior cleanly; include it in every caption and at inference.
Inference Format Matching

At inference, supply both a start image and a conditioning audio track, and use the same caption style and trigger phrase you trained with.

Example training request:

```json
{
  "training_data_url": "https://example.com/a2v_groups.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a person speaking to camera",
      "image_url": "https://example.com/face.png",
      "audio_url": "https://example.com/speech.wav"
    }
  ]
}
```
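A request like the one above can be assembled and sanity-checked locally before submission. The sketch below uses only the defaults and ranges documented in the parameter reference. The function name is hypothetical, the checks are a convenience rather than the service's own validation, and the submission step is left to whichever client you use.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def build_a2v_request(training_data_url, validation=(), **overrides):
    """Build a request dict from documented defaults and check documented limits."""
    request = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
        "validation": list(validation),
    }
    request.update(overrides)

    if request["rank"] not in ALLOWED_RANKS:
        raise ValueError("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= request["number_of_steps"] <= 20000:
        raise ValueError("number_of_steps must be in 100..20000")
    if not 9 <= request["number_of_frames"] <= 121:
        raise ValueError("number_of_frames must be in 9..121")
    if not 8 <= request["frame_rate"] <= 60:
        raise ValueError("frame_rate must be in 8..60")
    if request["resolution"] not in {"low", "medium", "high"}:
        raise ValueError("resolution must be low, medium, or high")
    if request["aspect_ratio"] not in {"16:9", "1:1", "9:16"}:
        raise ValueError("aspect_ratio must be 16:9, 1:1, or 9:16")
    if len(request["validation"]) > 2:
        raise ValueError("at most 2 validation samples are allowed")
    for sample in request["validation"]:
        # Validation previews need both a start image and a conditioning audio track.
        if not sample.get("image_url") or not sample.get("audio_url"):
            raise ValueError("each validation sample needs image_url and audio_url")
    return request
```

Catching these mistakes locally avoids a round trip that would otherwise end in an input-validation error.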
Diagnosing Issues
  • Motion does not match the audio: ensure your training pairs genuinely correspond; try lowering `learning_rate` if motion looks unstable, or add more well-matched data.
  • Color drift / artifacts over time: lower the `learning_rate` and/or add more varied, clean data.
  • Overfitting: fewer steps, lower `rank`, more groups.
  • Dataset errors: each group needs a start image/video, an audio source, and an `_end` target with matching base names.
Validation Prompt Tips
  • Provide a fresh start image and audio track (not from training) to gauge generalization.
  • Make the audio length reasonable relative to `validation_number_of_frames` / `validation_frame_rate`.
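The length of a validation clip follows directly from the frame count and frame rate, which gives a target duration for the validation audio. In this minimal sketch the `tolerance` value is an illustrative assumption and is not a documented limit. The function names are hypothetical.

```python
def validation_clip_seconds(validation_number_of_frames=89, validation_frame_rate=24):
    """Duration of the validation clip in seconds."""
    return validation_number_of_frames / validation_frame_rate


def audio_length_is_reasonable(audio_seconds, tolerance=0.5, **settings):
    """True if the audio is within `tolerance` seconds of the clip duration."""
    return abs(audio_seconds - validation_clip_seconds(**settings)) <= tolerance


print(round(validation_clip_seconds(), 2))  # 3.71
```

With the defaults, an audio track of roughly 3.7 seconds lines up with the generated preview.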
Common Pitfalls
  • Missing audio source for a group (no `_audio` file and no audio track in the start video).
  • Start image that does not match the target footage style.
  • Forgetting to supply both image and audio at inference.