fal-ai/ltx23-trainer-v2/a2v
The cost of training depends on the number of steps. The formula is: 0.006 * steps. With 1000 steps, your request will cost $6.00.
LTX 2.3 Trainer — Audio-to-Video (`/a2v`)
Overview
The `/a2v` endpoint trains a LoRA for the LTX 2.3 video model that generates video driven by a start image plus a conditioning audio track. The model conditions on the first frame (the start image) and the audio, and learns to produce a video that matches the sound — for example talking-head / lip-sync-style motion, or audio-reactive animation.
Key features:
- Learns image + audio → video generation.
- The start image is held as the first frame; the audio is frozen as conditioning (not generated, only used to drive the video).
- Optional scene splitting after the start-image/audio assembly.
- Validation previews require both a start image and a conditioning audio track.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) where each training example is a group of files sharing a base name:
- `<name>_start.png` (or `.jpg`/`.jpeg`): the start image (becomes the first frame). Alternatively, a `<name>_start.mp4` (or `.mov`/`.avi`/`.mkv`) whose first frame is used.
- `<name>_audio.wav` (or `.mp3`, `.ogg`, `.m4a`, `.aac`, `.flac`): the conditioning audio. (If a `<name>_start.mp4` already contains an audio track, a separate audio file is not required.)
- `<name>_end.mp4` (or `.mov`/`.avi`/`.mkv`): the target video the model learns to produce.
- `<name>.txt`: caption (optional only if a `trigger_phrase` is set; otherwise required).
Each group needs a start image/video, an audio source, and an `_end` target. File names must be unique across the archive. Source clips may have different native sizes; preprocessing resizes and center-crops each sample to the chosen training resolution.
Minimum clip length: with `auto_scale_input` off (the default), each target (`_end`) video must have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps) — the synthesized training clip inherits the target's length. Shorter targets are silently skipped, and if every one is too short the request fails (422). Turn on `auto_scale_input` to resample instead.
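As a rough illustration of the minimum-length rule, the sketch below converts a frame count into a duration and checks whether a target clip is long enough. The helper names are hypothetical, and it assumes you already know each clip's frame count and frame rate (for example from a probing tool); it is not the service's own check.

```python
def min_target_duration_seconds(number_of_frames: int = 89, fps: float = 24.0) -> float:
    """Duration (seconds) of a clip that holds exactly `number_of_frames` frames at `fps`."""
    if number_of_frames < 1 or fps <= 0:
        raise ValueError("number_of_frames must be >= 1 and fps must be > 0")
    return number_of_frames / fps


def has_enough_frames(clip_frame_count: int, number_of_frames: int = 89) -> bool:
    """True if a target clip would pass the minimum-length rule with auto_scale_input off."""
    return clip_frame_count >= number_of_frames


print(round(min_target_duration_seconds(89, 24), 2))
print(has_enough_frames(88), has_enough_frames(89))
```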
Example layout:
```
clip01_start.png
clip01_audio.wav
clip01_end.mp4
clip01.txt
clip02_start.png
clip02_audio.mp3
clip02_end.mp4
clip02.txt
```
Audio is required on every clip. Each group must provide an audio source (a separate `_audio` file, or a `_start` video that carries an audio track). If any group has no usable audio, the request is rejected (HTTP 422).
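Before uploading, it can help to check the archive locally against the layout described above. The following is a minimal sketch, not the trainer's own validator: the function names are hypothetical, and it cannot tell whether a `_start` video actually carries an audio track, so it only flags groups that have neither an `_audio` file nor a `_start` video.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}
AUDIO_EXTS = {".wav", ".mp3", ".ogg", ".m4a", ".aac", ".flac"}


def group_files(names):
    """Group archive member names by base name into start/audio/end/caption roles."""
    groups = defaultdict(dict)
    for name in names:
        path = PurePosixPath(name)
        if name.endswith("/") or "__MACOSX" in path.parts:
            continue  # skip directory entries and macOS metadata
        stem, ext = path.stem, path.suffix.lower()
        if stem.endswith("_start") and ext in IMAGE_EXTS | VIDEO_EXTS:
            groups[stem[: -len("_start")]]["start"] = name
        elif stem.endswith("_audio") and ext in AUDIO_EXTS:
            groups[stem[: -len("_audio")]]["audio"] = name
        elif stem.endswith("_end") and ext in VIDEO_EXTS:
            groups[stem[: -len("_end")]]["end"] = name
        elif ext == ".txt":
            groups[stem]["caption"] = name
    return dict(groups)


def find_problems(groups, trigger_phrase=""):
    """Return human-readable problems for groups that look incomplete."""
    problems = []
    for base, files in sorted(groups.items()):
        if "start" not in files:
            problems.append(f"{base}: missing _start image/video")
        if "end" not in files:
            problems.append(f"{base}: missing _end target video")
        start_ext = PurePosixPath(files.get("start", "")).suffix.lower()
        # Assumption: a _start video *may* carry audio, so it is not flagged here.
        if "audio" not in files and start_ext not in VIDEO_EXTS:
            problems.append(f"{base}: missing audio source")
        if "caption" not in files and not trigger_phrase:
            problems.append(f"{base}: missing caption (.txt) and no trigger_phrase set")
    return problems


def check_archive(zip_path, trigger_phrase=""):
    """Open a local .zip and report layout problems."""
    with zipfile.ZipFile(zip_path) as archive:
        return find_problems(group_files(archive.namelist()), trigger_phrase)
```

Calling `check_archive("a2v_groups.zip")` (a placeholder path) returns an empty list when every group looks complete.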
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of audio-to-video groups.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
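The snapping rule can be sketched as follows. The function name is hypothetical, and raising an error for out-of-range values is an assumption; the text above only documents the range and the snap-down behaviour.

```python
def snap_number_of_frames(requested: int, lo: int = 9, hi: int = 121) -> int:
    """Snap a frame count down to the nearest value satisfying frames % 8 == 1."""
    if not lo <= requested <= hi:
        # Assumption: out-of-range values are treated as errors in this sketch.
        raise ValueError(f"number_of_frames must be in {lo}-{hi}")
    return ((requested - 1) // 8) * 8 + 1


for n in (89, 90, 100, 121):
    print(n, "->", snap_number_of_frames(n))
```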
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `false`
When true, synthesized training videos above the duration threshold are split into scenes after the start-image/audio assembly. Off by default for this mode.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration above which a synthesized clip is eligible for scene splitting.
Audio Configuration
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize the conditioning audio for consistent loudness across the dataset.
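For readers unfamiliar with the term, peak normalization scales a signal so that its largest absolute sample reaches a target level. The sketch below shows the general technique on a list of float samples; the target peak of `1.0` and the function name are assumptions, and the trainer's actual implementation is not documented here.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence (or empty input) is returned unchanged
    gain = target_peak / peak
    return [s * gain for s in samples]


print(peak_normalize([0.1, -0.5, 0.25]))
```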
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to the video duration (instead of trimming/padding).
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `image_url` (`string`, required): the start image used as the first frame.
- `audio_url` (`string`, required): the conditioning audio track that drives the video.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel (when validation samples were provided).
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
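The billing rule can be expressed as a small helper. This is a sketch with hypothetical function names; it assumes one billable unit costs the $0.006 per step quoted in this endpoint's pricing formula, which the text does not state explicitly.

```python
PRICE_PER_UNIT_USD = 0.006  # assumption: one billable unit equals the quoted per-step price


def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps: int, succeeded: bool = True) -> float:
    """Estimated cost in USD under the per-unit price assumption above."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 2)


print(billable_units(50), estimated_cost_usd(1000), estimated_cost_usd(2000))
```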
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted and grouped by base name; each group's start image is placed as the first frame, the conditioning audio is attached, and the target video is fit to the resolution bucket. Groups above the threshold are optionally scene-split afterward.
- Training: the LoRA trains for `number_of_steps`, conditioning on the first frame and on the (frozen) audio, learning to generate the target video. Validation previews run at intervals.
- Output: the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
- Grouping: files are grouped by base name into (start image/video, audio, target video, caption). If a start video with an audio track is provided, its first frame becomes the start image and its track becomes the conditioning audio.
- Assembly: the start image is written into the first frame of each target clip, and the group's conditioning audio is attached. (Audio is normalized and fit to the clip during preprocessing, per `audio_normalize`/`audio_preserve_pitch`.)
- Video fitting: target clips are resized to fill the resolution bucket and center-cropped.
- Captions: the trigger phrase (if set) is prepended.
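The video-fitting step (resize to fill the bucket, then center-crop) can be illustrated with a little geometry. This is a sketch of the general technique using the bucket sizes from the resolution table; the function name and the exact rounding are assumptions, and the trainer's own preprocessing may differ in detail.

```python
# Bucket sizes (width, height) copied from the resolution table.
BUCKETS = {
    ("low", "16:9"): (512, 288), ("low", "1:1"): (512, 512), ("low", "9:16"): (288, 512),
    ("medium", "16:9"): (768, 448), ("medium", "1:1"): (768, 768), ("medium", "9:16"): (448, 768),
    ("high", "16:9"): (960, 544), ("high", "1:1"): (960, 960), ("high", "9:16"): (544, 960),
}


def fill_and_center_crop(src_w, src_h, resolution="medium", aspect_ratio="1:1"):
    """Return the scaled size and the crop box (left, top, right, bottom)."""
    dst_w, dst_h = BUCKETS[(resolution, aspect_ratio)]
    # Scale so the frame covers the bucket in both dimensions ("fill").
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the bucket inside the scaled frame.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)


print(fill_and_center_crop(1920, 1080, "medium", "1:1"))
print(fill_and_center_crop(1920, 1080, "medium", "16:9"))
```

For a 1920×1080 source, the square bucket discards a large part of the frame width, which is worth keeping in mind when choosing `aspect_ratio`.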
How Conditioning Works
The first frame (your start image) is held fixed and the audio is used only as a driving signal — it is not regenerated. The LoRA learns to produce video that begins from the supplied image and moves in a way consistent with the audio.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 groups; more variety in speakers/sounds/motions helps.
- Start images should be clean, sharp, and representative of the first frame you want at inference.
- The audio track and target motion should genuinely correspond (e.g. matching speech and lip movement).
Caption Best Practices
- Describe the scene and motion plainly, optionally with a trigger phrase.
- Keep captions consistent in style across groups.
Good caption: `a person speaking to camera in a bright office`
Weak caption: `talking`
Trigger Phrases
- Use a distinctive trigger phrase to invoke the behavior cleanly. Set it via `trigger_phrase` so that it is prepended to every caption during training, and include it in your prompt at inference.
Inference Format Matching
At inference, supply both a start image and a conditioning audio track, and use the same caption style and trigger phrase you trained with.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/a2v_groups.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "a person speaking to camera",
      "image_url": "https://example.com/face.png",
      "audio_url": "https://example.com/speech.wav"
    }
  ]
}
```
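A configuration like this can be checked locally against the documented ranges before it is submitted. The sketch below is a hypothetical pre-flight helper built only from the parameter reference in this document; it is not the service's validator and covers only a subset of fields.

```python
RANKS = (8, 16, 32, 64, 128)
RESOLUTIONS = ("low", "medium", "high")
ASPECT_RATIOS = ("16:9", "1:1", "9:16")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the checked fields look valid."""
    errors = []
    if not config.get("training_data_url"):
        errors.append("training_data_url is required")
    if config.get("rank", 32) not in RANKS:
        errors.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= config.get("number_of_steps", 2000) <= 20000:
        errors.append("number_of_steps must be in 100-20000")
    if not 9 <= config.get("number_of_frames", 89) <= 121:
        errors.append("number_of_frames must be in 9-121")
    if not 8 <= config.get("frame_rate", 24) <= 60:
        errors.append("frame_rate must be in 8-60")
    if config.get("resolution", "medium") not in RESOLUTIONS:
        errors.append("resolution must be low, medium, or high")
    if config.get("aspect_ratio", "1:1") not in ASPECT_RATIOS:
        errors.append("aspect_ratio must be 16:9, 1:1, or 9:16")
    samples = config.get("validation", [])
    if len(samples) > 2:
        errors.append("validation accepts at most 2 entries")
    for i, sample in enumerate(samples):
        for key in ("image_url", "audio_url"):
            if not sample.get(key):
                errors.append(f"validation[{i}] is missing {key}")
    return errors


recommended = {
    "training_data_url": "https://example.com/a2v_groups.zip",
    "rank": 32,
    "number_of_steps": 2000,
    "number_of_frames": 89,
    "frame_rate": 24,
    "resolution": "medium",
    "aspect_ratio": "1:1",
    "validation": [{
        "prompt": "a person speaking to camera",
        "image_url": "https://example.com/face.png",
        "audio_url": "https://example.com/speech.wav",
    }],
}
print(validate_config(recommended))
```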
Diagnosing Issues
- Motion does not match the audio: ensure your training pairs genuinely correspond; try lowering `learning_rate` if motion looks unstable, or add more well-matched data.
- Color drift / artifacts over time: lower the `learning_rate` and/or add more varied, clean data.
- Overfitting: fewer steps, lower `rank`, more groups.
- Dataset errors: each group needs a start image/video, an audio source, and an `_end` target with matching base names.
Validation Prompt Tips
- Provide a fresh start image and audio track (not from training) to gauge generalization.
- Make the audio length reasonable relative to `validation_number_of_frames`/`validation_frame_rate`.
Common Pitfalls
- Missing audio source for a group (no `_audio` file and no audio track in the start video).
- Start image that does not match the target footage style.
- Forgetting to supply both image and audio at inference.