fal-ai/ltx23-trainer-v2/a2a

Train a LoRA that transforms one audio clip into another, learning a reference→target mapping from paired audio examples.
Training
Commercial use

The cost of training depends on the number of steps. The formula is: 0.0023 * steps. With 1000 steps, your request will cost $2.30.

LTX 2.3 Trainer — Audio-to-Audio (`/a2a`)

Overview

The `/a2a` endpoint trains a LoRA for the LTX 2.3 model that performs an audio → audio transformation. It is conditioned on a reference audio clip and learns to produce a corresponding target audio clip. You provide pairs of audio — a reference and a target — and the LoRA learns the mapping. This is audio-only: there is no video modality.

Use this endpoint for control-driven audio transformations: style transfer, timbre/character changes, denoising/restoration, or any reference-to-audio mapping you can demonstrate with paired clips.

Key features:

  • Trains an audio LoRA from paired reference→target audio clips.
  • Audio-only — no video is processed or generated.
  • Fixed audio length bucket via `audio_duration_seconds`.
  • Validation previews transform a supplied reference audio clip; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of paired audio clips:

  • `<name>_start.<ext>` — the reference audio (the input to transform).
  • `<name>_end.<ext>` — the target audio (the desired output).
  • `<name>.txt` — optional caption for the pair.

Audio formats: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`. Each `_end` must have a matching `_start`. File names must be unique across the archive. Aim for at least 10 pairs.

Minimum clip length (per pair): at least one pair must have both its reference (`_start`) and target (`_end`) clip at least `audio_duration_seconds` long. A pair with either side too short is skipped, and if no pair qualifies the request fails with a 422.

Example layout:

sample01_start.wav   sample01_end.wav   sample01.txt
sample02_start.mp3   sample02_end.wav   sample02.txt

If your clips are short, lower `audio_duration_seconds` so that at least one pair fills the audio bucket; otherwise the request is rejected with HTTP 422.
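
The naming rules above can be checked locally before uploading. The following is a minimal sketch, not the endpoint's own preprocessing: `check_pairs` is a hypothetical helper, and the way it treats hidden files and unsuffixed audio is an assumption based on the rules described in this section.

```python
import zipfile
from pathlib import PurePosixPath

# Audio extensions listed in this guide.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def check_pairs(zip_source):
    """Return (pairs, problems) for a zip of <name>_start / <name>_end clips.

    Hypothetical local pre-check; the service may apply additional rules.
    """
    starts, ends, problems = {}, {}, []
    with zipfile.ZipFile(zip_source) as zf:
        for entry in zf.namelist():
            path = PurePosixPath(entry)
            # Skip directories, macOS metadata, and hidden files.
            if entry.endswith("/") or "__MACOSX" in path.parts or path.name.startswith("."):
                continue
            if path.suffix.lower() not in AUDIO_EXTS:
                continue  # captions and other files are not checked here
            stem = path.stem
            for suffix, bucket in (("_start", starts), ("_end", ends)):
                if stem.endswith(suffix):
                    name = stem[: -len(suffix)]
                    if name in bucket:
                        problems.append(f"duplicate {suffix} clip for '{name}'")
                    bucket[name] = entry
                    break
            else:
                problems.append(f"audio file without _start/_end suffix: {entry}")
    # Every target needs a reference.
    for name in sorted(ends.keys() - starts.keys()):
        problems.append(f"'{name}' has an _end clip but no _start clip")
    pairs = sorted(starts.keys() & ends.keys())
    if len(pairs) < 10:
        problems.append(f"only {len(pairs)} complete pairs; this guide suggests at least 10")
    return pairs, problems
```

Calling `check_pairs("a2a_pairs.zip")` returns the sorted pair names and a list of human-readable problems; an empty problem list means the archive follows the naming rules, not that the clips are long enough.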

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of `_start`/`_end` audio pairs.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

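LoRA rank, which sets the adapter's capacity. Raise it if the transformation comes out too weak; lower it if the LoRA overfits.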

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Audio Configuration
`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5` to `60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed. The check is per pair: at least one pair must have both its `_start` and `_end` clip fill the bucket — a pair with either side too short is dropped, and if no pair qualifies the request fails with a 422. Audio-only modes need this because there is no video frame count to derive a length from.
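
The per-pair bucket check can be approximated offline. This is a sketch under stated assumptions: `wav_seconds` and `usable_pairs` are hypothetical helpers, only `.wav` files are measured (using the standard-library `wave` module), and the service's own duration measurement may differ slightly.

```python
import wave


def wav_seconds(source):
    """Duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def usable_pairs(durations, audio_duration_seconds=5.0):
    """Keep pairs whose reference AND target both fill the bucket.

    `durations` maps a pair name to (start_seconds, end_seconds).
    Raises ValueError when nothing qualifies, mirroring the documented 422.
    """
    kept = [
        name
        for name, (start_s, end_s) in durations.items()
        if start_s >= audio_duration_seconds and end_s >= audio_duration_seconds
    ]
    if not kept:
        raise ValueError("no pair fills the bucket; the request would be rejected (HTTP 422)")
    return kept
```

For example, with a 5-second bucket, `usable_pairs({"a": (6.0, 5.0), "b": (6.0, 4.9)})` keeps only `"a"`, because the target clip of `"b"` is too short.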

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration (instead of trimming/padding).

Dataset Processing
`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `reference_audio_url` (`string`, required) — the reference audio that conditions the generated audio.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields (`number_of_frames`, `resolution`, `validation_resolution`, etc.), but they have no effect — no video is processed. You can safely leave them at their defaults.
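
Because validation failures are rejected with a 422, it can be convenient to check a request against the documented ranges before sending it. The sketch below encodes only the limits listed in this reference; `validate_request` is a hypothetical helper, and the service may enforce rules that are not captured here.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}

# field -> (minimum, maximum, default), taken from the parameter reference above.
RANGES = {
    "number_of_steps": (100, 20000, 2000),
    "audio_duration_seconds": (0.5, 60.0, 5.0),
    "split_input_duration_threshold": (1.0, 60.0, 30.0),
    "validation_frame_rate": (8, 60, 24),
    "stg_scale": (0.0, 3.0, 1.0),
}


def validate_request(req):
    """Return a list of problems found in a request dict (empty if none)."""
    errors = []
    if not req.get("training_data_url"):
        errors.append("training_data_url is required")
    rank = req.get("rank", 32)
    if rank not in ALLOWED_RANKS:
        errors.append(f"rank={rank} is not one of {sorted(ALLOWED_RANKS)}")
    for field, (low, high, default) in RANGES.items():
        value = req.get(field, default)
        if not low <= value <= high:
            errors.append(f"{field}={value} is outside {low}..{high}")
    validation = req.get("validation", [])
    if len(validation) > 2:
        errors.append("validation accepts at most 2 entries")
    for index, sample in enumerate(validation):
        if not sample.get("reference_audio_url"):
            errors.append(f"validation[{index}] is missing reference_audio_url")
    return errors
```

An empty list means the request is within the documented limits; it does not guarantee that the dataset itself will pass preprocessing.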

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `audio` — a combined preview of the generated validation audio.
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
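
As a worked example of the billing rule, the sketch below combines the billable-unit formula with the per-step price of 0.0023 quoted in this page's pricing note. Treating that price as the cost of one billable unit is an assumption, and the function names are illustrative.

```python
# Price quoted in the pricing note; assumed here to apply per billable unit.
PRICE_PER_UNIT_USD = 0.0023


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in USD, rounded to avoid floating-point noise."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 4)
```

Under these assumptions, `estimated_cost_usd(1000)` gives 2.3, matching the $2.30 example in the pricing note, and a request that fails validation costs nothing.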

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, `_start`/`_end` audio pairs and captions matched, and each clip fit to the `audio_duration_seconds` bucket.
  2. Training — the LoRA trains for `number_of_steps`, conditioned on the reference audio, learning to produce the target audio. Validation previews run at intervals.
  3. Output — the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. A target missing its reference causes a clear error.
  • Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (clips shorter than the bucket are skipped; longer ones trimmed; normalized and pitch-fit as configured).
  • Captions: the trigger phrase (if set) is prepended.
How It Works

The LoRA performs a reference-conditioned transformation: rather than generating from text alone, the model conditions on a reference audio clip supplied at inference and produces transformed audio. The trained file specializes the base model on your reference→target mapping. At inference you supply a reference audio clip plus a prompt.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 well-matched reference→target audio pairs; the two should differ only by the transformation you want learned.
  • Keep recordings clean and consistent in level (normalization helps, but garbage in means garbage out).
  • Choose an `audio_duration_seconds` that fits most of your clips so few are skipped.
Caption Best Practices
  • Describe the target audio plainly, optionally with a trigger phrase.
  • Keep captions consistent in style so the LoRA associates the transformation, not the wording.

Good caption: `tonech4nge a warm vintage-radio version of the voice`
Weak caption: `voice`

Trigger Phrases
  • A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.
Inference Format Matching

At inference, supply a reference audio clip, use the same trigger phrase and caption style, and keep clip lengths in line with `audio_duration_seconds`.

Example request (JSON):

{
  "training_data_url": "https://example.com/a2a_pairs.zip",
  "trigger_phrase": "tonech4nge",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "tonech4nge", "reference_audio_url": "https://example.com/ref.wav" }
  ]
}
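
A request like the one above could be submitted from Python roughly as follows. This is a sketch under assumptions: it relies on the third-party `fal_client` package (with a `FAL_KEY` credential in the environment) and takes the endpoint ID from this page's title; `build_request` and `submit` are illustrative helpers, not part of any official API.

```python
ENDPOINT_ID = "fal-ai/ltx23-trainer-v2/a2a"  # taken from this page's title


def build_request(training_data_url, reference_audio_url, trigger_phrase="tonech4nge"):
    """Assemble a training request using the defaults shown in this guide."""
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "audio_duration_seconds": 5.0,
        "validation": [
            {"prompt": trigger_phrase, "reference_audio_url": reference_audio_url}
        ],
    }


def submit(request):
    """Send the request and block until training finishes (needs network access)."""
    import fal_client  # assumed dependency: pip install fal-client

    return fal_client.subscribe(ENDPOINT_ID, arguments=request, with_logs=True)
```

Calling `submit(build_request("https://example.com/a2a_pairs.zip", "https://example.com/ref.wav"))` should return the result described under Outputs once training completes; the exact shape of that result is not specified here.
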
Diagnosing Issues
  • Transformation too weak: more steps, higher `rank`, or more consistent pairs.
  • Many clips skipped: lower `audio_duration_seconds` to match your clips.
  • Output ignores the reference: ensure pairs differ only by the intended transformation; keep captions consistent.
  • Overfitting: fewer steps, lower `rank`, more pairs.
Validation Prompt Tips
  • Use a reference clip the LoRA has not seen to gauge generalization.
  • Match the caption style and trigger phrase you trained with.
Common Pitfalls
  • Missing or mismatched `_start`/`_end` pairs.
  • `audio_duration_seconds` set so long that most clips are skipped.
  • Reference/target pairs that differ by more than the intended transformation.