fal-ai/ltx23-trainer-v2/a2a
The cost of training depends on the number of steps. The formula is: 0.0023 * steps. With 1000 steps, your request will cost $2.30.
LTX 2.3 Trainer — Audio-to-Audio (`/a2a`)
Overview
The `/a2a` endpoint trains a LoRA for the LTX 2.3 model that performs an audio → audio transformation. It is conditioned on a reference audio clip and learns to produce a corresponding target audio clip. You provide pairs of audio — a reference and a target — and the LoRA learns the mapping. This is audio-only: there is no video modality.
Use this endpoint for control-driven audio transformations: style transfer, timbre/character changes, denoising/restoration, or any reference-to-audio mapping you can demonstrate with paired clips.
Key features:
- Trains an audio LoRA from paired reference→target audio clips.
- Audio-only — no video is processed or generated.
- Fixed audio length bucket via `audio_duration_seconds`.
- Validation previews transform a supplied reference audio clip; the output preview is audio.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of paired audio clips:
- `<name>_start.<ext>`: the reference audio (the input to transform).
- `<name>_end.<ext>`: the target audio (the desired output).
- `<name>.txt`: optional caption for the pair.
Audio formats: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`. Each `_end` must have a matching `_start`. File names must be unique across the archive. Aim for at least 10 pairs.
Minimum clip length (per pair): at least one pair must have both its reference (`_start`) and target (`_end`) clip at least `audio_duration_seconds` long. A pair with either side too short is skipped, and if no pair qualifies the request fails with a 422.
Example layout:
```
sample01_start.wav
sample01_end.wav
sample01.txt
sample02_start.mp3
sample02_end.wav
sample02.txt
```
If your clips are short, lower `audio_duration_seconds` so that at least one pair fills the audio bucket.
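The naming rules above can be checked locally before uploading. The following is a minimal sketch, not the service's own validation: the helper names `check_archive` and `check_zip` are hypothetical, and it assumes "unique file names" means unique base names regardless of folder.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Audio formats listed in the dataset format section.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def check_archive(names):
    """Return a list of problems found in a list of archive member names."""
    problems = []
    # Assumption: uniqueness is judged on the base name, ignoring folders.
    counts = Counter(PurePosixPath(n).name for n in names if not n.endswith("/"))
    starts, ends = set(), set()
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate file name: {name}")
        path = PurePosixPath(name)
        stem, ext = path.stem, path.suffix.lower()
        for suffix, found in (("_start", starts), ("_end", ends)):
            if stem.endswith(suffix):
                if ext in AUDIO_EXTS:
                    found.add(stem[: -len(suffix)])
                else:
                    problems.append(f"not a listed audio format: {name}")
    # Every target needs a reference with the same <name>.
    for pair in sorted(ends - starts):
        problems.append(f"{pair}_end has no matching {pair}_start")
    return problems


def check_zip(path):
    """Run the same checks on a .zip file on disk."""
    with zipfile.ZipFile(path) as archive:
        return check_archive(archive.namelist())
```

An empty list means the names follow the documented layout; it says nothing about clip length, which depends on `audio_duration_seconds`.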
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of `_start`/`_end` audio pairs.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Audio Configuration
`audio_duration_seconds`
Type: `number`
Default: `5.0` (range `0.5`–`60.0`)
Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed. The check is per pair: at least one pair must have both its `_start` and `_end` clip fill the bucket — a pair with either side too short is dropped, and if no pair qualifies the request fails with a 422. Audio-only modes need this because there is no video frame count to derive a length from.
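To see how many pairs would survive a given bucket, the per-pair rule can be sketched as below. This is an illustration under stated assumptions, not the trainer's code: `usable_pairs` and `wav_duration_seconds` are hypothetical helpers, and the duration helper only reads PCM `.wav` files through the standard `wave` module.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM .wav file in seconds (other formats need another reader)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def usable_pairs(durations, audio_duration_seconds=5.0):
    """Return the names of pairs whose reference AND target fill the bucket.

    `durations` maps a pair name to (start_seconds, end_seconds).
    """
    return sorted(
        name
        for name, (start_seconds, end_seconds) in durations.items()
        if start_seconds >= audio_duration_seconds
        and end_seconds >= audio_duration_seconds
    )


if __name__ == "__main__":
    clips = {"sample01": (6.0, 5.0), "sample02": (4.9, 10.0)}
    kept = usable_pairs(clips, audio_duration_seconds=5.0)
    # An empty result corresponds to the documented 422 rejection.
    print(kept or "no pair fills the bucket; lower audio_duration_seconds")
```

Running this over your dataset with a few candidate values is a quick way to pick a bucket that keeps most pairs.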
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness across the dataset.
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to the target duration (instead of trimming/padding).
Dataset Processing
`split_input_into_scenes`
Type: `boolean`
Default: `false`
Off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration threshold for scene splitting (only relevant if splitting were enabled).
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `reference_audio_url` (`string`, required): the reference audio that conditions the generated audio.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Used together with the audio bucket to size the preview audio length.
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Note: this audio-only mode also accepts the shared video/validation-video fields (`number_of_frames`, `resolution`, `validation_resolution`, etc.), but they have no effect because no video is processed. You can safely leave them at their defaults.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `audio`: a combined preview of the generated validation audio.
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed as `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
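As a rough worked example of the billing rule, the sketch below combines it with the $0.0023-per-step price quoted at the top of this page. Treating one billable unit as one step at that price is an assumption; the function names are hypothetical.

```python
# Assumption: one billable unit costs the per-step price quoted on this page.
PRICE_PER_UNIT_USD = 0.0023
MIN_BILLABLE_UNITS = 100


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(MIN_BILLABLE_UNITS, number_of_steps)


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in US dollars, rounded to cents."""
    return round(PRICE_PER_UNIT_USD * billable_units(number_of_steps, succeeded), 2)


if __name__ == "__main__":
    for steps in (1000, 2000):
        print(steps, "steps ->", estimated_cost_usd(steps), "USD")
```

With these assumptions, 1000 steps comes to $2.30, which matches the figure in the pricing note.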
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, `_start`/`_end` audio pairs and captions matched, and each clip fit to the `audio_duration_seconds` bucket.
- Training: the LoRA trains for `number_of_steps`, conditioned on the reference audio, learning to produce the target audio. Validation previews run at intervals.
- Output: the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. A target missing its reference causes a clear error.
- Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (clips shorter than the bucket are skipped; longer ones trimmed; normalized and pitch-fit as configured).
- Captions: the trigger phrase (if set) is prepended.
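The preprocessing steps listed above can be pictured with a small sketch. This is not the service's implementation: every function name is hypothetical, and it assumes that trimming keeps the start of the clip and that the trigger phrase is joined to the caption with a single space.

```python
from pathlib import PurePosixPath


def is_ignored(name):
    """True for macOS metadata and hidden files (e.g. __MACOSX/, .DS_Store)."""
    parts = PurePosixPath(name).parts
    return any(part == "__MACOSX" or part.startswith(".") for part in parts)


def build_caption(caption, trigger_phrase=""):
    """Prepend the trigger phrase, if one is set (separator is an assumption)."""
    caption = caption.strip()
    if not trigger_phrase:
        return caption
    return f"{trigger_phrase} {caption}".strip()


def fit_to_bucket(samples, sample_rate, audio_duration_seconds):
    """Skip clips shorter than the bucket; trim longer ones.

    Returns None for a skipped clip. Keeping the first N samples is an
    assumption; the documentation only says longer clips are trimmed.
    """
    needed = int(round(audio_duration_seconds * sample_rate))
    if len(samples) < needed:
        return None
    return samples[:needed]
```

Normalization and pitch-preserving fitting are left out here because the page does not describe how they are computed.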
How It Works
The trained LoRA performs a reference-conditioned transformation: rather than generating from text alone, it conditions on a reference audio clip supplied at inference and produces transformed audio. The LoRA file specializes the base model on your reference→target mapping. At inference you supply a reference audio clip plus a prompt.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 well-matched reference→target audio pairs; the two should differ only by the transformation you want learned.
- Keep recordings clean and consistent in level (normalization helps, but garbage in means garbage out).
- Choose an `audio_duration_seconds` that fits most of your clips so few are skipped.
Caption Best Practices
- Describe the target audio plainly, optionally with a trigger phrase.
- Keep captions consistent in style so the LoRA associates the transformation, not the wording.
Good caption: `tonech4nge a warm vintage-radio version of the voice`
Weak caption: `voice`
Trigger Phrases
- A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.
Inference Format Matching
At inference, supply a reference audio clip, use the same trigger phrase and caption style, and keep clip lengths in line with `audio_duration_seconds`.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/a2a_pairs.zip",
  "trigger_phrase": "tonech4nge",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    {
      "prompt": "tonech4nge",
      "reference_audio_url": "https://example.com/ref.wav"
    }
  ]
}
```
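The same configuration can be assembled and range-checked in Python before it is sent. This is a sketch under assumptions: `build_request` and `submit` are hypothetical helpers, the ranges are copied from the parameter reference on this page, and `submit` assumes the third-party `fal-client` package is installed, a `FAL_KEY` is configured, and the endpoint ID shown at the top of this page is the application ID.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def build_request(training_data_url, trigger_phrase="", rank=32,
                  number_of_steps=2000, learning_rate=0.0002,
                  audio_duration_seconds=5.0, validation=()):
    """Build the request body, rejecting values outside the documented ranges."""
    if rank not in ALLOWED_RANKS:
        raise ValueError(f"rank must be one of {sorted(ALLOWED_RANKS)}")
    if not 100 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be between 100 and 20000")
    if not 0.5 <= audio_duration_seconds <= 60.0:
        raise ValueError("audio_duration_seconds must be between 0.5 and 60.0")
    validation = list(validation)
    if len(validation) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for sample in validation:
        if not sample.get("reference_audio_url"):
            raise ValueError("each validation entry needs reference_audio_url")
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": rank,
        "number_of_steps": number_of_steps,
        "learning_rate": learning_rate,
        "audio_duration_seconds": audio_duration_seconds,
        "validation": validation,
    }


def submit(arguments):
    """Send the request with fal-client (assumed installed and authenticated)."""
    import fal_client  # third-party package, imported lazily on purpose

    return fal_client.subscribe("fal-ai/ltx23-trainer-v2/a2a", arguments=arguments)
```

Checking ranges locally catches simple mistakes early, although the page states that requests rejected with a 422 are billed 0 units anyway.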
Diagnosing Issues
- Transformation too weak: more steps, higher `rank`, or more consistent pairs.
- Many clips skipped: lower `audio_duration_seconds` to match your clips.
- Output ignores the reference: ensure pairs differ only by the intended transformation; keep captions consistent.
- Overfitting: fewer steps, lower `rank`, more pairs.
Validation Prompt Tips
- Use a reference clip the LoRA has not seen to gauge generalization.
- Match the caption style and trigger phrase you trained with.
Common Pitfalls
- Missing or mismatched `_start`/`_end` pairs.
- `audio_duration_seconds` set so long that most clips are skipped.
- Reference/target pairs that differ by more than the intended transformation.