fal-ai/ltx23-trainer-v2/ic-lora/a2a

Train an IC-LoRA that transforms one audio clip into another, conditioned at inference on a reference audio clip.

The cost of training depends on the number of steps. The formula is: 0.0023 * steps. With 1000 steps, your request will cost $2.30.

LTX 2.3 Trainer — Audio-to-Audio IC-LoRA (`/ic-lora/a2a`)

Overview

The `/ic-lora/a2a` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 model that performs an audio → audio transformation. An IC-LoRA is a small adapter that does not generate from text alone — instead it conditions on a reference audio clip supplied at inference and learns to produce a corresponding target audio clip. You teach it that mapping by providing pairs of audio — a reference and a target — and the LoRA learns to go from one to the other. This is audio-only: there is no video modality.

Use this endpoint for control-driven audio transformations, for example:

  • Style transfer between two kinds of sound.
  • Timbre or character changes (e.g. making a voice sound like a vintage radio).
  • Denoising or restoration demonstrated with clean/noisy pairs.
  • Any reference-to-audio mapping you can demonstrate with paired clips.

Key features:

  • Trains an audio IC-LoRA from paired reference→target audio clips.
  • Audio-only — no video is processed or generated.
  • Fixed audio length bucket via `audio_duration_seconds`.
  • Validation previews transform a supplied reference audio clip; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of paired audio clips:

  • `<name>_start.<ext>` — the reference audio (the input to transform).
  • `<name>_end.<ext>` — the target audio (the desired output).
  • `<name>.txt` — optional caption for the pair.

Audio formats: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`. Each `_end` must have a matching `_start`. File names must be unique across the archive. Aim for at least 10 pairs.

Minimum clip duration (per pair): at least one pair must have both its reference (`_start`) and target (`_end`) clip at least `audio_duration_seconds` long. A pair with either side too short is skipped, and if no pair qualifies the request fails with a 422.

Example layout:

sample01_start.wav   sample01_end.wav   sample01.txt
sample02_start.mp3   sample02_end.wav   sample02.txt

If most of your clips are short, lower `audio_duration_seconds` so that more pairs fill the audio bucket.
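The naming rules above can be checked locally before uploading. The following is a minimal sketch, not the trainer's own matcher: the function name `match_pairs` is hypothetical, and silently dropping a `_start` that has no `_end` is an assumption, since the page only states that every `_end` needs a `_start`.

```python
from pathlib import PurePosixPath

# Audio extensions listed in the dataset format section.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def match_pairs(filenames):
    """Group archive member names into {name: {"start", "end", "caption"}}.

    Raises ValueError if a target (_end) has no matching reference (_start).
    Hypothetical helper for local checks only.
    """
    pairs = {}
    for fname in filenames:
        path = PurePosixPath(fname)
        stem, ext = path.stem, path.suffix.lower()
        if ext in AUDIO_EXTS and stem.endswith("_start"):
            pairs.setdefault(stem[: -len("_start")], {})["start"] = fname
        elif ext in AUDIO_EXTS and stem.endswith("_end"):
            pairs.setdefault(stem[: -len("_end")], {})["end"] = fname
        elif ext == ".txt":
            pairs.setdefault(stem, {})["caption"] = fname

    orphans = sorted(n for n, p in pairs.items() if "end" in p and "start" not in p)
    if orphans:
        raise ValueError(f"targets without a reference: {orphans}")

    # Keep only complete pairs; captions are optional.
    # Assumption: a _start without an _end is simply ignored.
    return {n: p for n, p in pairs.items() if "start" in p and "end" in p}
```

Running it on the example layout returns two pairs, `sample01` and `sample02`, each with its caption file attached.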

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of `_start`/`_end` audio pairs.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

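LoRA rank, which sets the adapter's capacity. Higher values can capture more complex transformations but overfit more easily on small datasets.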

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Audio Configuration
`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5` to `60.0`)

Target audio clip length in seconds (the audio duration bucket). Each clip is bucketed independently — clips shorter than this are skipped; longer clips are trimmed — and the reference (`_start`) and target (`_end`) are then paired by row, so a pair is usable only when both sides fill the bucket. At least one pair must qualify or the request fails with a 422. Audio-only modes need this because there is no video frame count to derive a length from.
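The bucket rule can be expressed as a small filter. This is a sketch under the assumption that you have already measured each clip's duration yourself; `usable_pairs` is a hypothetical helper, and the `ValueError` only mirrors the documented HTTP 422 rejection locally.

```python
def usable_pairs(pair_durations, audio_duration_seconds=5.0):
    """Return the names of pairs whose reference AND target fill the bucket.

    pair_durations maps a pair name to (start_seconds, end_seconds).
    Raises ValueError when no pair qualifies, mirroring the documented
    HTTP 422 rejection (the real check happens server-side).
    """
    usable = [
        name
        for name, (start_s, end_s) in pair_durations.items()
        if start_s >= audio_duration_seconds and end_s >= audio_duration_seconds
    ]
    if not usable:
        raise ValueError("no pair fills the audio bucket")
    return usable
```

For example, with a 5.0 s bucket a pair of 6.0 s and 5.0 s clips is usable, while a pair of 4.9 s and 10.0 s clips is skipped because its reference is too short.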

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
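For readers unfamiliar with the term, peak normalization scales a clip so that its loudest sample reaches a fixed level. The sketch below illustrates the general idea on a list of float samples; the target peak of 1.0 and the function name are assumptions, and the trainer's actual target level is not documented here.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    Illustrative only: the trainer's real target level is not documented.
    Silent or empty input is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

A clip peaking at 0.5 is scaled by a gain of 2, so two recordings made at different levels end up with the same peak.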

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration (instead of trimming/padding).

Dataset Processing
`split_input_into_scenes`

Type: `boolean` Default: `false`

Off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration threshold for scene splitting (only relevant if splitting were enabled).

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `reference_audio_url` (`string`, required) — the reference audio that conditions the generated audio.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields (`number_of_frames`, `resolution`, `validation_resolution`, etc.), but they have no effect — no video is processed. You can safely leave them at their defaults.
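The documented ranges can be checked client-side before submitting, which avoids a round trip that ends in a 422. This is a sketch only: `check_request` is a hypothetical helper that covers just the limits listed in this reference, and the server remains the authority on validation.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def check_request(req):
    """Return a list of problems; an empty list means the documented limits are met.

    Hypothetical pre-flight check based only on the ranges listed in this reference.
    """
    problems = []
    if not req.get("training_data_url"):
        problems.append("training_data_url is required")
    if req.get("rank", 32) not in ALLOWED_RANKS:
        problems.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= req.get("number_of_steps", 2000) <= 20000:
        problems.append("number_of_steps must be within 100 to 20000")
    if not 0.5 <= req.get("audio_duration_seconds", 5.0) <= 60.0:
        problems.append("audio_duration_seconds must be within 0.5 to 60.0")
    if not 8 <= req.get("validation_frame_rate", 24) <= 60:
        problems.append("validation_frame_rate must be within 8 to 60")
    if not 0.0 <= req.get("stg_scale", 1.0) <= 3.0:
        problems.append("stg_scale must be within 0.0 to 3.0")

    validation = req.get("validation", [])
    if len(validation) > 2:
        problems.append("validation accepts at most 2 entries")
    for i, sample in enumerate(validation):
        if not sample.get("reference_audio_url"):
            problems.append(f"validation[{i}] is missing reference_audio_url")
    return problems
```

Missing keys fall back to the documented defaults, so a request that only sets `training_data_url` passes the check.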

Outputs

  • `lora_file` — the trained IC-LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `audio` — a combined preview of the generated validation audio.
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed for `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
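As a worked example, the billing rule can be combined with the per-step rate quoted at the top of this page (0.0023 per step). Treating that rate as the price of one billable unit is an assumption, and both function names are hypothetical; check your account's pricing for the authoritative figure.

```python
from decimal import Decimal

# Assumption: the per-step rate quoted on this page is the price of one billable unit.
PRICE_PER_UNIT = Decimal("0.0023")


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated cost in USD under the assumed per-unit price."""
    return PRICE_PER_UNIT * billable_units(number_of_steps, succeeded)
```

Under these assumptions a 1000-step run costs $2.30, matching the example on this page, and the default 2000-step run costs $4.60. `Decimal` is used so the amounts are not affected by binary floating-point rounding.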

How the Training Works

Pipeline Overview
  1. Preprocessing: the archive is extracted, `_start`/`_end` audio pairs and captions are matched, and each clip is fit to the `audio_duration_seconds` bucket.
  2. Training: the IC-LoRA trains for `number_of_steps`, conditioned on the reference audio and learning to produce the target audio. Validation previews run at intervals.
  3. Output: the IC-LoRA weights, the config, and the combined audio preview are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. A target without a matching reference causes the request to fail with an error.
  • Audio fitting: each clip is fit to the `audio_duration_seconds` bucket. Clips shorter than the bucket are skipped and longer ones are trimmed; normalization and pitch preservation are applied as configured.
  • Captions: the trigger phrase (if set) is prepended.
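Two of the steps above, skipping macOS metadata and hidden files and prepending the trigger phrase, can be approximated locally to preview what the trainer will see. This is a sketch: the exact filtering rules and the separator between trigger phrase and caption are not documented, so the `__MACOSX`/dotfile test and the single space used here are assumptions, as are both function names.

```python
from pathlib import PurePosixPath


def visible_members(names):
    """Drop directory entries, __MACOSX metadata, and dot-prefixed (hidden) paths.

    Assumption: this approximates what the trainer ignores during extraction.
    """
    keep = []
    for name in names:
        if name.endswith("/"):  # directory entry inside the zip listing
            continue
        parts = PurePosixPath(name).parts
        if any(p == "__MACOSX" or p.startswith(".") for p in parts):
            continue
        keep.append(name)
    return keep


def apply_trigger(caption, trigger_phrase=""):
    """Prepend the trigger phrase to a caption (single-space separator assumed)."""
    caption = caption.strip()
    if not trigger_phrase:
        return caption
    return f"{trigger_phrase} {caption}".strip()
```

Pass the output of `zipfile.ZipFile(path).namelist()` to `visible_members` to list the files that would take part in pair matching.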
What an IC-LoRA Is

An IC-LoRA performs an in-context transformation: rather than generating from text alone, it conditions on a reference clip supplied at inference and produces a transformed result. Here the result is audio derived from a reference audio clip. The trained file specializes the base model for your reference→target audio mapping. At inference you supply a reference audio clip plus a prompt, and the LoRA produces the corresponding target audio.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 well-matched reference→target audio pairs; the two should differ only by the transformation you want learned.
  • Keep recordings clean and consistent in level (normalization helps, but garbage in means garbage out).
  • Choose an `audio_duration_seconds` that fits most of your clips so few are skipped.
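One way to pick `audio_duration_seconds` is to measure how many pairs survive at each candidate value. The sketch below assumes you already have per-pair durations; the 80% keep threshold, the 0.5 s candidate grid, and both function names are illustrative choices, not part of the trainer.

```python
def kept_fraction(pair_durations, bucket):
    """Fraction of pairs whose shorter side is at least `bucket` seconds."""
    if not pair_durations:
        return 0.0
    kept = sum(1 for start_s, end_s in pair_durations if min(start_s, end_s) >= bucket)
    return kept / len(pair_durations)


def suggest_bucket(pair_durations, min_keep=0.8, candidates=None):
    """Largest candidate bucket that keeps at least `min_keep` of the pairs, or None.

    Candidates default to 0.5 s steps across the documented 0.5 to 60.0 range.
    """
    if candidates is None:
        candidates = [x / 2 for x in range(1, 121)]
    ok = [c for c in candidates if kept_fraction(pair_durations, c) >= min_keep]
    return max(ok) if ok else None
```

Choosing the largest bucket that still keeps most pairs gives the model the longest context without discarding much of the dataset; lower `min_keep` if a few long, high-quality pairs matter more than quantity.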
Caption Best Practices
  • Describe the target audio plainly, optionally with a trigger phrase.
  • Keep captions consistent in style so the LoRA associates the transformation, not the wording.

Good caption: `tonech4nge a warm vintage-radio version of the voice`
Weak caption: `voice`

Trigger Phrases
  • A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.
Inference Format Matching

At inference, supply a reference audio clip, use the same trigger phrase and caption style, and keep clip lengths in line with `audio_duration_seconds`.

```json
{
  "training_data_url": "https://example.com/a2a_pairs.zip",
  "trigger_phrase": "tonech4nge",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "tonech4nge", "reference_audio_url": "https://example.com/ref.wav" }
  ]
}
```
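The same training request can be submitted from Python. This is a minimal sketch assuming the `fal-client` package is installed and a `FAL_KEY` is set in the environment; `build_request` and `submit` are hypothetical helper names, the URLs are the placeholder values from the example request, and the shape of the returned result is not assumed here.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/ic-lora/a2a"


def build_request(training_data_url, trigger_phrase, reference_audio_url, steps=2000):
    """Assemble the arguments shown in the example request on this page."""
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": steps,
        "learning_rate": 0.0002,
        "audio_duration_seconds": 5.0,
        "validation": [
            {"prompt": trigger_phrase, "reference_audio_url": reference_audio_url}
        ],
    }


def submit(arguments):
    """Send the request and block until the training run finishes."""
    # Imported lazily so build_request works without the client installed.
    import fal_client  # pip install fal-client; reads FAL_KEY from the environment

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

Call `submit(build_request("https://example.com/a2a_pairs.zip", "tonech4nge", "https://example.com/ref.wav"))` once your archive is hosted at a reachable URL; inspect the returned object for the outputs listed in the Outputs section.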
Diagnosing Issues
  • Transformation too weak: more steps, higher `rank`, or more consistent pairs.
  • Many clips skipped: lower `audio_duration_seconds` to match your clips.
  • Output ignores the reference: ensure pairs differ only by the intended transformation; keep captions consistent.
  • Overfitting: fewer steps, lower `rank`, more pairs.
Validation Prompt Tips
  • Use a reference clip the LoRA has not seen to gauge generalization.
  • Match the caption style and trigger phrase you trained with.
Common Pitfalls
  • Missing or mismatched `_start`/`_end` pairs.
  • `audio_duration_seconds` set so long that most clips are skipped.
  • Reference/target pairs that differ by more than the intended transformation.