LTX 2.3 Trainer (V2) - Audio Inpainting (Training) API on fal

LTX 2.3 Trainer — Audio Inpainting (`/audio-inpaint`)

Overview

The `/audio-inpaint` endpoint trains a LoRA for the LTX 2.3 model that regenerates masked time spans of an audio clip while keeping the rest unchanged. Each training clip is paired with a list of time ranges marking the spans to regenerate; the model learns to fill those spans so they blend with the kept audio. At inference you supply an audio clip and one or more time ranges, and the model regenerates only those spans. The endpoint is audio-only: it neither processes nor generates video.

Key features:

Learns to regenerate masked time spans of audio while preserving the rest.
Time spans are specified per clip as a list of `[start, end]` second ranges.
Audio-only — no video is processed or generated.
Fixed audio length bucket via `audio_duration_seconds`.
Validation previews inpaint a supplied clip over supplied time ranges; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) where each example is an audio clip plus a JSON mask:

`<name>.<ext>` — the audio clip (`.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`).
`<name>_mask.json` — a JSON list of `[start, end]` second ranges to regenerate (the rest is kept), e.g. `[[1.0, 2.0], [3.5, 4.0]]`.
`<name>.txt` — optional caption.

Every clip needs a matching `<name>_mask.json`. The mask must mark at least one non-empty range inside the clip duration. File names must be unique across the archive. Aim for at least 10 examples.

Example layout:


```
clip01.wav   clip01_mask.json   clip01.txt
clip02.mp3   clip02_mask.json   clip02.txt
```
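A quick local check before uploading can catch unpaired or clashing files. The sketch below is one way to apply the naming rules above to a flat list of file names. The helper `find_problems` is hypothetical and not part of the fal API, and treating two clips with the same base name as a clash is an assumption about what "unique" means here.

```python
from pathlib import PurePosixPath

# Audio extensions listed in the dataset format above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}

def find_problems(filenames):
    """Return human-readable problems for a list of archive file names."""
    names = [PurePosixPath(f).name for f in filenames]
    present = set(names)
    clips = {}
    problems = []
    for name in names:
        path = PurePosixPath(name)
        if path.suffix.lower() in AUDIO_EXTS:
            if path.stem in clips:
                # Assumption: clips sharing a base name would share one mask file.
                problems.append(f"duplicate clip name: {path.stem}")
            clips[path.stem] = name
    for stem in sorted(clips):
        if f"{stem}_mask.json" not in present:
            problems.append(f"missing mask: {stem}_mask.json")
    return problems
```

An empty list means every clip found has a matching mask file; captions are optional, so they are not checked.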

At least one clip must fill the audio bucket. A clip shorter than `audio_duration_seconds` is skipped; if every clip is shorter, the request is rejected up front (HTTP 422). Lower `audio_duration_seconds` if your clips are short.
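To see in advance which clips would be skipped, the rule above can be sketched as a small function. It assumes you have already measured each clip's length in seconds; `split_by_bucket` is a hypothetical helper name, and keeping a clip whose length exactly equals the bucket follows from "shorter than" in the text.

```python
def split_by_bucket(durations, audio_duration_seconds=5.0):
    """Split clip names into (kept, skipped) for a given bucket length.

    durations maps clip name -> clip length in seconds.
    """
    kept = sorted(n for n, d in durations.items() if d >= audio_duration_seconds)
    skipped = sorted(n for n, d in durations.items() if d < audio_duration_seconds)
    if not kept:
        # Mirrors the up-front rejection described above (HTTP 422).
        raise ValueError("every clip is shorter than audio_duration_seconds")
    return kept, skipped
```

If the skipped list is long, lowering `audio_duration_seconds` is usually simpler than re-recording clips.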

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio + mask-JSON examples.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to every caption during training; include the same phrase in your prompts at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA rank. A higher rank gives the adapter more capacity; lower it if the model overfits.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

Audio Configuration

`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5`–`60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed. The mask time ranges are interpreted within this duration.
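Because longer clips are trimmed to the bucket, it can help to check what remains of each mask once it is restricted to that window. The sketch below clamps ranges to `[0, audio_duration_seconds]` and drops anything that becomes empty. How the trainer treats a range that straddles the end of the bucket is not documented here, so the clamping is an assumption, and `fit_ranges_to_bucket` is a hypothetical helper.

```python
def fit_ranges_to_bucket(ranges, audio_duration_seconds=5.0):
    """Clamp [start, end] ranges to the bucket and drop empty ones."""
    fitted = []
    for start, end in ranges:
        s = max(0.0, float(start))
        e = min(float(audio_duration_seconds), float(end))
        if s < e:  # assumption: a range with nothing left inside is discarded
            fitted.append([s, e])
    return fitted
```

If this returns an empty list for a clip, that clip's mask marks nothing inside the bucket.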

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
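Peak normalization in general scales a signal so that its largest absolute sample reaches a chosen level. A minimal sketch over a list of float samples follows; the target level of `1.0` is an assumption, since the level used by the trainer is not stated.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [x * gain for x in samples]
```

Every sample is multiplied by the same gain, so relative dynamics within a clip are unchanged.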

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`audio_url` (`string`, required) — the audio clip to inpaint.
`time_ranges` (`array`, required) — a list of `[start, end]` second ranges to regenerate, e.g. `[[1.0, 2.0]]`. The rest of the audio is kept.
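The shape of a validation sample can be checked locally before a request is sent. The sketch below enforces only what is documented above (required fields, at most 2 entries) plus `start < end` for each range; `make_validation` is a hypothetical helper, not part of the API.

```python
def make_validation(entries, max_entries=2):
    """Check validation samples against the documented shape and return copies."""
    if len(entries) > max_entries:
        raise ValueError(f"at most {max_entries} validation entries are allowed")
    checked = []
    for entry in entries:
        for key in ("audio_url", "time_ranges"):
            if not entry.get(key):
                raise ValueError(f"validation entry is missing required field: {key}")
        for start, end in entry["time_ranges"]:
            if not start < end:
                raise ValueError(f"empty or reversed range: [{start}, {end}]")
        checked.append(dict(entry))
    return checked
```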

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Used together with the audio bucket to size the preview audio length (and therefore where the time ranges land).

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.
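The documented ranges can be collected into a pre-flight check so that out-of-range values are caught before a request is made. This is a sketch only: `check_params` is a hypothetical helper, and treating the range endpoints as inclusive is an assumption.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}
RANGES = {
    "number_of_steps": (100, 20000),
    "audio_duration_seconds": (0.5, 60.0),
    "validation_frame_rate": (8, 60),
    "stg_scale": (0.0, 3.0),
}

def check_params(params):
    """Return a list of problems found in a request dictionary."""
    errors = []
    if not params.get("training_data_url"):
        errors.append("training_data_url is required")
    rank = params.get("rank", 32)
    if rank not in ALLOWED_RANKS:
        errors.append(f"rank must be one of {sorted(ALLOWED_RANKS)}, got {rank}")
    for name, (low, high) in RANGES.items():
        # Assumption: both endpoints of each documented range are allowed.
        if name in params and not low <= params[name] <= high:
            errors.append(f"{name} must be in [{low}, {high}], got {params[name]}")
    if len(params.get("validation", [])) > 2:
        errors.append("validation accepts at most 2 entries")
    return errors
```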

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`audio` — a combined preview of the generated validation audio.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed at `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed at 0 units.
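The billing rule can be written out directly. The function name is hypothetical, and the price of a billable unit is not covered here.

```python
def billable_units(number_of_steps, succeeded=True):
    """Billable units for a run, following the rule stated above."""
    if not succeeded:
        return 0  # failed before training completed
    return max(100, number_of_steps)
```

Since `number_of_steps` has a documented minimum of 100, a successful run within the allowed range is billed for exactly its step count.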

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted, each audio clip is matched to its mask JSON and caption, each clip is fit to the `audio_duration_seconds` bucket, and each mask's time ranges are converted into a per-clip mask over the clip duration.
Training — the model regenerates the masked time spans while the kept audio is held as conditioning. Validation previews run at intervals.
Output — the LoRA, config, and combined audio preview are uploaded.
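To illustrate the idea of turning time ranges into a per-clip mask, the sketch below rasterizes ranges onto fixed-size frames. The resolution the trainer actually uses is not documented, so `frames_per_second` is purely illustrative and `ranges_to_mask` is a hypothetical helper.

```python
import math

def ranges_to_mask(ranges, duration, frames_per_second=25):
    """Return one boolean per frame; True marks a frame to regenerate."""
    n_frames = round(duration * frames_per_second)
    mask = [False] * n_frames
    for start, end in ranges:
        # Include any frame the range touches, clipped to the clip duration.
        first = max(0, int(start * frames_per_second))
        last = min(n_frames, math.ceil(end * frames_per_second))
        for i in range(first, last):
            mask[i] = True
    return mask
```

Frames left as `False` correspond to the kept audio that serves as conditioning.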

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
Clip/mask matching: each `<name>` clip is paired with its `<name>_mask.json` time ranges and optional `<name>.txt`. A clip without a mask JSON, an unparsable mask, or a mask that marks nothing causes a clear error.
Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
Mask interpretation: the `[start, end]` second ranges become the spans to regenerate; everything else is kept.
Captions: the trigger phrase (if set) is prepended.
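The extraction step above can be approximated locally to preview which files would be considered. This sketch uses Python's standard `zipfile` module; the exact filtering rules on the server are not documented, so skipping `__MACOSX` entries and dot-prefixed names is an assumption about what "macOS metadata and hidden files" covers.

```python
import io
import zipfile
from pathlib import PurePosixPath

def usable_members(zip_bytes):
    """List archive members, skipping directories, macOS metadata, and hidden files."""
    kept = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts
            if "__MACOSX" in parts or any(p.startswith(".") for p in parts):
                continue
            kept.append(info.filename)
    return sorted(kept)
```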

Tips for Getting Good Results

Dataset Quality

Use at least 10 examples representative of the kinds of spans you want regenerated.
Mark spans that genuinely need regenerating; leave clean audio outside them.
Pick an `audio_duration_seconds` that fits most clips so few are skipped, and keep your time ranges inside that duration.

Mask Best Practices

The mask JSON lists ranges to regenerate; everything outside is kept.
Provide at least one non-empty range per clip, with `start < end`, inside the clip duration.
Ranges entirely outside the clip duration are ignored — make sure at least one falls inside.
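These rules can be checked per mask file before the archive is built. The sketch below parses the JSON text and reports problems; `mask_errors` is a hypothetical helper, and its messages are illustrative rather than the errors the service returns.

```python
import json

def mask_errors(mask_text, clip_duration):
    """Return a list of problems found in the text of one mask JSON file."""
    try:
        ranges = json.loads(mask_text)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(ranges, list):
        return ["mask must be a JSON list of [start, end] pairs"]
    errors = []
    inside = 0
    for item in ranges:
        is_pair = (
            isinstance(item, list)
            and len(item) == 2
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in item)
        )
        if not is_pair:
            errors.append(f"not a [start, end] pair: {item!r}")
            continue
        start, end = item
        if not start < end:
            errors.append(f"empty or reversed range: {item!r}")
        elif start < clip_duration and end > 0:
            inside += 1  # at least part of the range lies inside the clip
    if inside == 0:
        errors.append("mask marks no region inside the clip")
    return errors
```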

Caption Best Practices

Describe what should appear in the regenerated spans (and the overall sound) plainly, optionally with a trigger phrase.
Keep captions consistent across examples.

Good caption: `a steady drumbeat with the snare replaced by a clap`
Weak caption: `drums`

Inference Format Matching

At inference, supply an audio clip and time ranges in seconds, and use the same caption style and trigger phrase. Keep durations and ranges in line with `audio_duration_seconds`.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/audio_inpaint_examples.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    {
      "prompt": "a steady drumbeat",
      "audio_url": "https://example.com/source.wav",
      "time_ranges": [[1.0, 2.0]]
    }
  ]
}
```
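A configuration like this could be submitted from Python roughly as follows. This is a sketch under several assumptions: the `fal-client` package is installed, a `FAL_KEY` credential is set in the environment, and the endpoint ID is `fal-ai/ltx23-trainer-v2/audio-inpaint`. The helpers `build_arguments` and `submit` are hypothetical, and the structure of the returned result is not shown here.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/audio-inpaint"

def build_arguments(training_data_url, validation=None, **overrides):
    """Start from the recommended configuration and apply any overrides."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "audio_duration_seconds": 5.0,
        "validation": validation or [],
    }
    arguments.update(overrides)
    return arguments

def submit(arguments):
    """Send the request and block until the run finishes (makes a network call)."""
    import fal_client  # third-party package: pip install fal-client
    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

For example, `submit(build_arguments("https://example.com/audio_inpaint_examples.zip", number_of_steps=1000))` would start a shorter run with the other defaults unchanged.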

Diagnosing Issues

Regenerated spans do not blend: add more examples; describe the desired content in captions; keep ranges well inside the clip.
"Mask marks no region" error: ensure each mask JSON has at least one `[start, end]` range with `start < end` inside the clip duration.
Many clips skipped: lower `audio_duration_seconds` to match your clips.
Overfitting: fewer steps, lower `rank`, more examples.

Validation Prompt Tips

Use a fresh clip and time ranges to gauge generalization.
Describe the content you expect in the regenerated spans.

Common Pitfalls

Empty, zero-length, or out-of-range mask time ranges (no region to regenerate).
Missing `<name>_mask.json` files or malformed JSON.
Time ranges outside the `audio_duration_seconds` window.

Endpoint ID: `fal-ai/ltx23-trainer-v2/audio-inpaint`
