LTX 2.3 Trainer (V2) - Backward Audio Extension (Training) API on fal

LTX 2.3 Trainer — Backward Audio Extension (`/audio-extend-suffix`)

Overview

The `/audio-extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to an audio clip — it extends audio backward in time. During training, the last few seconds of each clip are kept as a clean "suffix" and the model learns to generate the audio leading up to them. At inference you supply a closing audio clip and the model produces the preceding section. This is audio-only.

Key features:

Learns to extend audio backward, generating a plausible lead-in to a clean closing (suffix) window.
Audio-only — no video is processed or generated.
The suffix is carved from each clip's own audio automatically.
Fixed audio length bucket via `audio_duration_seconds`.
Validation previews extend a supplied clip backward; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:

Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
Captions: a `.txt` with the same base name as each audio clip (optional but recommended).

File names must be unique across the archive. Aim for at least 10 clips. Clips shorter than `audio_duration_seconds` are skipped, so make each clip at least that long. Because `conditioning_seconds` must be smaller than `audio_duration_seconds`, every clip that is kept has a lead-in for the model to learn.
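The archive rules above can be checked locally before uploading. The following is a minimal sketch, not the service's own validation: the helper names `check_archive_names` and `check_archive` are hypothetical, and "unique file names" is interpreted here as unique base file names regardless of folder, which is an assumption. Clip duration is not checked because that would require decoding the audio.

```python
import zipfile
from pathlib import PurePosixPath

# Audio extensions listed in the Dataset Format section.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def check_archive_names(names):
    """Inspect archive member names (hypothetical helper, not part of the API).

    Returns (audio_stems, uncaptioned_stems, problems).
    """
    seen, audio_stems, caption_stems, problems = set(), [], set(), []
    for name in names:
        path = PurePosixPath(name)
        # Skip directories, hidden files and macOS metadata folders.
        if name.endswith("/") or path.name.startswith(".") or "__MACOSX" in path.parts:
            continue
        # ASSUMPTION: uniqueness is judged on the file name, ignoring folders.
        if path.name in seen:
            problems.append(f"duplicate file name: {path.name}")
        seen.add(path.name)
        ext = path.suffix.lower()
        if ext in AUDIO_EXTS:
            audio_stems.append(path.stem)
        elif ext == ".txt":
            caption_stems.add(path.stem)
    # Captions are optional, so missing ones are reported separately.
    uncaptioned = [stem for stem in audio_stems if stem not in caption_stems]
    if len(audio_stems) < 10:
        problems.append(f"only {len(audio_stems)} audio clips; aim for at least 10")
    return audio_stems, uncaptioned, problems


def check_archive(zip_path):
    """Run the same checks on a .zip file on disk."""
    with zipfile.ZipFile(zip_path) as zf:
        return check_archive_names(zf.namelist())
```

Calling `check_archive("audio_clips.zip")` on a local archive returns the clips found, the clips without a caption file, and a list of human-readable problems.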

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio clips.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_seconds`

Type: `number` Default: `1.0` (range `0.1`–`30.0`)

Seconds of trailing audio kept as the clean suffix that the model leads up to. Must be less than `audio_duration_seconds` so that a lead-in remains to generate. Leave a margin: a value too close to `audio_duration_seconds` can leave nothing after rounding.

Audio Configuration

`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5`–`60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`audio_url` (`string`, required) — an audio clip to extend backward. Its closing `conditioning_seconds` are used as the suffix.

The validation clip must be at least `conditioning_seconds` long. When validation samples are provided, `conditioning_seconds` is also bounded by the preview length. The validation preview generates an audio span sized from `audio_duration_seconds`, capped at 121 frames divided by `validation_frame_rate`. A `conditioning_seconds` value that would fill that span is rejected at request time (HTTP 422). A very low `validation_frame_rate` shrinks this preview window, so keep `conditioning_seconds` comfortably below `audio_duration_seconds`.
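As a rough illustration of the preview bound, the sketch below takes the cap literally as 121 frames divided by `validation_frame_rate` and treats the preview span as the smaller of that cap and `audio_duration_seconds`. The function names are hypothetical, and any rounding or safety margin the service applies is not documented here, so treat the result as an estimate only.

```python
MAX_PREVIEW_FRAMES = 121  # frame cap quoted in the text above


def preview_span_seconds(audio_duration_seconds, validation_frame_rate=24):
    """Estimate the preview audio span in seconds.

    ASSUMPTION: span = min(audio_duration_seconds, 121 / frame_rate),
    with no extra rounding applied.
    """
    cap = MAX_PREVIEW_FRAMES / validation_frame_rate
    return min(audio_duration_seconds, cap)


def conditioning_fits_preview(conditioning_seconds, audio_duration_seconds,
                              validation_frame_rate=24):
    """True if the suffix leaves room for a generated lead-in in the preview."""
    span = preview_span_seconds(audio_duration_seconds, validation_frame_rate)
    return conditioning_seconds < span
```

With the defaults (5.0 s bucket, 24 fps) the cap is a little over 5 seconds, so the bucket itself is the limit. With a 10.0 s bucket at 24 fps the cap becomes the limit, and a `conditioning_seconds` of 6.0 would not fit under this estimate even though it is below `audio_duration_seconds`.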

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.
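The documented defaults and ranges can be checked on the client before sending a request. This is a minimal sketch under the assumption that the ranges listed above are inclusive. `validate_request` is a hypothetical helper, not part of any fal SDK, and the server may apply further checks.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def validate_request(args):
    """Return a list of problems found in a request dict (empty if none)."""
    errors = []
    if not args.get("training_data_url"):
        errors.append("training_data_url is required")
    if args.get("rank", 32) not in ALLOWED_RANKS:
        errors.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= args.get("number_of_steps", 2000) <= 20000:
        errors.append("number_of_steps must be in 100..20000")

    cond = args.get("conditioning_seconds", 1.0)
    dur = args.get("audio_duration_seconds", 5.0)
    if not 0.1 <= cond <= 30.0:
        errors.append("conditioning_seconds must be in 0.1..30.0")
    if not 0.5 <= dur <= 60.0:
        errors.append("audio_duration_seconds must be in 0.5..60.0")
    if cond >= dur:
        errors.append("conditioning_seconds must be less than audio_duration_seconds")

    if not 8 <= args.get("validation_frame_rate", 24) <= 60:
        errors.append("validation_frame_rate must be in 8..60")
    if not 0.0 <= args.get("stg_scale", 1.0) <= 3.0:
        errors.append("stg_scale must be in 0.0..3.0")

    validation = args.get("validation", [])
    if len(validation) > 2:
        errors.append("validation accepts at most 2 entries")
    for i, sample in enumerate(validation):
        if not sample.get("audio_url"):
            errors.append(f"validation[{i}].audio_url is required")
    return errors
```

An empty list means the request passes these local checks; it does not guarantee that the service will accept it.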

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`audio` — a combined preview of the generated validation audio.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors returned as HTTP 422, or dataset-download failures) are billed 0 units.
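The billing rule is simple enough to express directly. The sketch below is only an illustration of the formula as stated; `billable_units` is a hypothetical helper, and the price per unit is not covered here.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run, following the rule stated above."""
    if not succeeded:
        # Validation errors and dataset-download failures are billed 0 units.
        return 0
    return max(100, number_of_steps)
```

Because `number_of_steps` has a documented minimum of 100, the floor in `max(100, ...)` matches that minimum, so a successful run within the allowed range is billed exactly `number_of_steps` units.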

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, audio matched to captions, and each clip fit to the `audio_duration_seconds` bucket.
Training — for each clip, the closing `conditioning_seconds` of audio are held clean as the suffix and the model learns to generate the preceding lead-in. Validation previews run at intervals.
Output — the LoRA, config, and combined audio preview are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
File matching: each audio clip is paired with the `.txt` of the same base name.
Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
Suffix carving: the closing `conditioning_seconds` of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
Captions: the trigger phrase (if set) is prepended.
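The fitting and suffix-carving steps above can be pictured on a plain list of audio samples. This is a conceptual sketch only: the service's real preprocessing is not shown in this document, the function name `carve_suffix` is hypothetical, and keeping the start of a longer clip when trimming is an assumption.

```python
def carve_suffix(samples, sample_rate, audio_duration_seconds=5.0,
                 conditioning_seconds=1.0):
    """Split one clip into (lead_in_target, clean_suffix).

    Returns None when the clip is shorter than the bucket (it would be skipped).
    """
    bucket_len = round(audio_duration_seconds * sample_rate)
    suffix_len = round(conditioning_seconds * sample_rate)
    if len(samples) < bucket_len:
        return None  # shorter clips are skipped
    if not 0 < suffix_len < bucket_len:
        raise ValueError("conditioning_seconds must leave a lead-in to generate")
    # ASSUMPTION: trimming keeps the first bucket_len samples of a longer clip.
    clip = samples[:bucket_len]
    split = bucket_len - suffix_len
    # Everything before the split is the generation target;
    # the closing window is held clean as conditioning.
    return clip[:split], clip[split:]
```

For example, with a 5-second bucket and a 1-second suffix, the first 4 seconds of the fitted clip are the generation target and the final second is the conditioning window.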

Tips for Getting Good Results

Dataset Quality

Use at least 10 clean clips containing the kind of backward continuation (lead-in) you want learned.
Make each clip at least `audio_duration_seconds` long, and keep `conditioning_seconds` well below that so a meaningful lead-in remains.
Pick an `audio_duration_seconds` that fits most clips so few are skipped.

Caption Best Practices

Describe the sound and lead-in plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a drum fill building up to a cymbal crash`
Weak caption: `drums`

Trigger Phrases

Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.

Inference Format Matching

At inference, supply a closing audio clip at least `conditioning_seconds` long, use the same caption style and trigger phrase, and keep durations in line with `audio_duration_seconds`.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_seconds": 1.0,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "a drum fill building up", "audio_url": "https://example.com/closing.wav" }
  ]
}
```
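One way to send this configuration is through the `fal-client` Python package. The sketch below assumes that package is installed, that a `FAL_KEY` credential is set in the environment, and that the endpoint ID is `fal-ai/ltx23-trainer-v2/audio-extend-suffix`. The helpers `build_arguments` and `submit` are hypothetical names, and the example URLs are placeholders.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/audio-extend-suffix"


def build_arguments(training_data_url, closing_clip_url, prompt, trigger_phrase=""):
    """Assemble the recommended starting configuration as a request dict."""
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_seconds": 1.0,
        "audio_duration_seconds": 5.0,
        "validation": [{"prompt": prompt, "audio_url": closing_clip_url}],
    }


def submit(arguments):
    """Send the training request and block until it finishes.

    Imported lazily so the module loads without the third-party package.
    """
    import fal_client  # pip install fal-client; reads FAL_KEY from the environment

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)


if __name__ == "__main__":
    args = build_arguments(
        "https://example.com/audio_clips.zip",
        "https://example.com/closing.wav",
        "a drum fill building up",
    )
    result = submit(args)
    print(result)
```

The returned result is expected to carry the outputs described in the Outputs section; the exact shape of each entry is not specified here, so inspect the response before relying on particular fields.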

Diagnosing Issues

Lead-in does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_seconds` for more closing context.
Validation rejected for short clip: provide a clip at least `conditioning_seconds` long, or lower `conditioning_seconds`.
Window covers the whole clip: reduce `conditioning_seconds` relative to `audio_duration_seconds`.
Overfitting: fewer steps, lower `rank`, more clips.

Validation Prompt Tips

Use a fresh closing clip to gauge generalization.
Keep the prompt aligned with the lead-in you expect.

Common Pitfalls

`conditioning_seconds` too close to (or above) `audio_duration_seconds` — nothing left to generate.
`audio_duration_seconds` so long that most clips are skipped.
Background noise polluting the learned lead-in.

fal-ai/ltx23-trainer-v2/audio-extend-suffix
