LTX 2.3 Trainer (V2) - Forward Audio Extension (Training) API on fal

LTX 2.3 Trainer — Forward Audio Extension (`/audio-extend-prefix`)

Overview

The `/audio-extend-prefix` endpoint trains a LoRA for the LTX 2.3 model that continues an audio clip forward in time. During training, the first few seconds of each clip are kept as a clean "prefix" and the model learns to generate the audio that follows. At inference you supply an opening audio clip and the model produces the continuation. This is audio-only.

Key features:

Learns to extend audio forward from a clean opening (prefix) window.
Audio-only — no video is processed or generated.
The prefix is carved from each clip's own audio automatically.
Fixed audio length bucket via `audio_duration_seconds`.
Validation previews continue a supplied clip forward; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:

Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
Captions: a `.txt` with the same base name as each audio clip (optional but recommended).

File names must be unique across the archive. Aim for at least 10 clips. Clips shorter than `audio_duration_seconds` are skipped, so make each clip at least that long. Because `conditioning_seconds` must be smaller than `audio_duration_seconds`, every clip that is kept still has audio after the prefix for the model to learn to continue.
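The packaging rules above can be sketched as a small helper. This is a minimal illustration, not part of the fal tooling: `build_archive` is a hypothetical name, the archive is flattened so names cannot collide across folders, and it enforces unique base names (slightly stricter than the stated rule) so each caption maps to exactly one clip.

```python
import zipfile
from pathlib import Path

# Audio extensions listed in the Dataset Format section.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def build_archive(clip_dir, archive_path):
    """Zip audio clips plus same-named .txt captions; return the clip names added."""
    clip_dir = Path(clip_dir)
    seen_stems = set()
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for audio in sorted(clip_dir.rglob("*")):
            if not audio.is_file() or audio.name.startswith("."):
                continue  # skip directories and hidden files
            if audio.suffix.lower() not in AUDIO_EXTENSIONS:
                continue  # captions are picked up next to their clip below
            if audio.stem in seen_stems:
                # Assumption: reject shared base names so captions stay unambiguous.
                raise ValueError(f"duplicate clip base name: {audio.stem}")
            seen_stems.add(audio.stem)
            zf.write(audio, arcname=audio.name)  # flatten into the archive root
            caption = audio.with_suffix(".txt")
            if caption.is_file():  # captions are optional
                zf.write(caption, arcname=caption.name)
            added.append(audio.name)
    return added
```

The resulting `.zip` still has to be hosted somewhere reachable and its URL passed as `training_data_url`.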

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio clips.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_seconds`

Type: `number` Default: `1.0` (range `0.1`–`30.0`)

Seconds of leading audio kept as the clean prefix the model continues from. Must be less than `audio_duration_seconds` so there is a continuation to generate (and not so close to it that nothing is left after rounding).

Audio Configuration

`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5`–`60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration.

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt.
`audio_url` (`string`, required) — an audio clip to extend forward. Its opening `conditioning_seconds` are used as the prefix.

The validation clip must be at least `conditioning_seconds` long. When validation prompts are provided, `conditioning_seconds` is also bounded by the preview length. The validation preview generates an audio span sized from `audio_duration_seconds`, capped at 121 frames divided by `validation_frame_rate`; a `conditioning_seconds` value that would fill that span is rejected at request time (HTTP 422). A very low `validation_frame_rate` shrinks this preview window, so keep `conditioning_seconds` comfortably below `audio_duration_seconds`.
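A rough pre-flight check for these constraints might look like the sketch below. It assumes the preview span is simply `audio_duration_seconds` capped at 121 frames divided by `validation_frame_rate`; the service's exact frame rounding and rejection threshold are not documented here, so the function names and the comparison are illustrative only.

```python
MAX_PREVIEW_FRAMES = 121  # frame cap quoted in the validation notes


def preview_seconds(audio_duration_seconds, validation_frame_rate=24):
    """Approximate preview length in seconds (assumed: bucket capped at 121 frames)."""
    return min(audio_duration_seconds, MAX_PREVIEW_FRAMES / validation_frame_rate)


def conditioning_problems(conditioning_seconds, audio_duration_seconds,
                          validation_frame_rate=24, has_validation=True):
    """Return a list of likely problems; an empty list means the values look compatible."""
    problems = []
    if conditioning_seconds >= audio_duration_seconds:
        problems.append("conditioning_seconds must be less than audio_duration_seconds")
    if has_validation:
        # Frame rounding is not modeled, so this is only an approximation.
        span = preview_seconds(audio_duration_seconds, validation_frame_rate)
        if conditioning_seconds >= span:
            problems.append(
                f"conditioning_seconds fills the {span:.2f}s validation preview"
            )
    return problems
```

With the defaults (`conditioning_seconds` 1.0, `audio_duration_seconds` 5.0, 24 fps) the list is empty. Under this reading, 121 frames at 24 fps is about 5.04 seconds, so a 10-second bucket would be previewed at roughly 5 seconds.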

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`audio` — a combined preview of the generated validation audio.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
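The billing rule is simple enough to express directly. The helper below is a hypothetical estimator that mirrors the formula as stated; it returns units only, since the price per unit is not given here.

```python
def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run, following the stated max(100, number_of_steps) rule."""
    if not succeeded:
        return 0  # failures before training completes are billed 0 units
    return max(100, number_of_steps)
```

For the default `number_of_steps` of 2000, this gives 2000 units.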

How the Training Works

Pipeline Overview

Preprocessing: the archive is extracted, audio files are matched to captions, and each clip is fit to the `audio_duration_seconds` bucket.
Training: for each clip, the opening `conditioning_seconds` of audio are held clean as the prefix and the model learns to generate the rest. Validation previews run at intervals.
Output: the LoRA, config, and combined audio preview are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
File matching: each audio clip is paired with the `.txt` of the same base name.
Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
Prefix carving: the opening `conditioning_seconds` of each clip are kept clean as conditioning; the remainder is the generation target. This happens automatically.
Captions: the trigger phrase (if set) is prepended.
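To make the fitting and prefix-carving steps concrete, here is a toy sketch on raw sample lists. It is only an illustration of the idea: the function names are hypothetical, trimming from the start of the clip is an assumption (the text does not say where trimming happens), and the service's actual processing, including normalization and pitch-preserving fitting, is not reproduced.

```python
def fit_to_bucket(samples, sample_rate, audio_duration_seconds):
    """Trim a clip to the bucket length; return None when it is too short (skipped)."""
    target = round(audio_duration_seconds * sample_rate)
    if len(samples) < target:
        return None
    return samples[:target]  # assumption: keep the start, drop the tail


def carve_prefix(samples, sample_rate, conditioning_seconds):
    """Split a fitted clip into (clean prefix, continuation target)."""
    split = round(conditioning_seconds * sample_rate)
    if not 0 < split < len(samples):
        raise ValueError("conditioning window must leave a continuation to generate")
    return samples[:split], samples[split:]
```

With a 5-second bucket and a 1-second prefix, a fitted clip splits into 1 second of conditioning and 4 seconds of target audio.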

Tips for Getting Good Results

Dataset Quality

Use at least 10 clean clips containing the kind of forward continuation you want learned.
Clips should be meaningfully longer than `conditioning_seconds`.
Pick an `audio_duration_seconds` that fits most clips so few are skipped.
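One way to pick a bucket is to measure clip durations beforehand and see how many clips each candidate would keep. The sketch below assumes you already have the durations in seconds (from any audio tool); `kept_fraction`, `choose_bucket`, and the 90% threshold are illustrative choices, not service behavior.

```python
def kept_fraction(clip_durations, audio_duration_seconds):
    """Fraction of clips at least as long as the bucket (shorter ones are skipped)."""
    if not clip_durations:
        return 0.0
    kept = sum(1 for d in clip_durations if d >= audio_duration_seconds)
    return kept / len(clip_durations)


def choose_bucket(clip_durations, candidates, min_kept_fraction=0.9):
    """Return the longest candidate that keeps enough clips, or None if none does."""
    for bucket in sorted(candidates, reverse=True):
        if kept_fraction(clip_durations, bucket) >= min_kept_fraction:
            return bucket
    return None
```

Preferring the longest bucket that still keeps most clips gives the model more continuation to learn from per clip, at the cost of dropping a few short ones.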

Caption Best Practices

Describe the sound and continuation plainly, optionally with a trigger phrase.
Keep captions consistent across clips.

Good caption: `a piano melody continuing into a flowing arpeggio`
Weak caption: `piano`

Trigger Phrases

Use a distinctive trigger phrase to invoke a particular continuation style; include it in every caption and at inference.

Inference Format Matching

At inference, supply an opening audio clip at least `conditioning_seconds` long, use the same caption style and trigger phrase, and keep durations in line with `audio_duration_seconds`.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_seconds": 1.0,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "a piano melody continuing", "audio_url": "https://example.com/opening.wav" }
  ]
}
```
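A request with this configuration could be submitted from Python roughly as follows. This is a sketch under assumptions: it uses the `fal_client` package (`pip install fal-client`, with the `FAL_KEY` environment variable set), the endpoint ID shown on this page, and hypothetical helper names `build_arguments` and `submit`. The shape of the returned result is not confirmed here beyond the output names listed under Outputs.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/audio-extend-prefix"


def build_arguments(training_data_url, validation_audio_url=None,
                    prompt="a piano melody continuing"):
    """Assemble the recommended starting configuration as a request payload."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_seconds": 1.0,
        "audio_duration_seconds": 5.0,
    }
    if validation_audio_url:
        # Each validation sample needs an audio_url; the prompt is free text.
        arguments["validation"] = [
            {"prompt": prompt, "audio_url": validation_audio_url}
        ]
    return arguments


def submit(arguments):
    """Send the training request and block until it finishes."""
    import fal_client  # imported lazily so the payload helper works without it

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)
```

Calling `submit(build_arguments("https://example.com/audio_clips.zip", "https://example.com/opening.wav"))` would start a run; the result is expected to describe `lora_file`, `config_file`, and `audio`.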

Diagnosing Issues

Continuation drifts from the prefix: add more representative clips; try a slightly longer `conditioning_seconds` for more context.
Validation rejected for short clip: provide a clip at least `conditioning_seconds` long, or lower `conditioning_seconds`.
Window covers the whole clip: reduce `conditioning_seconds` relative to `audio_duration_seconds`.
Overfitting: fewer steps, lower `rank`, more clips.

Validation Prompt Tips

Use a fresh opening clip to gauge generalization.
Keep the prompt aligned with the continuation you expect.

Common Pitfalls

`conditioning_seconds` too close to (or above) `audio_duration_seconds` — nothing left to generate.
`audio_duration_seconds` so long that most clips are skipped.
Background noise polluting the learned continuation.

Endpoint ID: `fal-ai/ltx23-trainer-v2/audio-extend-prefix`
