fal-ai/ltx23-trainer-v2/audio-extend-prefix

Train a LoRA that continues an audio clip forward in time, generating the audio that follows a short clean prefix.
Training
Commercial use

The cost of training depends on the number of steps. The formula is: 0.0022 * steps (in USD). For example, a 1000-step run costs $2.20.

LTX 2.3 Trainer — Forward Audio Extension (`/audio-extend-prefix`)

Overview

The `/audio-extend-prefix` endpoint trains a LoRA for the LTX 2.3 model that continues an audio clip forward in time. During training, the first few seconds of each clip are kept as a clean "prefix" and the model learns to generate the audio that follows. At inference you supply an opening audio clip and the model produces the continuation. This is audio-only.

Key features:

  • Learns to extend audio forward from a clean opening (prefix) window.
  • Audio-only — no video is processed or generated.
  • The prefix is carved from each clip's own audio automatically.
  • Fixed audio length bucket via `audio_duration_seconds`.
  • Validation previews continue a supplied clip forward; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:

  • Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
  • Captions: a `.txt` with the same base name as each audio clip (optional but recommended).

File names must be unique across the archive. Aim for at least 10 clips. Clips shorter than `audio_duration_seconds` are skipped, so make each clip at least that long. Because `conditioning_seconds` must be smaller than `audio_duration_seconds`, every clip that is kept still has audio after the prefix for the model to learn as the continuation.
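The packaging rules above can be sketched as a small helper. This is a minimal illustration, not part of the trainer: `build_archive` is a hypothetical name, the flat archive layout is an assumption, and rejecting clips that share a base name (for example `a.wav` and `a.mp3`) is a stricter reading of the uniqueness rule, chosen so each clip maps to exactly one caption file.

```python
import zipfile
from pathlib import Path

# Audio extensions listed in the dataset format above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def build_archive(src_dir: str, zip_path: str) -> list[str]:
    """Zip audio clips and their same-named .txt captions into a flat archive.

    Returns the audio file names that were added. Raises ValueError when two
    clips share a base name (an assumption: stricter than the documented rule).
    """
    src = Path(src_dir)
    seen_stems: set[str] = set()
    added: list[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for audio in sorted(src.rglob("*")):
            if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTS:
                continue
            if audio.stem in seen_stems:
                raise ValueError(f"duplicate base name: {audio.stem}")
            seen_stems.add(audio.stem)
            zf.write(audio, arcname=audio.name)
            # Captions are optional: include one only if it sits next to the clip.
            caption = audio.with_suffix(".txt")
            if caption.is_file():
                zf.write(caption, arcname=caption.name)
            added.append(audio.name)
    return added
```

The resulting `.zip` still has to be uploaded somewhere reachable so its URL can be passed as `training_data_url`.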

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio clips.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_seconds`

Type: `number` Default: `1.0` (range `0.1` to `30.0`)

Seconds of leading audio kept as the clean prefix the model continues from. Must be less than `audio_duration_seconds` so there is a continuation to generate (and not so close to it that nothing is left after rounding).

Audio Configuration
`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5` to `60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `audio_url` (`string`, required) — an audio clip to extend forward. Its opening `conditioning_seconds` are used as the prefix.

The validation clip must be at least `conditioning_seconds` long. When validation samples are provided, `conditioning_seconds` is also bounded by the preview length. The validation preview generates an audio span sized from `audio_duration_seconds`, capped at 121 frames divided by `validation_frame_rate`. A `conditioning_seconds` value that would fill that span is rejected at request time (HTTP 422). A very low `validation_frame_rate` shrinks this preview window, so keep `conditioning_seconds` comfortably below `audio_duration_seconds`.
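As a rough illustration of the preview bound described above, the sketch below models only the stated cap of 121 frames divided by `validation_frame_rate`. The function names are hypothetical, and it ignores any rounding of the span to whole frames, which the text does not specify, so treat the result as an upper bound rather than the service's exact check.

```python
MAX_PREVIEW_FRAMES = 121  # cap stated in the text above


def preview_seconds(audio_duration_seconds: float, validation_frame_rate: int = 24) -> float:
    """Upper bound on the preview audio span, ignoring frame rounding."""
    return min(audio_duration_seconds, MAX_PREVIEW_FRAMES / validation_frame_rate)


def prefix_fits_preview(
    conditioning_seconds: float,
    audio_duration_seconds: float,
    validation_frame_rate: int = 24,
) -> bool:
    """True if the prefix leaves some of the preview span to generate."""
    return conditioning_seconds < preview_seconds(audio_duration_seconds, validation_frame_rate)
```

For example, with a 10-second bucket at 24 frames per second the cap applies (121 / 24 is about 5.04 seconds), so a 6-second prefix would not fit even though it is shorter than `audio_duration_seconds`.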

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `audio` — a combined preview of the generated validation audio.
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
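The billing rule can be sketched as a quick estimate. The helper names are hypothetical, and pricing each billable unit at $0.0022 is an assumption taken from the per-step cost formula quoted on this page, not a documented unit price.

```python
PRICE_PER_UNIT_USD = 0.0022  # assumption: per-step price applied per billable unit


def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps: int, succeeded: bool = True) -> float:
    """Approximate cost in USD under the assumed unit price."""
    return round(PRICE_PER_UNIT_USD * billable_units(number_of_steps, succeeded), 4)
```

Under these assumptions a successful 1000-step run comes to $2.20, and a request rejected with HTTP 422 costs nothing.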

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, audio matched to captions, and each clip fit to the `audio_duration_seconds` bucket.
  2. Training — for each clip, the opening `conditioning_seconds` of audio are held clean as the prefix and the model learns to generate the rest. Validation previews run at intervals.
  3. Output — the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • File matching: each audio clip is paired with the `.txt` of the same base name.
  • Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
  • Prefix carving: the opening `conditioning_seconds` of each clip are kept clean as conditioning; the remainder is the generation target. This happens automatically.
  • Captions: the trigger phrase (if set) is prepended.
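The fitting and prefix-carving steps above can be pictured with plain sample lists. This is only a conceptual sketch: the function names are hypothetical, trimming from the start of the clip is an assumption (the text does not say which part of a long clip is kept), normalization and pitch fitting are left out, and the trainer itself may operate on a different representation than raw samples.

```python
def fit_to_bucket(samples, sample_rate, audio_duration_seconds):
    """Trim a clip to the bucket length, or return None if it is too short."""
    target = int(round(audio_duration_seconds * sample_rate))
    if len(samples) < target:
        return None  # shorter clips are skipped
    return samples[:target]  # assumption: keep the start of longer clips


def carve_prefix(samples, sample_rate, conditioning_seconds):
    """Split a fitted clip into (clean prefix, continuation target)."""
    split = int(round(conditioning_seconds * sample_rate))
    if not 0 < split < len(samples):
        raise ValueError("prefix must be non-empty and leave a continuation")
    return samples[:split], samples[split:]
```

With the defaults (`conditioning_seconds` of 1.0 and `audio_duration_seconds` of 5.0), each kept clip would contribute one second of clean prefix and four seconds of continuation target.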

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 clean clips containing the kind of forward continuation you want learned.
  • Clips should be meaningfully longer than `conditioning_seconds`.
  • Pick an `audio_duration_seconds` that fits most clips so few are skipped.
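One way to choose a bucket length is to count how many clips would survive each candidate value before uploading. The sketch below assumes the clip durations have already been measured (in seconds) with a tool of your choice; `kept_count` and `survey_buckets` are hypothetical helper names.

```python
def kept_count(clip_durations, audio_duration_seconds):
    """Number of clips at least as long as the bucket (shorter ones are skipped)."""
    return sum(1 for d in clip_durations if d >= audio_duration_seconds)


def survey_buckets(clip_durations, candidates):
    """Map each candidate bucket length to the number of clips it would keep."""
    return {c: kept_count(clip_durations, c) for c in candidates}
```

A longer bucket gives the model more continuation to learn per clip but discards more of the dataset, so the survey makes that trade-off visible.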
Caption Best Practices
  • Describe the sound and continuation plainly, optionally with a trigger phrase.
  • Keep captions consistent across clips.

Good caption: `a piano melody continuing into a flowing arpeggio`

Weak caption: `piano`

Trigger Phrases
  • Use a distinctive trigger phrase to invoke a particular continuation style; include it in every caption and at inference.
Inference Format Matching

At inference, supply an opening audio clip at least `conditioning_seconds` long, use the same caption style and trigger phrase, and keep durations in line with `audio_duration_seconds`.

Example training request:

```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_seconds": 1.0,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "a piano melody continuing", "audio_url": "https://example.com/opening.wav" }
  ]
}
```
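Before submitting a request like the one above, it can help to check it locally against the ranges in the parameter reference. The sketch below is a hypothetical pre-flight check, not the service's own validation: `check_request` is an invented name, the ranges are assumed to be inclusive, and passing it does not guarantee the request avoids an HTTP 422.

```python
# Ranges copied from the parameter reference; assumed inclusive at both ends.
RANGES = {
    "number_of_steps": (100, 20000),
    "conditioning_seconds": (0.1, 30.0),
    "audio_duration_seconds": (0.5, 60.0),
    "validation_frame_rate": (8, 60),
    "stg_scale": (0.0, 3.0),
}
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def check_request(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not payload.get("training_data_url"):
        problems.append("training_data_url is required")
    if payload.get("rank", 32) not in ALLOWED_RANKS:
        problems.append("rank must be one of 8, 16, 32, 64, 128")
    for name, (low, high) in RANGES.items():
        if name in payload and not low <= payload[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    # Fall back to the documented defaults when a field is omitted.
    conditioning = payload.get("conditioning_seconds", 1.0)
    duration = payload.get("audio_duration_seconds", 5.0)
    if conditioning >= duration:
        problems.append("conditioning_seconds must be less than audio_duration_seconds")
    validation = payload.get("validation", [])
    if len(validation) > 2:
        problems.append("validation accepts at most 2 entries")
    for index, sample in enumerate(validation):
        if not sample.get("audio_url"):
            problems.append(f"validation[{index}] is missing audio_url")
    return problems
```

Run against the example request above, the check returns an empty list; raising `conditioning_seconds` to 5.0 would be flagged because no continuation is left to generate.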
Diagnosing Issues
  • Continuation drifts from the prefix: add more representative clips; try a slightly longer `conditioning_seconds` for more context.
  • Validation rejected for short clip: provide a clip at least `conditioning_seconds` long, or lower `conditioning_seconds`.
  • Window covers the whole clip: reduce `conditioning_seconds` relative to `audio_duration_seconds`.
  • Overfitting: fewer steps, lower `rank`, more clips.
Validation Prompt Tips
  • Use a fresh opening clip to gauge generalization.
  • Keep the prompt aligned with the continuation you expect.
Common Pitfalls
  • `conditioning_seconds` too close to (or above) `audio_duration_seconds` — nothing left to generate.
  • `audio_duration_seconds` so long that most clips are skipped.
  • Background noise polluting the learned continuation.