fal-ai/ltx23-trainer-v2/audio-extend-suffix

Train a LoRA that generates the lead-in to an audio clip, extending audio backward in time from its ending.
Training
Commercial use

The cost of training depends on the number of steps: cost = $0.0023 × steps. For example, 1000 steps cost $2.30.

LTX 2.3 Trainer — Backward Audio Extension (`/audio-extend-suffix`)

Overview

The `/audio-extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to an audio clip — it extends audio backward in time. During training, the last few seconds of each clip are kept as a clean "suffix" and the model learns to generate the audio leading up to them. At inference you supply a closing audio clip and the model produces the preceding section. This is audio-only.

Key features:

  • Learns to extend audio backward, generating a plausible lead-in to a clean closing (suffix) window.
  • Audio-only — no video is processed or generated.
  • The suffix is carved from each clip's own audio automatically.
  • Fixed audio length bucket via `audio_duration_seconds`.
  • Validation previews extend a supplied clip backward; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:

  • Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
  • Captions: a `.txt` with the same base name as each audio clip (optional but recommended).

File names must be unique across the archive. Aim for at least 10 clips. Clips shorter than `audio_duration_seconds` are skipped, so make each clip at least that long. Because `conditioning_seconds` must be smaller than `audio_duration_seconds`, every clip that is kept leaves a lead-in for the model to learn.
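
The naming rules above can be checked locally before uploading. The following is a minimal sketch, not part of the service: `check_archive_names` and `check_zip` are hypothetical helper names, and the filtering of hidden files and `__MACOSX` entries is an assumption modeled on the preprocessing notes in this document. It does not check clip durations.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Audio extensions listed in the dataset format section.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def check_archive_names(names):
    """Summarize archive member paths.

    Returns (clips, missing_captions, duplicates), where `clips` maps each
    audio file name to its caption file name, or None if no caption exists.
    """
    paths = [PurePosixPath(n) for n in names if not n.endswith("/")]
    # Assumption: hidden files and macOS metadata are ignored.
    paths = [p for p in paths
             if not p.name.startswith(".") and "__MACOSX" not in p.parts]
    counts = Counter(p.name for p in paths)
    duplicates = sorted(name for name, c in counts.items() if c > 1)
    captions = {p.stem: p.name for p in paths if p.suffix.lower() == ".txt"}
    clips = {p.name: captions.get(p.stem)
             for p in paths if p.suffix.lower() in AUDIO_EXTS}
    missing = sorted(name for name, cap in clips.items() if cap is None)
    return clips, missing, duplicates


def check_zip(path):
    """Run the same check against a .zip file on disk."""
    with zipfile.ZipFile(path) as zf:
        return check_archive_names(zf.namelist())
```

A run that reports duplicates, or fewer than 10 clips, is a hint to tidy the archive before submitting it.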

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio clips.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`conditioning_seconds`

Type: `number` Default: `1.0` (range `0.1` to `30.0`)

Seconds of trailing audio kept as the clean suffix that the model leads up to. Must be less than `audio_duration_seconds`, with enough margin that a lead-in still remains to generate after rounding.

Audio Configuration
`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5` to `60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
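
As an illustration of what peak normalization means in general, the sketch below scales a clip so that its largest absolute sample reaches a target level. The `peak_normalize` name and the `target_peak` of 1.0 are assumptions for illustration; the target level the trainer actually uses is not documented here.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    `target_peak` is an assumed value, not a documented trainer setting.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        # Silent or empty clip: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Because every clip ends up with the same peak, quiet and loud recordings contribute at comparable levels.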

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `audio_url` (`string`, required) — an audio clip to extend backward. Its closing `conditioning_seconds` are used as the suffix.

The validation clip must be at least `conditioning_seconds` long. When validation prompts are provided, `conditioning_seconds` is also bounded by the preview length. The validation preview generates an audio span sized from `audio_duration_seconds`, capped at 121 frames divided by `validation_frame_rate`. A `conditioning_seconds` value that would fill that span is rejected at request time (HTTP 422). A very low `validation_frame_rate` shrinks this preview window, so keep `conditioning_seconds` comfortably below `audio_duration_seconds`.
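
The preview bound can be estimated before submitting a request. This is a rough sketch under a literal reading of the cap described above (121 frames divided by `validation_frame_rate`). The function names are hypothetical, frame rounding is not modeled, and the service's actual check may differ.

```python
MAX_PREVIEW_FRAMES = 121  # cap stated in the validation notes


def preview_span_seconds(audio_duration_seconds, validation_frame_rate=24):
    """Approximate preview length: the audio bucket, capped at 121 frames."""
    cap = MAX_PREVIEW_FRAMES / validation_frame_rate
    return min(audio_duration_seconds, cap)


def conditioning_fits_preview(conditioning_seconds, audio_duration_seconds,
                              validation_frame_rate=24):
    """True if the suffix leaves some of the preview span to generate."""
    span = preview_span_seconds(audio_duration_seconds, validation_frame_rate)
    return conditioning_seconds < span
```

With the defaults (`audio_duration_seconds` of 5.0 at 24 frames per second) the cap is about 5.04 seconds, so the 5-second bucket is the binding limit.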

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `audio` — a combined preview of the generated validation audio.
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed as `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
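
The billing rule can be turned into a quick estimate. The sketch below assumes that one billable unit is priced at the $0.0023 per-step rate quoted in the pricing note. That link between units and price is an assumption, and the function names are illustrative.

```python
PRICE_PER_UNIT_USD = 0.0023  # assumption: per-step price applies per unit


def billable_units(number_of_steps, succeeded=True):
    """Units billed: max(100, steps) on success, 0 on early failure."""
    if not succeeded:
        return 0
    return max(100, number_of_steps)


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in USD, rounded to 4 decimal places."""
    units = billable_units(number_of_steps, succeeded)
    return round(PRICE_PER_UNIT_USD * units, 4)
```

Under these assumptions a 1000-step run comes to $2.30 and the default 2000-step run to $4.60.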

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, audio files are matched to captions, and each clip is fit to the `audio_duration_seconds` bucket.
  2. Training — for each clip, the closing `conditioning_seconds` of audio are held clean as the suffix and the model learns to generate the preceding lead-in. Validation previews run at intervals.
  3. Output — the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
  • File matching: each audio clip is paired with the `.txt` of the same base name.
  • Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
  • Suffix carving: the closing `conditioning_seconds` of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
  • Captions: the trigger phrase (if set) is prepended.
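
The fitting and suffix-carving steps above can be pictured at the sample level. This is an illustrative sketch only: `carve_suffix` is a hypothetical name, the trainer may operate on a different representation than raw samples, and which end of a long clip is trimmed is not documented, so keeping the start of the clip is an assumption here.

```python
def carve_suffix(samples, sample_rate, audio_duration_seconds=5.0,
                 conditioning_seconds=1.0):
    """Split a clip into (lead_in_target, clean_suffix).

    Returns None when the clip is shorter than the bucket (it is skipped).
    """
    if conditioning_seconds >= audio_duration_seconds:
        raise ValueError("conditioning_seconds must be < audio_duration_seconds")
    bucket_len = int(round(audio_duration_seconds * sample_rate))
    suffix_len = int(round(conditioning_seconds * sample_rate))
    if len(samples) < bucket_len:
        return None  # shorter than the bucket: skipped
    # ASSUMPTION: long clips are trimmed by keeping their first bucket_len samples.
    clip = samples[:bucket_len]
    split = bucket_len - suffix_len
    if split <= 0:
        raise ValueError("nothing left to generate after rounding")
    # Everything before the split is the generation target;
    # the closing window is kept clean as conditioning.
    return clip[:split], clip[split:]
```

With the defaults, a 5-second bucket yields a 4-second generation target followed by a 1-second clean suffix.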

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 clean clips containing the kind of backward continuation (lead-in) you want learned.
  • Clips should be meaningfully longer than `conditioning_seconds`.
  • Pick an `audio_duration_seconds` that fits most clips so few are skipped.
Caption Best Practices
  • Describe the sound and lead-in plainly, optionally with a trigger phrase.
  • Keep captions consistent across clips.

Good caption: `a drum fill building up to a cymbal crash`

Weak caption: `drums`

Trigger Phrases
  • Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.
Inference Format Matching

At inference, supply a closing audio clip at least `conditioning_seconds` long, use the same caption style and trigger phrase, and keep durations in line with `audio_duration_seconds`.

Example training request:

```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_seconds": 1.0,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "a drum fill building up", "audio_url": "https://example.com/closing.wav" }
  ]
}
```
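
A request body like the one above can be assembled and sanity-checked in code before it is sent. The sketch below is a hypothetical client-side helper, not part of any official SDK. It only mirrors the ranges and constraints listed in the parameter reference, so the server remains the authority on what is accepted.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def build_training_request(training_data_url, *, trigger_phrase="", rank=32,
                           number_of_steps=2000, learning_rate=0.0002,
                           conditioning_seconds=1.0,
                           audio_duration_seconds=5.0, validation=()):
    """Return a request body dict; raise ValueError on documented violations."""
    errors = []
    if rank not in ALLOWED_RANKS:
        errors.append(f"rank must be one of {sorted(ALLOWED_RANKS)}")
    if not 100 <= number_of_steps <= 20000:
        errors.append("number_of_steps must be between 100 and 20000")
    if not 0.1 <= conditioning_seconds <= 30.0:
        errors.append("conditioning_seconds must be between 0.1 and 30.0")
    if not 0.5 <= audio_duration_seconds <= 60.0:
        errors.append("audio_duration_seconds must be between 0.5 and 60.0")
    if conditioning_seconds >= audio_duration_seconds:
        errors.append("conditioning_seconds must be < audio_duration_seconds")
    validation = list(validation)
    if len(validation) > 2:
        errors.append("validation accepts at most 2 entries")
    if any("audio_url" not in sample for sample in validation):
        errors.append("each validation entry needs an audio_url")
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": rank,
        "number_of_steps": number_of_steps,
        "learning_rate": learning_rate,
        "conditioning_seconds": conditioning_seconds,
        "audio_duration_seconds": audio_duration_seconds,
        "validation": validation,
    }
```

The returned dict can be serialized with `json.dumps` and submitted with whichever HTTP or SDK client you already use.
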
Diagnosing Issues
  • Lead-in does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_seconds` for more closing context.
  • Validation rejected for short clip: provide a clip at least `conditioning_seconds` long, or lower `conditioning_seconds`.
  • Window covers the whole clip: reduce `conditioning_seconds` relative to `audio_duration_seconds`.
  • Overfitting: fewer steps, lower `rank`, more clips.
Validation Prompt Tips
  • Use a fresh closing clip to gauge generalization.
  • Keep the prompt aligned with the lead-in you expect.
Common Pitfalls
  • `conditioning_seconds` too close to (or above) `audio_duration_seconds` — nothing left to generate.
  • `audio_duration_seconds` so long that most clips are skipped.
  • Background noise polluting the learned lead-in.