fal-ai/ltx23-trainer-v2/audio-extend-suffix
The cost of training depends on the number of steps. The formula is: 0.0023 * steps. With 1000 steps, your request will cost $2.30.
LTX 2.3 Trainer — Backward Audio Extension (`/audio-extend-suffix`)
Overview
The `/audio-extend-suffix` endpoint trains a LoRA for the LTX 2.3 model that generates the lead-in to an audio clip — it extends audio backward in time. During training, the last few seconds of each clip are kept as a clean "suffix" and the model learns to generate the audio leading up to them. At inference you supply a closing audio clip and the model produces the preceding section. This is audio-only.
Key features:
- Learns to extend audio backward, generating a plausible lead-in to a clean closing (suffix) window.
- Audio-only — no video is processed or generated.
- The suffix is carved from each clip's own audio automatically.
- Fixed audio length bucket via `audio_duration_seconds`.
- Validation previews extend a supplied clip backward; the output preview is audio.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:
- Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
- Captions: a `.txt` with the same base name as each audio clip (optional but recommended).
File names must be unique across the archive. Aim for at least 10 clips. Clips shorter than `audio_duration_seconds` are skipped, so make each clip at least that long. Because `conditioning_seconds` must be smaller than `audio_duration_seconds`, every kept clip leaves a lead-in to learn.
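The layout rules above can be checked locally before uploading. The following is a minimal sketch, not the service's own validation: the function name `check_layout` is hypothetical, and it assumes that "unique" refers to the file name without its folder and that hidden files and `__MACOSX` entries can be ignored.

```python
from pathlib import PurePosixPath

# Audio extensions listed in the dataset format above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def check_layout(member_names):
    """Summarize audio/caption pairing for a list of archive member names."""
    audio_stems = []      # base names of audio clips, one entry per clip
    caption_stems = set()  # base names of .txt caption files
    seen = set()
    duplicates = set()
    for name in member_names:
        path = PurePosixPath(name)
        # Skip directories, hidden files, and macOS metadata (assumption).
        if name.endswith("/") or path.name.startswith(".") or "__MACOSX" in path.parts:
            continue
        # Assumption: uniqueness is judged on the file name, ignoring folders.
        if path.name in seen:
            duplicates.add(path.name)
        seen.add(path.name)
        suffix = path.suffix.lower()
        if suffix in AUDIO_EXTS:
            audio_stems.append(path.stem)
        elif suffix == ".txt":
            caption_stems.add(path.stem)
    return {
        "clips": len(audio_stems),
        "missing_captions": sorted({s for s in audio_stems if s not in caption_stems}),
        "duplicates": sorted(duplicates),
        "enough_clips": len(audio_stems) >= 10,  # "at least 10 clips" guideline
    }
```

To run it against a real archive, pass `zipfile.ZipFile("audio_clips.zip").namelist()` from the standard library as `member_names`. Clip durations are not checked here, so clips shorter than `audio_duration_seconds` would still need to be caught separately.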
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of audio clips.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
`conditioning_seconds`
Type: `number`
Default: `1.0` (range `0.1`–`30.0`)
Seconds of trailing audio kept as the clean suffix the model leads up to. Must be less than `audio_duration_seconds` so there is a lead-in to generate (and not so close to it that nothing is left after rounding).
Audio Configuration
`audio_duration_seconds`
Type: `number`
Default: `5.0` (range `0.5`–`60.0`)
Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness across the dataset.
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to the target duration.
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `audio_url` (`string`, required): an audio clip to extend backward. Its closing `conditioning_seconds` are used as the suffix.
The validation clip must be at least `conditioning_seconds` long. When validation prompts are provided, `conditioning_seconds` is also bounded by the preview length: the validation preview generates an audio span sized from `audio_duration_seconds` (capped at 121 frames divided by `validation_frame_rate`), and a `conditioning_seconds` that would fill that span is rejected at request time (HTTP 422). A very low `validation_frame_rate` shrinks this preview window, so keep `conditioning_seconds` comfortably below `audio_duration_seconds`.
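The bound described above can be approximated before sending a request. This is a rough sketch under stated assumptions: the helper names are hypothetical, the preview span is modeled as `audio_duration_seconds` capped at 121 frames divided by `validation_frame_rate`, and any frame rounding the service applies is ignored, so the real check may be stricter.

```python
MAX_PREVIEW_FRAMES = 121  # cap quoted in the text above


def preview_seconds(audio_duration_seconds, validation_frame_rate=24):
    """Approximate preview length: the audio bucket, capped by the frame limit."""
    return min(audio_duration_seconds, MAX_PREVIEW_FRAMES / validation_frame_rate)


def conditioning_fits(conditioning_seconds, audio_duration_seconds, validation_frame_rate=24):
    """True if the suffix leaves some preview span to generate (rounding ignored)."""
    return conditioning_seconds < preview_seconds(audio_duration_seconds, validation_frame_rate)
```

With the defaults (`audio_duration_seconds` of 5.0 and `validation_frame_rate` of 24) the cap is about 5.04 seconds, so the preview is sized by the 5.0-second bucket and a 1.0-second suffix fits easily.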
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Used together with the audio bucket to size the preview audio length.
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Note: this audio-only mode also accepts the shared video/validation-video fields, but they have no effect — no video is processed.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `audio`: a combined preview of the generated validation audio.
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
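The billing rule can be written out as a small calculation. This is only a sketch: the function names are hypothetical, and it assumes each billable unit is priced at the $0.0023 per-step rate quoted in the pricing note at the top of this page, which the text does not state explicitly.

```python
PRICE_PER_UNIT_USD = 0.0023  # assumption: one billable unit costs the per-step rate


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run: max(100, steps) on success, 0 on early failure."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated cost in USD under the unit-price assumption above."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 4)
```

Under these assumptions, a successful run at the default 2000 steps would come to about $4.60, and a request rejected with HTTP 422 would cost nothing.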
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, audio is matched to captions, and each clip is fit to the `audio_duration_seconds` bucket.
- Training: for each clip, the closing `conditioning_seconds` of audio are held clean as the suffix and the model learns to generate the preceding lead-in. Validation previews run at intervals.
- Output: the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- File matching: each audio clip is paired with the `.txt` of the same base name.
- Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (shorter clips skipped, longer ones trimmed; normalized and pitch-fit as configured).
- Suffix carving: the closing `conditioning_seconds` of each clip are kept clean as conditioning; everything before is the generation target. This happens automatically.
- Captions: the trigger phrase (if set) is prepended.
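The audio-fitting and suffix-carving steps above can be illustrated on raw samples. This is a conceptual sketch only: the trainer presumably operates on encoded audio rather than raw samples, the function name `fit_and_carve` is hypothetical, and trimming from the end of longer clips is an assumption (the text does not say which end is cut).

```python
def fit_and_carve(samples, sample_rate, audio_duration_seconds=5.0, conditioning_seconds=1.0):
    """Fit a clip to the duration bucket, then split it into (lead_in, suffix).

    Returns None when the clip is shorter than the bucket (it would be skipped).
    """
    bucket_len = int(round(audio_duration_seconds * sample_rate))
    suffix_len = int(round(conditioning_seconds * sample_rate))
    if not 0 < suffix_len < bucket_len:
        raise ValueError("conditioning_seconds must be below audio_duration_seconds")
    if len(samples) < bucket_len:
        return None  # shorter clips are skipped
    clip = samples[:bucket_len]    # assumption: longer clips keep their start
    lead_in = clip[:-suffix_len]   # generation target
    suffix = clip[-suffix_len:]    # clean conditioning window
    return lead_in, suffix
```

For example, a 7-second clip at 16 kHz with the default settings would yield a 4-second lead-in (64000 samples) and a 1-second suffix (16000 samples) under these assumptions.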
Tips for Getting Good Results
Dataset Quality
- Use at least 10 clean clips containing the kind of backward continuation (lead-in) you want learned.
- Clips should be meaningfully longer than `conditioning_seconds`.
- Pick an `audio_duration_seconds` that fits most clips so few are skipped.
Caption Best Practices
- Describe the sound and lead-in plainly, optionally with a trigger phrase.
- Keep captions consistent across clips.
Good caption: `a drum fill building up to a cymbal crash`
Weak caption: `drums`
Trigger Phrases
- Use a distinctive trigger phrase to invoke a particular lead-in style; include it in every caption and at inference.
Inference Format Matching
At inference, supply a closing audio clip at least `conditioning_seconds` long, use the same caption style and trigger phrase, and keep durations in line with `audio_duration_seconds`.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "conditioning_seconds": 1.0,
  "audio_duration_seconds": 5.0,
  "validation": [
    {
      "prompt": "a drum fill building up",
      "audio_url": "https://example.com/closing.wav"
    }
  ]
}
```
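A configuration like this one can be assembled and sanity-checked in code before it is submitted. The sketch below is a client-side convenience, not part of the service: `build_request` is a hypothetical helper, and the checks only mirror the ranges listed in the parameter reference, so the server may enforce additional rules.

```python
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def build_request(training_data_url, **overrides):
    """Return request arguments using the documented defaults, after basic checks."""
    args = {
        "training_data_url": training_data_url,
        "trigger_phrase": "",
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "conditioning_seconds": 1.0,
        "audio_duration_seconds": 5.0,
        "validation": [],
    }
    args.update(overrides)

    errors = []
    if args["rank"] not in ALLOWED_RANKS:
        errors.append("rank must be one of 8, 16, 32, 64, 128")
    if not 100 <= args["number_of_steps"] <= 20000:
        errors.append("number_of_steps must be in 100-20000")
    if not 0.1 <= args["conditioning_seconds"] <= 30.0:
        errors.append("conditioning_seconds must be in 0.1-30.0")
    if not 0.5 <= args["audio_duration_seconds"] <= 60.0:
        errors.append("audio_duration_seconds must be in 0.5-60.0")
    if args["conditioning_seconds"] >= args["audio_duration_seconds"]:
        errors.append("conditioning_seconds must be below audio_duration_seconds")
    if len(args["validation"]) > 2:
        errors.append("validation accepts at most 2 entries")
    if any("audio_url" not in sample for sample in args["validation"]):
        errors.append("each validation sample needs an audio_url")
    if errors:
        raise ValueError("; ".join(errors))
    return args
```

The returned dictionary has the same shape as the JSON configuration and can be passed as the request arguments to whichever client is used to call the endpoint.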
Diagnosing Issues
- Lead-in does not flow into the suffix: add more representative clips; try a slightly longer `conditioning_seconds` for more closing context.
- Validation rejected for short clip: provide a clip at least `conditioning_seconds` long, or lower `conditioning_seconds`.
- Window covers the whole clip: reduce `conditioning_seconds` relative to `audio_duration_seconds`.
- Overfitting: fewer steps, lower `rank`, more clips.
Validation Prompt Tips
- Use a fresh closing clip to gauge generalization.
- Keep the prompt aligned with the lead-in you expect.
Common Pitfalls
- `conditioning_seconds` too close to (or above) `audio_duration_seconds`: nothing left to generate.
- `audio_duration_seconds` so long that most clips are skipped.
- Background noise polluting the learned lead-in.