LTX 2.3 Trainer (V2) - Text-to-Audio (Training) API on fal

LTX 2.3 Trainer — Text-to-Audio (`/t2a`)

Overview

The `/t2a` endpoint trains a LoRA for the LTX 2.3 model that generates audio from a text prompt — the audio counterpart of text-to-video. There is no conditioning asset; the model learns to produce a sound or style from your training clips and recreate it from text. Use it to teach a particular instrument, sound effect family, ambience, or audio style.

Key features:

Learns text → audio generation.
Audio-only — no video is processed or generated.
Fixed audio length bucket via `audio_duration_seconds`.
Optional trigger phrase to activate the learned sound on demand.
Validation previews generate audio from each prompt; the output preview is audio.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:

Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
Captions: a `.txt` with the same base name as each audio clip (optional but recommended).

File names must be unique across the archive. Aim for at least 10 clips.

Example layout:


clip01.wav   clip01.txt
clip02.mp3   clip02.txt
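
A layout like the one above can be checked locally before uploading. The sketch below is not the trainer's actual matching logic. It assumes captions are matched purely by base name and that "unique" means no two files share the same file name, regardless of folder. The function name `pair_clips` is hypothetical.

```python
from pathlib import PurePosixPath

# Audio extensions listed in the dataset format section.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def pair_clips(names):
    """Pair each audio file with its optional same-base-name .txt caption.

    `names` is a list of archive member names.
    Returns {audio_file_name: caption_file_name or None}.
    Raises ValueError on a duplicate file name (assumed meaning of "unique").
    """
    seen = set()
    captions = {}
    audio = []
    for name in names:
        path = PurePosixPath(name)
        if path.name in seen:
            raise ValueError(f"duplicate file name: {path.name}")
        seen.add(path.name)
        ext = path.suffix.lower()
        if ext in AUDIO_EXTS:
            audio.append(path)
        elif ext == ".txt":
            captions[path.stem] = path.name
    # A clip without a caption is allowed; captions are optional.
    return {p.name: captions.get(p.stem) for p in audio}
```

Clips that map to `None` have no caption file; since captions are recommended, that list is a useful to-do before zipping.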

At least one clip must fill the audio bucket. A clip shorter than `audio_duration_seconds` is skipped; if every clip is shorter, the request is rejected up front (HTTP 422). Lower `audio_duration_seconds` if your clips are short.
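
The skip rule can be previewed on your side before a request is submitted. This is a minimal sketch that assumes you have already measured each clip's length in seconds with a tool of your choice. The function name `preflight` and the error message are hypothetical, and the local `ValueError` only stands in for the server-side HTTP 422.

```python
def preflight(durations, audio_duration_seconds=5.0):
    """Split clips into usable and skipped, following the rule in the text.

    `durations` maps clip name -> measured length in seconds.
    A clip shorter than the bucket is skipped; if none remain, raise.
    """
    usable = [n for n, d in durations.items() if d >= audio_duration_seconds]
    skipped = [n for n, d in durations.items() if d < audio_duration_seconds]
    if not usable:
        # Mirrors the documented up-front rejection when every clip is too short.
        raise ValueError(
            "every clip is shorter than audio_duration_seconds; lower the bucket"
        )
    return usable, skipped
```

If `skipped` is long, lowering `audio_duration_seconds` is usually cheaper than re-recording.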

Input Parameters Reference

Dataset

`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of audio clips.

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference to activate the learned sound.
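
To see what a caption might look like once the trigger phrase is applied, here is a minimal sketch. The single-space separator and the whitespace trimming are assumptions, and `apply_trigger` is a hypothetical helper, not part of the API.

```python
def apply_trigger(caption: str, trigger_phrase: str = "") -> str:
    """Prepend the trigger phrase to a caption.

    Assumption: phrase and caption are joined by a single space.
    An empty trigger phrase (the default) leaves the caption unchanged.
    """
    caption = caption.strip()
    trigger = trigger_phrase.strip()
    if not trigger:
        return caption
    return f"{trigger} {caption}"
```

For example, `apply_trigger("soft rain on a tin roof", "f0leyvox")` yields `f0leyvox soft rain on a tin roof`, which is also the shape of prompt to use at inference.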

Training Parameters

`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity.

`number_of_steps`

Type: `integer` Default: `2000` (range `100`–`20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.
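
The documented constraints on these three parameters can be checked client-side before submitting. The sketch below uses only the values stated above. No range is documented for `learning_rate`, so the positivity check is an assumption, and `check_training_params` is a hypothetical helper.

```python
ALLOWED_RANKS = (8, 16, 32, 64, 128)


def check_training_params(rank=32, number_of_steps=2000, learning_rate=0.0002):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if rank not in ALLOWED_RANKS:
        errors.append(f"rank must be one of {ALLOWED_RANKS}, got {rank}")
    if not 100 <= number_of_steps <= 20000:
        errors.append(f"number_of_steps must be in 100-20000, got {number_of_steps}")
    # Assumption: the docs give no range for learning_rate; only require > 0.
    if learning_rate <= 0:
        errors.append(f"learning_rate must be positive, got {learning_rate}")
    return errors
```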

Audio Configuration

`audio_duration_seconds`

Type: `number` Default: `5.0` (range `0.5`–`60.0`)

Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness across the dataset.
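
For readers unfamiliar with the term, peak normalization scales a clip so that its largest absolute sample reaches a target level. The sketch below illustrates the general technique on a plain list of float samples. The target of `1.0` and the function name are assumptions, and the trainer's own implementation may differ.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    Silent or empty input is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization aligns peak levels rather than perceived loudness, so clips with very different dynamics can still sound uneven. Keeping recordings consistent at the source remains worthwhile.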

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to the target duration (instead of trimming/padding).

Validation

`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

`prompt` (`string`) — the text prompt to generate audio from.

`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Used together with the audio bucket to size the preview audio length.

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Note: this audio-only mode also accepts the shared video/validation-video fields (`number_of_frames`, `resolution`, etc.), but they have no effect — no video is processed. You can safely leave them at their defaults.

Outputs

`lora_file` — the trained LoRA weights (`.safetensors`).
`config_file` — JSON describing the trigger phrase and training type.
`audio` — a combined preview of the generated validation audio.
`debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
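
The billing rule is simple enough to state as a function. This sketch reduces "failed before training completes" to a single boolean flag, which is a simplification. `billable_units` is a hypothetical helper.

```python
def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """Units billed for a run, per the rule stated above."""
    if not succeeded:
        # Validation errors (HTTP 422) and dataset-download failures bill nothing.
        return 0
    return max(100, number_of_steps)
```

With the default `number_of_steps` of `2000`, a successful run is billed 2000 units.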

How the Training Works

Pipeline Overview

Preprocessing — the archive is extracted, audio matched to captions, and each clip fit to the `audio_duration_seconds` bucket.
Training — the LoRA trains for `number_of_steps`, learning to generate audio from text. Validation previews run at intervals.
Output — the LoRA, config, and combined audio preview are uploaded.

What Happens to Your Data

Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
File matching: each audio clip is paired with the `.txt` of the same base name.
Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (clips shorter than the bucket are skipped; longer ones trimmed; normalized and pitch-fit as configured).
Captions: the trigger phrase (if set) is prepended.
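
The extraction step above can be approximated locally to preview which archive members would be considered. This is a sketch under assumptions: it treats any path component named `__MACOSX` or starting with a dot as macOS metadata or a hidden file, which may not match the trainer's exact filter. `visible_members` is a hypothetical name.

```python
from pathlib import PurePosixPath


def visible_members(names):
    """Filter archive member names, dropping metadata and hidden entries."""
    kept = []
    for name in names:
        if name.endswith("/"):
            continue  # directory entry, not a file
        parts = PurePosixPath(name).parts
        if any(p == "__MACOSX" or p.startswith(".") for p in parts):
            continue  # macOS resource forks, .DS_Store, other dotfiles
        kept.append(name)
    return kept
```

The input list can come from `zipfile.ZipFile(path).namelist()` in the standard library.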

LoRA Training

A LoRA is a compact set of adapter weights trained on top of the frozen base model. Here it specializes the model in producing your sound or audio style from a text prompt. At inference you load the LoRA and prompt it the same way you captioned your data.

Tips for Getting Good Results

Dataset Quality

Use at least 10 clean clips representative of the sound/style you want.
Keep recordings consistent in level and free of unrelated noise.
Pick an `audio_duration_seconds` that fits most clips so few are skipped.

Caption Best Practices

Describe the sound plainly, optionally with a trigger phrase.
Keep captions consistent in style across clips.

Good caption: `f0leyvox a soft rain shower on a tin roof`
Weak caption: `rain`

Trigger Phrases

Use a distinctive trigger phrase for a specific sound/style; include it in every caption and at inference.
Skip the trigger phrase for an always-on audio style.

Inference Format Matching

Prompt the LoRA the way you captioned it, including the trigger phrase, and keep expected durations in line with `audio_duration_seconds`.

Recommended Starting Configuration

```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "f0leyvox",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "f0leyvox soft rain on a roof" }
  ]
}
```
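
One way to send this configuration is through the `fal_client` Python package. The sketch below is an assumption-laden example, not an official snippet. It uses the endpoint ID shown on this page, assumes `fal-client` is installed and a `FAL_KEY` environment variable is set, and the helper names `build_arguments` and `run_training` are hypothetical.

```python
import json

# Endpoint ID as displayed on this page.
ENDPOINT = "fal-ai/ltx23-trainer-v2/t2a"


def build_arguments(training_data_url, trigger_phrase="f0leyvox"):
    """Build the request arguments from the starting configuration above."""
    prompt = f"{trigger_phrase} soft rain on a roof".strip()
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 2000,
        "learning_rate": 0.0002,
        "audio_duration_seconds": 5.0,
        "validation": [{"prompt": prompt}],
    }


def run_training(arguments):
    """Submit the job and block until it finishes.

    Requires `pip install fal-client` and FAL_KEY in the environment.
    The import is deferred so build_arguments works without the package.
    """
    import fal_client

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)


if __name__ == "__main__":
    # Print the payload only; call run_training(...) to start a billed run.
    print(json.dumps(build_arguments("https://example.com/audio_clips.zip"), indent=2))
```

Keeping the payload builder separate from the network call makes it easy to inspect or validate the arguments before starting a run that will be billed.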

Diagnosing Issues

Generated audio is generic: add more representative clips; keep captions specific; verify the trigger phrase is used.
Many clips skipped: lower `audio_duration_seconds` to match your clips.
Overfitting: fewer steps, lower `rank`, more clips.

Validation Prompt Tips

Use prompts in the same style as your captions, including the trigger phrase.
Exercise the kinds of prompts you expect to use at inference.

Common Pitfalls

`audio_duration_seconds` set so long that most clips are skipped.
Background noise/music polluting the learned sound.
Forgetting the trigger phrase at inference.

Endpoint ID: `fal-ai/ltx23-trainer-v2/t2a`
