fal-ai/ltx23-trainer-v2/t2a
The cost of training depends on the number of steps. The formula is: 0.0022 * steps. With 1000 steps, your request will cost $2.20.
LTX 2.3 Trainer — Text-to-Audio (`/t2a`)
Overview
The `/t2a` endpoint trains a LoRA for the LTX 2.3 model that generates audio from a text prompt — the audio counterpart of text-to-video. There is no conditioning asset; the model learns to produce a sound or style from your training clips and recreate it from text. Use it to teach a particular instrument, sound effect family, ambience, or audio style.
Key features:
- Learns text → audio generation.
- Audio-only — no video is processed or generated.
- Fixed audio length bucket via `audio_duration_seconds`.
- Optional trigger phrase to activate the learned sound on demand.
- Validation previews generate audio from each prompt; the output preview is audio.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of plain audio clips:
- Audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.aac`, `.m4a`
- Captions: a `.txt` with the same base name as each audio clip (optional but recommended).
File names must be unique across the archive. Aim for at least 10 clips.
Example layout:
clip01.wav
clip01.txt
clip02.mp3
clip02.txt
At least one clip must fill the audio bucket. A clip shorter than `audio_duration_seconds` is skipped; if every clip is shorter, the request is rejected up front (HTTP 422). Lower `audio_duration_seconds` if your clips are short.
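As a minimal sketch of preparing such an archive, the helper below packs a local folder of clips and captions into a flat `.zip`. The function name `build_archive` and the rule that two clips may not share a base name are assumptions made for illustration (two clips with the same base name would compete for one caption file); the trainer's own checks may differ.

```python
import zipfile
from pathlib import Path

# Audio extensions listed in the dataset format above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def build_archive(src_dir, zip_path):
    """Pack audio clips and their matching .txt captions into a flat zip.

    Hypothetical helper, not part of any fal tooling.
    Returns (clips, uncaptioned): packed clip names and clips with no caption.
    """
    src = Path(src_dir)
    clips = sorted(
        p for p in src.iterdir()
        if p.is_file()
        and not p.name.startswith(".")  # skip hidden files and macOS "._" files
        and p.suffix.lower() in AUDIO_EXTS
    )

    # Assumption: clip01.wav and clip01.mp3 would both claim clip01.txt, so reject them.
    stems = [p.stem for p in clips]
    dupes = sorted({s for s in stems if stems.count(s) > 1})
    if dupes:
        raise ValueError(f"duplicate base names: {dupes}")

    uncaptioned = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for clip in clips:
            zf.write(clip, arcname=clip.name)  # arcname keeps the archive flat
            caption = clip.with_suffix(".txt")
            if caption.is_file():
                zf.write(caption, arcname=caption.name)
            else:
                uncaptioned.append(clip.name)
    return [c.name for c in clips], uncaptioned
```

Calling `build_archive("my_clips", "audio_clips.zip")` returns the packed clip names and the clips that lack a caption, which makes it easy to spot missing `.txt` files before uploading.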
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of audio clips.
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference to activate the learned sound.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
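LoRA rank, which sets adapter capacity. Higher values can capture more detail but are more prone to overfitting (see Diagnosing Issues).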
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
Audio Configuration
`audio_duration_seconds`
Type: `number`
Default: `5.0` (range `0.5`–`60.0`)
Target audio clip length in seconds (the audio duration bucket). Clips shorter than this are skipped; longer clips are trimmed.
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness across the dataset.
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to the target duration (instead of trimming/padding).
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt to generate audio from.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Used together with the audio bucket to size the preview audio length.
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Note: this audio-only mode also accepts the shared video/validation-video fields (`number_of_frames`, `resolution`, etc.), but they have no effect because no video is processed. You can safely leave them at their defaults.
Outputs
- `lora_file`: the trained LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `audio`: a combined preview of the generated validation audio.
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed for `max(100, number_of_steps)` billable units, so a run of fewer than 100 steps is still charged for 100. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
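The billing rule can be sketched as a small estimator. It assumes that one billable unit corresponds to one step priced at the $0.0022 given in the pricing formula for this endpoint; the function names are hypothetical, and `Decimal` is used only to avoid floating-point rounding in the dollar amount.

```python
from decimal import Decimal

PRICE_PER_UNIT = Decimal("0.0022")  # USD; assumes one billable unit == one step
MIN_BILLABLE_UNITS = 100            # floor from max(100, number_of_steps)


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run; failed requests are billed 0 units."""
    if not succeeded:
        return 0
    return max(MIN_BILLABLE_UNITS, number_of_steps)


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in USD under the assumptions stated above."""
    return PRICE_PER_UNIT * billable_units(number_of_steps, succeeded)


if __name__ == "__main__":
    for steps in (50, 1000, 2000):
        print(steps, estimated_cost_usd(steps))
```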
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, audio is matched to captions, and each clip is fit to the `audio_duration_seconds` bucket.
- Training: the LoRA trains for `number_of_steps`, learning to generate audio from text. Validation previews run at intervals.
- Output: the LoRA, config, and combined audio preview are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- File matching: each audio clip is paired with the `.txt` of the same base name.
- Audio fitting: each clip is fit to the `audio_duration_seconds` bucket (clips shorter than the bucket are skipped; longer ones are trimmed; normalization and pitch fitting are applied as configured).
- Captions: the trigger phrase (if set) is prepended.
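The matching and caption steps above can be approximated locally to preview what the trainer is likely to see. This is only a sketch: the function names, the single-space separator after the trigger phrase, and the treatment of uncaptioned clips (the caption becomes just the trigger phrase) are assumptions, not documented behavior.

```python
from pathlib import PurePosixPath

AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg", ".aac", ".m4a"}


def is_ignored(member_name):
    """True for macOS metadata (__MACOSX/...) and hidden files or folders."""
    parts = PurePosixPath(member_name).parts
    return any(part == "__MACOSX" or part.startswith(".") for part in parts)


def pair_clips(names, caption_text, trigger_phrase=""):
    """Pair each audio clip with the caption of the same base name.

    names: archive member names; caption_text: member name -> caption string.
    Returns a list of (clip_name, final_caption) tuples.
    """
    kept = [PurePosixPath(n) for n in names if not is_ignored(n)]
    captions = {
        p.stem: caption_text.get(str(p), "")
        for p in kept
        if p.suffix.lower() == ".txt"
    }
    pairs = []
    for p in kept:
        if p.suffix.lower() not in AUDIO_EXTS:
            continue
        caption = captions.get(p.stem, "").strip()
        if trigger_phrase:
            # Assumed separator: a single space between trigger and caption.
            caption = f"{trigger_phrase} {caption}".strip()
        pairs.append((p.name, caption))
    return pairs
```

Because the trigger phrase is prepended in this sketch, the caption files passed to it should not already start with the phrase.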
LoRA Training
A LoRA is a compact set of adapter weights trained on top of the frozen base model. Here it specializes the model in producing your sound or audio style from a text prompt. At inference you load the LoRA and prompt it the same way you captioned your data.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 clean clips representative of the sound/style you want.
- Keep recordings consistent in level and free of unrelated noise.
- Pick an `audio_duration_seconds` that fits most clips so few are skipped.
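One way to pick a bucket is to measure your clip durations and choose the longest value that still keeps most of them. The sketch below assumes you already have the durations in seconds; `suggest_bucket` and its `keep_fraction` parameter are hypothetical names, and the only rules taken from this page are that clips shorter than the bucket are skipped and that the allowed range is 0.5 to 60.0 seconds.

```python
import math


def kept_clips(durations, bucket_seconds):
    """Durations that survive: clips shorter than the bucket are skipped."""
    return [d for d in durations if d >= bucket_seconds]


def suggest_bucket(durations, keep_fraction=0.9, lo=0.5, hi=60.0):
    """Largest bucket in [lo, hi] that keeps at least keep_fraction of clips."""
    if not durations:
        raise ValueError("no clip durations given")
    ordered = sorted(durations, reverse=True)
    # Small epsilon guards against float noise such as 0.9 * 10 -> 9.000000000000002.
    need = max(1, math.ceil(keep_fraction * len(ordered) - 1e-9))
    # The need-th longest clip is the largest bucket that still keeps `need` clips.
    candidate = ordered[need - 1]
    if candidate < lo:
        raise ValueError("too many clips are shorter than the minimum bucket")
    return min(candidate, hi)
```

For example, with ten clips and `keep_fraction=0.5`, the function returns the duration of the fifth-longest clip, so exactly the five longest clips are kept and the rest are skipped.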
Caption Best Practices
- Describe the sound plainly, optionally with a trigger phrase.
- Keep captions consistent in style across clips.
Good caption: `f0leyvox a soft rain shower on a tin roof`
Weak caption: `rain`
Trigger Phrases
- Use a distinctive trigger phrase for a specific sound/style; include it in every caption and at inference.
- Skip the trigger phrase for an always-on audio style.
Inference Format Matching
Prompt the LoRA the way you captioned it, including the trigger phrase, and keep expected durations in line with `audio_duration_seconds`.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/audio_clips.zip",
  "trigger_phrase": "f0leyvox",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "audio_duration_seconds": 5.0,
  "validation": [
    { "prompt": "f0leyvox soft rain on a roof" }
  ]
}
```
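The configuration above can be assembled and range-checked in Python before submitting. This is a sketch under stated assumptions: `build_payload` and `submit` are hypothetical helpers, the range checks mirror the parameter reference on this page, and `submit` assumes the `fal-client` package is installed and a `FAL_KEY` is set in the environment.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/t2a"
ALLOWED_RANKS = {8, 16, 32, 64, 128}


def build_payload(training_data_url, *, trigger_phrase="", rank=32,
                  number_of_steps=2000, learning_rate=0.0002,
                  audio_duration_seconds=5.0, validation_prompts=()):
    """Build the request body, rejecting values outside the documented ranges."""
    if rank not in ALLOWED_RANKS:
        raise ValueError(f"rank must be one of {sorted(ALLOWED_RANKS)}")
    if not 100 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be between 100 and 20000")
    if not 0.5 <= audio_duration_seconds <= 60.0:
        raise ValueError("audio_duration_seconds must be between 0.5 and 60.0")
    if len(validation_prompts) > 2:
        raise ValueError("validation accepts at most 2 entries")
    return {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": rank,
        "number_of_steps": number_of_steps,
        "learning_rate": learning_rate,
        "audio_duration_seconds": audio_duration_seconds,
        "validation": [{"prompt": p} for p in validation_prompts],
    }


def submit(payload):
    """Send the training request and wait for the result."""
    # Imported lazily so build_payload works without the client installed.
    import fal_client
    return fal_client.subscribe(ENDPOINT, arguments=payload, with_logs=True)
```

A typical call would be `submit(build_payload("https://example.com/audio_clips.zip", trigger_phrase="f0leyvox", validation_prompts=["f0leyvox soft rain on a roof"]))`. Checking ranges locally surfaces mistakes before the request is sent, rather than as an HTTP 422 from the service.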
Diagnosing Issues
- Generated audio is generic: add more representative clips; keep captions specific; verify the trigger phrase is used.
- Many clips skipped: lower `audio_duration_seconds` to match your clips.
- Overfitting: fewer steps, lower `rank`, more clips.
Validation Prompt Tips
- Use prompts in the same style as your captions, including the trigger phrase.
- Exercise the kinds of prompts you expect to use at inference.
Common Pitfalls
- `audio_duration_seconds` set so long that most clips are skipped.
- Background noise/music polluting the learned sound.
- Forgetting the trigger phrase at inference.