fal-ai/stable-audio-3-trainer
Your request will cost $3.00 per 1000-step training run. It scales per step, so a 2000-step training run will cost $6.00. Training with batch_size > 1 bills additional units proportional to batch_size × clip duration.
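As a rough illustration of the per-step pricing above, the sketch below computes the base cost of a run. `estimate_base_cost` is a hypothetical helper, not part of any fal API. It only covers `batch_size` = 1, because the exact surcharge formula for larger batches is not given here.

```python
def estimate_base_cost(number_of_steps: int, price_per_1000_steps: float = 3.00) -> float:
    """Estimate the base cost in USD of a training run at batch_size = 1.

    Assumes cost scales linearly with the step count, as stated above.
    The extra units billed for batch_size > 1 are NOT modeled.
    """
    if number_of_steps < 1:
        raise ValueError("number_of_steps must be at least 1")
    return round(number_of_steps * price_per_1000_steps / 1000, 2)


print(estimate_base_cost(1000))  # 3.0
print(estimate_base_cost(2000))  # 6.0
```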
Stable Audio 3 LoRA Trainer
Fine-tune a Stable Audio 3 base model into a LoRA that adapts it to your own
style, instrument, or sound. The output is a small `.safetensors` file you load
on the Stable Audio 3 inference endpoints (the `/lora` variants).
Dataset
Provide a `.zip` archive (via `audio_data_url`) of audio clips, each paired
with a `.txt` caption that has the same basename:

    my_dataset.zip
    ├── clip1.wav
    ├── clip1.txt    # text prompt describing clip1
    ├── clip2.flac
    ├── clip2.txt
    └── ...

- Audio: WAV, FLAC, MP3, OGG, OPUS, or AIFF (WAV/FLAC recommended). 4–2000 clips.
- Captions: one non-empty UTF-8 `.txt` per clip. Describe the audio the way you would prompt the model, e.g. genre, instruments, mood, and BPM for music; source, action, and recording character for SFX.
- Mixed clip lengths are fine; clips longer than the model's max are cropped.
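The dataset rules above can be checked locally before uploading. The following is a minimal sketch, not the trainer's own validation: `check_dataset_zip` is a hypothetical helper, the extension list is an assumed mapping of the format names above (for example `.aiff` for AIFF), and it assumes the caption sits next to its clip inside the archive.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed file extensions for the listed formats.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg", ".opus", ".aiff"}
MIN_CLIPS, MAX_CLIPS = 4, 2000


def check_dataset_zip(zip_path: str) -> list[str]:
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep only files.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        name_set = set(names)
        audio = [n for n in names if PurePosixPath(n).suffix.lower() in AUDIO_EXTENSIONS]

        if not MIN_CLIPS <= len(audio) <= MAX_CLIPS:
            problems.append(f"expected {MIN_CLIPS}-{MAX_CLIPS} clips, found {len(audio)}")

        for clip in audio:
            # The caption must share the clip's basename.
            caption = str(PurePosixPath(clip).with_suffix(".txt"))
            if caption not in name_set:
                problems.append(f"{clip}: missing caption {caption}")
                continue
            try:
                text = archive.read(caption).decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"{caption}: not valid UTF-8")
                continue
            if not text.strip():
                problems.append(f"{caption}: caption is empty")
    return problems
```

Calling `check_dataset_zip("my_dataset.zip")` returns an empty list when every clip has a non-empty UTF-8 caption and the clip count is in range.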
Models (`model`)
| Model | Use for | Max clip length |
|---|---|---|
| `medium-base` (default) | music, stems, SFX | ~6 min (380 s) |
| `small-music-base` | music, stems | 2 min |
| `small-sfx-base` | sound effects | 2 min |
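To make the length limits concrete, the sketch below expresses the table as a lookup and derives a crop length from the longest clip, capped at the model's max. `auto_duration` is a hypothetical helper written for illustration; the trainer's actual auto-detection may differ in details such as rounding.

```python
# Max clip length per model in seconds, taken from the table above.
MODEL_MAX_SECONDS = {
    "medium-base": 380.0,
    "small-music-base": 120.0,
    "small-sfx-base": 120.0,
}


def auto_duration(model: str, clip_lengths_s: list[float]) -> float:
    """Return the longest clip length, capped at the model's max length."""
    if model not in MODEL_MAX_SECONDS:
        raise ValueError(f"unknown model: {model}")
    if not clip_lengths_s:
        raise ValueError("at least one clip length is required")
    return min(max(clip_lengths_s), MODEL_MAX_SECONDS[model])


print(auto_duration("medium-base", [45.0, 200.0]))         # 200.0
print(auto_duration("small-sfx-base", [3.5, 12.0, 150.0]))  # 120.0
```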
Key parameters
| Parameter | Default | Notes |
|---|---|---|
| `number_of_steps` | 1000 | 1–20000. ~500–2000 is typical; more = stronger adaptation. |
| `rank` | 16 | 1–256. Higher = more expressive and a larger file. `-xs` adapters require ≤ 17. |
| `learning_rate` | 1e-4 | AdamW learning rate for the LoRA. |
| `adapter_type` | `dora-rows` | One of `lora`, `dora`, `dora-rows`, `dora-cols`, `bora`, or their `-xs` variants (minimal parameters/VRAM). |
| `duration` | auto | Crop length in seconds. Auto-detected from the longest clip and capped at the model's max. |
| `batch_size` | 1 | 1–8, must be ≤ the number of clips. Values > 1 bill extra units proportional to batch size × clip duration. |
| `base_precision` | `bf16` | Precision of the frozen base weights (LoRA stays fp32). |
| `seed` | 42 | For reproducible runs. |
Advanced: `include` / `exclude` (limit which layers receive LoRA),
`lora_checkpoint_url` (resume from an existing `.safetensors`), and `pre_encode`
(pre-encode audio to latents before training).
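The documented ranges and defaults can be checked client-side before submitting a run. This is a minimal sketch under stated assumptions: `build_training_arguments` is a hypothetical helper, the `-xs` variants are assumed to be recognizable by a `-xs` suffix on `adapter_type`, and leaving `duration` out of the payload is assumed to trigger auto-detection.

```python
MODELS = ("medium-base", "small-music-base", "small-sfx-base")
XS_MAX_RANK = 17  # limit for `-xs` adapters as stated in the parameter table


def build_training_arguments(
    audio_data_url: str,
    num_clips: int,
    *,
    model: str = "medium-base",
    number_of_steps: int = 1000,
    rank: int = 16,
    learning_rate: float = 1e-4,
    adapter_type: str = "dora-rows",
    batch_size: int = 1,
    base_precision: str = "bf16",
    seed: int = 42,
) -> dict:
    """Validate the documented ranges and return a request payload dict."""
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if not 1 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be in 1-20000")
    if not 1 <= rank <= 256:
        raise ValueError("rank must be in 1-256")
    # Assumption: `-xs` variants are named with a `-xs` suffix.
    if adapter_type.endswith("-xs") and rank > XS_MAX_RANK:
        raise ValueError(f"-xs adapters require rank <= {XS_MAX_RANK}")
    if not 1 <= batch_size <= 8:
        raise ValueError("batch_size must be in 1-8")
    if batch_size > num_clips:
        raise ValueError("batch_size must not exceed the number of clips")
    # `duration` is omitted on purpose so it is auto-detected (assumption).
    return {
        "audio_data_url": audio_data_url,
        "model": model,
        "number_of_steps": number_of_steps,
        "rank": rank,
        "learning_rate": learning_rate,
        "adapter_type": adapter_type,
        "batch_size": batch_size,
        "base_precision": base_precision,
        "seed": seed,
    }
```

The returned dict is meant to be passed as the request input to the trainer endpoint. The advanced fields (`include`, `exclude`, `lora_checkpoint_url`, `pre_encode`) are left out because their value formats are not described here.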
Output
- `lora_file`: the trained `.safetensors` LoRA.
- `config_file`: JSON metadata for the run (model, steps, rank, etc.).
Load `lora_file` on the matching Stable Audio 3 inference endpoint's `/lora`
variant (e.g. `/medium/text-to-audio/lora`) to generate with your fine-tune.