Stable Audio 3 Trainer (Training) API on fal

Stable Audio 3 LoRA Trainer

Fine-tune a Stable Audio 3 base model into a LoRA that adapts it to your own style, instrument, or sound. The output is a small `.safetensors` file you load on the Stable Audio 3 inference endpoints (the `/lora` variants).

Dataset

Provide a `.zip` archive (via `audio_data_url`) of audio clips, each paired with a `.txt` caption that has the same basename:


my_dataset.zip
├── clip1.wav
├── clip1.txt        # text prompt describing clip1
├── clip2.flac
├── clip2.txt
└── ...

Audio: WAV, FLAC, MP3, OGG, OPUS, or AIFF (WAV/FLAC recommended). A dataset must contain 4–2000 clips.
Captions: one non-empty UTF-8 `.txt` per clip. Describe the audio the way you would prompt the model, e.g. genre, instruments, mood, and BPM for music; source, action, and recording character for SFX.
Mixed clip lengths are fine; clips longer than the model's maximum clip length are cropped to it.
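
The dataset rules above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the helper name `check_dataset` and the exact set of file extensions are assumptions made for illustration, not part of the trainer's API, and the trainer may apply additional checks of its own.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed file extensions for the audio formats listed above.
AUDIO_EXTS = {".wav", ".flac", ".mp3", ".ogg", ".opus", ".aiff"}


def check_dataset(zip_path):
    """Return a list of problems found in the archive; an empty list means it looks OK."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        files = [n for n in zf.namelist() if not n.endswith("/")]
        # Map "dir/clip1" -> "dir/clip1.txt" so captions can be looked up by basename.
        captions = {
            str(PurePosixPath(n).with_suffix("")): n
            for n in files
            if PurePosixPath(n).suffix.lower() == ".txt"
        }
        clips = [n for n in files if PurePosixPath(n).suffix.lower() in AUDIO_EXTS]
        if not 4 <= len(clips) <= 2000:
            problems.append(f"expected 4-2000 clips, found {len(clips)}")
        for clip in clips:
            caption = captions.get(str(PurePosixPath(clip).with_suffix("")))
            if caption is None:
                problems.append(f"{clip}: missing caption .txt")
                continue
            try:
                text = zf.read(caption).decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"{clip}: caption is not valid UTF-8")
                continue
            if not text.strip():
                problems.append(f"{clip}: caption is empty")
    return problems
```

Calling `check_dataset("my_dataset.zip")` and fixing every reported problem before upload avoids spending a training run on an archive with missing or empty captions.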

Models (`model`)

| Model | Use for | Max clip length |
| --- | --- | --- |
| `medium-base` (default) | music, stems, SFX | ~6 min (380 s) |
| `small-music-base` | music, stems | 2 min |
| `small-sfx-base` | sound effects | 2 min |
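
To see how the maximum clip length interacts with the automatic `duration` setting (longest clip, capped at the model's max), here is a small sketch. The function name `auto_duration` and the lookup table are illustrative assumptions; the per-model limits are taken from the table above, with 2 min written as 120 s.

```python
# Maximum clip length per base model, in seconds (from the table above).
MODEL_MAX_SECONDS = {
    "medium-base": 380,
    "small-music-base": 120,
    "small-sfx-base": 120,
}


def auto_duration(model, clip_seconds):
    """Crop length the trainer would pick automatically, as described in this page:
    the longest clip in the dataset, capped at the model's maximum."""
    return min(max(clip_seconds), MODEL_MAX_SECONDS[model])
```

For example, a dataset whose longest clip is 200 s would be cropped to 120 s on `small-sfx-base` but kept at 200 s on `medium-base`.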

Key parameters

| Parameter | Default | Notes |
| --- | --- | --- |
| `number_of_steps` | 1000 | 1–20000. ~500–2000 is typical; more = stronger adaptation. |
| `rank` | 16 | 1–256. Higher = more expressive and a larger file. `-xs` adapters require ≤ 17. |
| `learning_rate` | 1e-4 | AdamW learning rate for the LoRA. |
| `adapter_type` | `dora-rows` | One of `lora`, `dora`, `dora-rows`, `dora-cols`, `bora`, or their `-xs` variants (minimal parameters/VRAM). |
| `duration` | auto | Crop length in seconds. Auto-detected from the longest clip and capped at the model's max. |
| `batch_size` | 1 | 1–8, must be ≤ the number of clips. Values > 1 bill extra units proportional to batch size × clip duration. |
| `base_precision` | `bf16` | Precision of the frozen base weights (LoRA stays fp32). |
| `seed` | 42 | For reproducible runs. |

Advanced: `include` / `exclude` (limit which layers receive LoRA), `lora_checkpoint_url` (resume from an existing `.safetensors`), and `pre_encode` (pre-encode audio to latents before training).
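
As an illustration of how these parameters fit together, the sketch below assembles a request payload and checks the documented ranges before anything is sent. `build_arguments` and its `num_clips` argument are hypothetical helpers, not part of the trainer; the key names and defaults mirror the parameters listed above, and the server remains the final authority on validation.

```python
def build_arguments(audio_data_url, num_clips, *, model="medium-base",
                    number_of_steps=1000, rank=16, learning_rate=1e-4,
                    adapter_type="dora-rows", batch_size=1, seed=42):
    """Build the trainer input dict, enforcing the ranges documented above."""
    if not 1 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be in 1-20000")
    if not 1 <= rank <= 256:
        raise ValueError("rank must be in 1-256")
    # batch_size is limited to 1-8 and may not exceed the number of clips.
    if not 1 <= batch_size <= min(8, num_clips):
        raise ValueError("batch_size must be in 1-8 and <= number of clips")
    return {
        "audio_data_url": audio_data_url,
        "model": model,
        "number_of_steps": number_of_steps,
        "rank": rank,
        "learning_rate": learning_rate,
        "adapter_type": adapter_type,
        "batch_size": batch_size,
        "seed": seed,
        # "duration" is omitted on purpose so the trainer auto-detects it.
    }
```

Omitting optional keys such as `duration` leaves them at their documented defaults; pass only the values you intend to change.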

Output

`lora_file` — the trained `.safetensors` LoRA.
`config_file` — JSON metadata for the run (model, steps, rank, etc.).

Load `lora_file` on the matching Stable Audio 3 inference endpoint's `/lora` variant (e.g. `/medium/text-to-audio/lora`) to generate with your fine-tune.
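
A training run could be submitted from Python roughly as follows. This is a sketch under several assumptions: it uses the `fal_client` package (`pip install fal-client`, with `FAL_KEY` set in the environment), it assumes the endpoint ID shown on this page (`fal-ai/stable-audio-3-trainer`), and it assumes each output file is returned as an object with a `url` field, which is common for fal file outputs but not confirmed by this page. The `subscribe` parameter exists only so the function can be exercised without a network call.

```python
ENDPOINT = "fal-ai/stable-audio-3-trainer"  # endpoint ID as shown on this page


def train_lora(arguments, subscribe=None):
    """Run a training job and return the URLs of the two output files."""
    if subscribe is None:
        # Imported lazily so the sketch can be read and tested without the package.
        import fal_client
        subscribe = fal_client.subscribe
    # Blocks until the queued job finishes and returns the result as a dict.
    result = subscribe(ENDPOINT, arguments=arguments)
    return {
        # Assumption: each output file is an object with a "url" field.
        "lora_url": result["lora_file"]["url"],
        "config_url": result["config_file"]["url"],
    }
```

The returned `lora_url` is what you would then pass to the matching `/lora` inference endpoint; the name of that endpoint's input field is not given here, so check its schema before wiring the two together.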

fal-ai/stable-audio-3-trainer
