fal-ai/stable-audio-3-trainer

Stable Audio 3 LoRA Trainer fine-tunes Stable Audio 3 base models on paired audio-caption datasets, producing compact LoRA weights that adapt generation toward a custom music style, sound palette, or domain.

Training costs $3.00 per 1,000 steps and scales linearly with the step count, so a 2,000-step run costs $6.00. Runs with `batch_size` > 1 are billed additional units proportional to batch_size × clip duration.
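The per-step pricing above is simple arithmetic, and a small helper makes it easy to budget a run before submitting it. This is a minimal sketch: `estimate_base_cost` is a hypothetical name, and it covers only the step-based rate for `batch_size` = 1, because the exact surcharge formula for larger batches is not stated here.

```python
# Rate taken from the pricing note: $3.00 per 1000 training steps.
USD_PER_1000_STEPS = 3.00


def estimate_base_cost(number_of_steps: int) -> float:
    """Estimate the step-based cost in USD for a run with batch_size = 1.

    The surcharge for batch_size > 1 is not modelled, because its exact
    formula is not given in the text.
    """
    if number_of_steps < 1:
        raise ValueError("number_of_steps must be at least 1")
    return round(number_of_steps * USD_PER_1000_STEPS / 1000, 2)


print(estimate_base_cost(1000))  # 3.0
print(estimate_base_cost(2000))  # 6.0
```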

Stable Audio 3 LoRA Trainer

Fine-tune a Stable Audio 3 base model into a LoRA that adapts it to your own style, instrument, or sound. The output is a small `.safetensors` file you load on the Stable Audio 3 inference endpoints (the `/lora` variants).

Dataset

Provide a `.zip` archive (via `audio_data_url`) of audio clips, each paired with a `.txt` caption that has the same basename:

my_dataset.zip
├── clip1.wav
├── clip1.txt        # text prompt describing clip1
├── clip2.flac
├── clip2.txt
└── ...
  • Audio: WAV, FLAC, MP3, OGG, OPUS, or AIFF (WAV or FLAC recommended). A dataset must contain 4–2000 clips.
  • Captions: one non-empty UTF-8 `.txt` file per clip. Describe the audio the way you would prompt the model: for example, genre, instruments, mood, and BPM for music; source, action, and recording character for SFX.
  • Mixed clip lengths are fine; clips longer than the model's maximum clip length are cropped.
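
The dataset rules above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; `build_dataset_zip` is a hypothetical helper, and it assumes a flat folder of clips and that AIFF files use the `.aiff` extension. It does not decode the audio or measure clip lengths.

```python
import zipfile
from pathlib import Path

# Extensions for the formats listed in the dataset requirements.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg", ".opus", ".aiff"}
MIN_CLIPS, MAX_CLIPS = 4, 2000


def build_dataset_zip(source_dir: str, zip_path: str) -> int:
    """Validate clip/caption pairs, zip them, and return the clip count."""
    source = Path(source_dir)
    clips = sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    if not MIN_CLIPS <= len(clips) <= MAX_CLIPS:
        raise ValueError(f"expected {MIN_CLIPS}-{MAX_CLIPS} clips, found {len(clips)}")

    # Each clip needs its own caption, so basenames must be unique.
    stems = [p.stem for p in clips]
    if len(stems) != len(set(stems)):
        raise ValueError("two clips share the same basename")

    # Validate every caption before writing anything to disk.
    for clip in clips:
        caption = clip.with_suffix(".txt")
        if not caption.is_file():
            raise ValueError(f"missing caption for {clip.name}")
        # read_text raises UnicodeDecodeError if the caption is not UTF-8.
        if not caption.read_text(encoding="utf-8").strip():
            raise ValueError(f"empty caption for {clip.name}")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for clip in clips:
            archive.write(clip, arcname=clip.name)
            archive.write(clip.with_suffix(".txt"), arcname=clip.stem + ".txt")
    return len(clips)
```

The resulting archive still has to be hosted somewhere reachable so that its URL can be passed as `audio_data_url`.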

Models (`model`)

| Model | Use for | Max clip length |
| --- | --- | --- |
| `medium-base` (default) | music, stems, SFX | ~6 min (380 s) |
| `small-music-base` | music, stems | 2 min |
| `small-sfx-base` | sound effects | 2 min |
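
Because clips longer than a model's maximum are cropped, it can help to know in advance what crop length a dataset will end up with. The sketch below takes the longest clip and caps it at the model's limit, which is how the automatic `duration` setting is described; `effective_duration` and `MODEL_MAX_SECONDS` are hypothetical names, and the service's actual rounding may differ.

```python
# Maximum clip lengths in seconds, taken from the model table.
MODEL_MAX_SECONDS = {
    "medium-base": 380,
    "small-music-base": 120,
    "small-sfx-base": 120,
}


def effective_duration(model: str, clip_seconds: list[float]) -> float:
    """Return the longest clip length, capped at the model's maximum."""
    if model not in MODEL_MAX_SECONDS:
        raise ValueError(f"unknown model: {model}")
    if not clip_seconds:
        raise ValueError("clip_seconds must not be empty")
    return float(min(max(clip_seconds), MODEL_MAX_SECONDS[model]))


print(effective_duration("small-sfx-base", [4.0, 12.5, 150.0]))  # 120.0
print(effective_duration("medium-base", [30.0, 95.0]))           # 95.0
```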

Key parameters

| Parameter | Default | Notes |
| --- | --- | --- |
| `number_of_steps` | 1000 | 1–20000. ~500–2000 is typical; more = stronger adaptation. |
| `rank` | 16 | 1–256. Higher = more expressive and a larger file. `-xs` adapters require ≤ 17. |
| `learning_rate` | 1e-4 | AdamW learning rate for the LoRA. |
| `adapter_type` | `dora-rows` | One of `lora`, `dora`, `dora-rows`, `dora-cols`, `bora`, or their `-xs` variants (minimal parameters/VRAM). |
| `duration` | auto | Crop length in seconds. Auto-detected from the longest clip and capped at the model's max. |
| `batch_size` | 1 | 1–8, must be ≤ the number of clips. Values > 1 bill extra units proportional to batch size × clip duration. |
| `base_precision` | `bf16` | Precision of the frozen base weights (LoRA stays fp32). |
| `seed` | 42 | For reproducible runs. |
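
The ranges in the table can be checked client-side before a run is submitted. This is a rough sketch, not the service's own validation: `build_training_arguments` is a hypothetical helper, and it assumes the `-xs` variants are spelled by appending `-xs` to a base adapter name (for example `dora-rows-xs`). `duration` and `base_precision` are left out so that the defaults apply.

```python
ADAPTER_BASES = ("lora", "dora", "dora-rows", "dora-cols", "bora")
MODELS = ("medium-base", "small-music-base", "small-sfx-base")


def build_training_arguments(
    audio_data_url: str,
    num_clips: int,
    model: str = "medium-base",
    number_of_steps: int = 1000,
    rank: int = 16,
    learning_rate: float = 1e-4,
    adapter_type: str = "dora-rows",
    batch_size: int = 1,
    seed: int = 42,
) -> dict:
    """Check the documented ranges and return a request payload."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if not 1 <= number_of_steps <= 20000:
        raise ValueError("number_of_steps must be in 1-20000")
    if not 1 <= rank <= 256:
        raise ValueError("rank must be in 1-256")
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")

    # Assumption: "-xs" variants are named "<base>-xs".
    is_xs = adapter_type.endswith("-xs")
    base = adapter_type[:-3] if is_xs else adapter_type
    if base not in ADAPTER_BASES:
        raise ValueError(f"unknown adapter_type: {adapter_type}")
    if is_xs and rank > 17:
        raise ValueError("-xs adapters require rank <= 17")

    if not 1 <= batch_size <= 8:
        raise ValueError("batch_size must be in 1-8")
    if batch_size > num_clips:
        raise ValueError("batch_size must not exceed the number of clips")

    return {
        "audio_data_url": audio_data_url,
        "model": model,
        "number_of_steps": number_of_steps,
        "rank": rank,
        "learning_rate": learning_rate,
        "adapter_type": adapter_type,
        "batch_size": batch_size,
        "seed": seed,
    }
```

For example, `build_training_arguments("https://example.com/my_dataset.zip", num_clips=40, rank=32)` returns a payload with the remaining defaults filled in, where the URL is a placeholder.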

Advanced: `include` / `exclude` (limit which layers receive LoRA), `lora_checkpoint_url` (resume from an existing `.safetensors`), and `pre_encode` (pre-encode audio to latents before training).

Output

  • `lora_file` — the trained `.safetensors` LoRA.
  • `config_file` — JSON metadata for the run (model, steps, rank, etc.).

Load `lora_file` on the matching Stable Audio 3 inference endpoint's `/lora` variant (e.g. `/medium/text-to-audio/lora`) to generate with your fine-tune.
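
As a rough sketch of the whole flow, a run could be submitted with the `fal_client` Python package and the two outputs read from the result. Two things here are assumptions rather than documented facts: that each output is a file object with a `url` key, and that `arguments` uses the parameter names listed under Key parameters. `output_urls` and `run_training` are hypothetical helper names.

```python
def output_urls(result: dict) -> dict:
    """Pull the download URLs out of a training result."""
    # Assumption: each output is a file object with a "url" key.
    return {
        "lora_file": result["lora_file"]["url"],
        "config_file": result["config_file"]["url"],
    }


def run_training(arguments: dict) -> dict:
    """Submit a training run, wait for it to finish, and return the URLs."""
    # Imported here so output_urls works without the package installed.
    # Requires `pip install fal-client` and FAL_KEY set in the environment.
    import fal_client

    result = fal_client.subscribe(
        "fal-ai/stable-audio-3-trainer",
        arguments=arguments,
    )
    return output_urls(result)
```

The returned `lora_file` URL is what would then be supplied to a `/lora` inference endpoint; the name of the input field for it is not given here, so check that endpoint's schema.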