DiffRhythm: Lyrics to Song | Text to Audio

fal-ai/diffrhythm
DiffRhythm is a blazing fast model for transforming lyrics into full songs. It can generate a full song in under 30 seconds.
Inference
Commercial use

Your request will cost $0.01 per 10 seconds of generated audio. For $1 you can generate 1,000 seconds of music from lyrics.

DiffRhythm: Lyrics to Song | [text-to-audio]

DiffRhythm delivers full song generation from timestamped lyrics in under 30 seconds at $0.001 per second of audio. Trading maximum musical complexity for speed and cost efficiency, the model generates tracks of either 95 or 285 seconds, with reference audio conditioning and style control. Built for developers who need rapid music prototyping without per-generation costs spiraling into double digits.

Use Cases: Lyric Demo Creation | Video Background Music | Rapid Music Prototyping


Performance

At $0.01 per 10 seconds versus $0.05+ for alternatives, DiffRhythm delivers 5-10x cost efficiency for full-length music generation.

| Metric | Result | Context |
| --- | --- | --- |
| Generation Speed | Under 30 seconds | Full 95-285 second songs |
| Cost per 10 seconds | $0.01 | 1,000 seconds per $1.00 on fal |
| Track Duration | 95-285 seconds | Two duration modes: standard (95s) or extended (285s) |
| Reference Audio | Supported | Style transfer via URL input |
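
The pricing figures above reduce to simple arithmetic. The following is a minimal sketch, assuming billing is linear at $0.001 per second with no rounding up to whole 10-second blocks (the page does not say how partial blocks are billed). The helper names `estimate_cost` and `seconds_for_budget` are illustrative only and are not part of any fal API.

```python
from decimal import Decimal

# From the page: $0.01 per 10 seconds, i.e. $0.001 per second of audio.
# Assumption: cost scales linearly per second, with no block rounding.
PRICE_PER_SECOND = Decimal("0.001")


def estimate_cost(duration_seconds: int) -> Decimal:
    """Estimated cost in USD for a track of the given length."""
    return PRICE_PER_SECOND * duration_seconds


def seconds_for_budget(budget_usd: str) -> int:
    """Whole seconds of audio that fit in a USD budget."""
    return int(Decimal(budget_usd) / PRICE_PER_SECOND)


if __name__ == "__main__":
    for seconds in (95, 285):  # the two duration modes listed above
        print(f"{seconds}s track: ${estimate_cost(seconds)}")
    print(f"$1.00 buys {seconds_for_budget('1.00')} seconds")
```

Under these assumptions a standard 95-second track comes to about $0.095 and an extended 285-second track to about $0.285.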

Structured Music Control Without DAW Complexity

DiffRhythm processes timestamped lyrics with precise timing markers, each line tagged with exact second placement. This structured input format contrasts with freeform text-to-music models that interpret vague descriptions.

What this means for you:

  • Exact timing control: Input format `[00:10.00]Moonlight spills through broken blinds` ensures lyrics sync to specific timestamps, not algorithmic guesses

  • Reference audio conditioning: Supply existing tracks via URL to guide musical style, instrumentation, and genre characteristics

  • Configurable generation parameters: Adjust CFG strength (1-10 range), scheduler type (Euler/Midpoint/RK4/Implicit Adams), and inference steps (10-100) for quality-speed tradeoffs

  • Dual duration modes: Generate 95-second tracks for rapid iteration or 285-second extended versions for full song development
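
The lyric format and parameter ranges listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official client. The dictionary keys (`lyrics`, `duration_seconds`, `cfg_strength`, `scheduler`, `num_inference_steps`) and the lowercase scheduler identifiers are hypothetical, so check the API documentation for the real field names before sending anything. Only the `[mm:ss.xx]` line format and the numeric ranges come from the page.

```python
from typing import Dict, Iterable, Tuple, Union

# Scheduler names as listed on the page; the exact identifiers are assumed.
SCHEDULERS = {"euler", "midpoint", "rk4", "implicit_adams"}
DURATIONS_SECONDS = {95, 285}  # standard and extended modes


def format_timestamp(seconds: float) -> str:
    """Render seconds as [mm:ss.xx], e.g. 10.0 -> [00:10.00]."""
    total_centis = round(seconds * 100)
    minutes, rest = divmod(total_centis, 6000)
    secs, centis = divmod(rest, 100)
    return f"[{minutes:02d}:{secs:02d}.{centis:02d}]"


def build_lyrics(lines: Iterable[Tuple[float, str]]) -> str:
    """Join (start_second, text) pairs into one timestamped lyric sheet."""
    ordered = sorted(lines, key=lambda pair: pair[0])
    return "\n".join(f"{format_timestamp(t)}{text}" for t, text in ordered)


def build_arguments(
    lyrics: str,
    duration_seconds: int,
    cfg_strength: float,
    scheduler: str,
    steps: int,
) -> Dict[str, Union[str, int, float]]:
    """Validate against the ranges on the page and return a request dict.

    The key names below are hypothetical placeholders.
    """
    if duration_seconds not in DURATIONS_SECONDS:
        raise ValueError("duration must be 95 or 285 seconds")
    if not 1 <= cfg_strength <= 10:
        raise ValueError("CFG strength must be between 1 and 10")
    if scheduler not in SCHEDULERS:
        raise ValueError(f"unknown scheduler: {scheduler}")
    if not 10 <= steps <= 100:
        raise ValueError("inference steps must be between 10 and 100")
    return {
        "lyrics": lyrics,
        "duration_seconds": duration_seconds,
        "cfg_strength": cfg_strength,
        "scheduler": scheduler,
        "num_inference_steps": steps,
    }


if __name__ == "__main__":
    sheet = build_lyrics([(10.0, "Moonlight spills through broken blinds")])
    print(sheet)
    print(build_arguments(sheet, 95, 4.0, "euler", 32))
```

Validating on the client side catches out-of-range values before a billable request is made. The values in the usage lines (CFG 4.0, 32 steps) are arbitrary examples, not recommended defaults.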


Technical Specifications

| Spec | Details |
| --- | --- |
| Architecture | DiffRhythm |
| Input Formats | Timestamped lyrics (text), reference audio URL (mp3/wav/m4a/aac/ogg) |
| Output Formats | WAV audio (application/octet-stream) |
| Duration Options | 95 seconds, 285 seconds |
| License | Commercial use permitted |
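
Because reference audio is supplied as a URL in one of the listed formats, a quick client-side check can catch obvious mistakes before a request is sent. The sketch below assumes the format can be judged from the file extension in the URL path and that only `http`/`https` URLs are accepted. Both are assumptions, since the service may inspect the file contents instead. `is_supported_reference` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Reference audio formats listed in the specification table above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg"}


def is_supported_reference(url: str) -> bool:
    """Pre-check a reference audio URL by scheme and file extension.

    Assumption: the extension in the URL path reflects the audio format.
    """
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        return False
    suffix = PurePosixPath(parsed.path).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for candidate in (
        "https://example.com/refs/demo.mp3",
        "https://example.com/refs/demo.flac",
    ):
        print(candidate, is_supported_reference(candidate))
```

A URL that fails this check is not necessarily rejected by the service, so treat the result as a warning rather than a hard error.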


How It Stacks Up

Sonauto V2 Text to Audio – DiffRhythm ($0.001/sec) prioritizes timestamped lyric control and reference audio conditioning for structured song generation at 5x lower cost than Sonauto's freeform text-to-music approach. Sonauto V2 emphasizes natural language music descriptions without timestamp requirements for more exploratory creative workflows.