Kokoro TTS Text to Audio

fal-ai/kokoro/american-english
Kokoro is a lightweight text-to-speech model that delivers comparable quality to larger models while being significantly faster and more cost-efficient.

Kokoro TTS | [text-to-speech]

Kokoro TTS delivers natural-sounding speech synthesis with 82 million parameters at $0.02 per 1,000 characters. Trading model size for efficiency, it matches the quality of models 10-50x larger while running significantly faster and cheaper. Built for developers who need production-grade voice synthesis without enterprise-scale infrastructure costs.

Use Cases: Voice Agents & Assistants | Content Narration & Audiobooks | Multi-language Product Localization


Performance

Kokoro TTS achieves quality comparable to larger TTS models at a significantly lower cost per character, making high-quality speech synthesis practical for high-volume applications.

Metric | Result | Context
Model Size | 82M parameters | 10-50x smaller than comparable quality models
Cost per Inference | $0.02 per 1,000 characters | 50,000 characters per $1.00 on fal
Voice Options | 19 voices | 10 female (af_), 9 male (am_) variants
Speed Control | 0.1x to 5.0x | Adjustable playback rate for different use cases
Output Format | WAV audio | Standard format compatible with all platforms

Efficiency Without Compromise

Kokoro TTS uses a lightweight architecture that prioritizes parameter efficiency over raw model size. While most high-quality TTS models require hundreds of millions or billions of parameters, Kokoro achieves comparable naturalness and expressiveness with just 82 million parameters through optimized training and architectural choices.

What this means for you:

  • Lower latency: Smaller model size translates to faster inference times, critical for real-time voice applications and interactive agents

  • Cost efficiency: At $0.02 per 1,000 characters, you can generate 50,000 characters for $1.00, enabling high-volume use cases like audiobook production or large-scale content narration

  • Voice variety: 19 distinct voices (10 female, 9 male) provide flexibility for different brand voices, character work, or user preference matching without additional fine-tuning

  • Playback control: Speed adjustment from 0.1x to 5.0x enables use cases from meditative content (slower) to rapid information delivery (faster) using the same base synthesis
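The pricing arithmetic above can be sketched in a few lines. This is only an estimate under stated assumptions: the page does not say how characters are counted or whether charges are rounded, so the sketch counts Python string length and applies no rounding. The function name `estimate_cost_usd` is hypothetical.

```python
# Rate quoted on this page: $0.02 per 1,000 characters.
RATE_USD_PER_1K_CHARS = 0.02


def estimate_cost_usd(text: str, rate_per_1k: float = RATE_USD_PER_1K_CHARS) -> float:
    """Estimate the synthesis cost for `text`.

    Assumption: billing is proportional to the character count (len(text)),
    with no minimum charge and no rounding. Actual billing may differ.
    """
    return len(text) / 1000 * rate_per_1k


if __name__ == "__main__":
    chapter = "a" * 50_000  # stand-in for 50,000 characters of narration
    print(f"Estimated cost: ${estimate_cost_usd(chapter):.2f}")
```

A helper like this is mainly useful for budgeting batch jobs such as audiobook chapters before submitting them.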


Technical Specifications

Spec | Details
Architecture | Kokoro TTS
Input Formats | Text strings (UTF-8)
Output Formats | WAV audio file
Voice Selection | 19 pre-trained voices via voice parameter
License | Commercial use permitted
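As a rough illustration of how a request to this endpoint might be assembled, the sketch below validates the inputs against the limits stated on this page (speed between 0.1x and 5.0x, voice names prefixed `af_` or `am_`) before calling the `fal_client` Python package. Several details are assumptions, not confirmed by this page: the argument names `prompt` and `speed`, the example voice ID `af_heart`, and the shape of the returned result. Check the endpoint's API schema before relying on any of them.

```python
MODEL_ID = "fal-ai/kokoro/american-english"  # endpoint ID shown on this page
MIN_SPEED, MAX_SPEED = 0.1, 5.0              # speed range stated on this page
VOICE_PREFIXES = ("af_", "am_")              # female / male voice prefixes


def build_arguments(text: str, voice: str, speed: float = 1.0) -> dict:
    """Validate inputs and build the request arguments.

    Assumption: the endpoint accepts `prompt`, `voice`, and `speed` keys.
    Only the `voice` parameter is named on this page.
    """
    if not text:
        raise ValueError("text must be a non-empty string")
    if not voice.startswith(VOICE_PREFIXES):
        raise ValueError(f"voice should start with one of {VOICE_PREFIXES}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {"prompt": text, "voice": voice, "speed": speed}


def synthesize(text: str, voice: str, speed: float = 1.0):
    """Submit the request and return the raw result.

    Requires `pip install fal-client` and a FAL_KEY environment variable.
    The result is returned unparsed because its structure is not
    documented on this page.
    """
    import fal_client  # imported lazily so validation works without the package

    return fal_client.subscribe(
        MODEL_ID, arguments=build_arguments(text, voice, speed)
    )


if __name__ == "__main__":
    # "af_heart" is an assumed voice ID used for illustration only.
    print(build_arguments("Hello from Kokoro.", voice="af_heart", speed=1.2))
```

Validating locally keeps malformed requests from being sent at all, which matters when requests are generated in bulk.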



How It Stacks Up

ElevenLabs Text to Audio – Kokoro TTS ($0.02/1K chars) prioritizes cost efficiency and lightweight deployment for high-volume applications. ElevenLabs offers more advanced voice cloning and emotional range control, which suits premium voice experiences where budget is less constrained.

Kokoro TTS (Mandarin Chinese) – A Mandarin-specific version of the model. The American English variant handles English phonetics and prosody, while the Mandarin version handles tonal variation and character-based input for Chinese-language applications.