Model Gallery
Featured Models
Check out some of our most popular models
MMAudio generates synchronized audio given video and/or text inputs. It can be combined with video models to get videos with audio.
Generate natural-sounding multi-speaker dialogues and audio. Perfect for expressive storytelling, games, animations, and interactive media.
CassetteAI’s model generates a 30-second sample in under 2 seconds and a full 3-minute track in under 10 seconds. At 44.1 kHz stereo audio, expect a level of professional consistency with no breaks, no squeaks, and no random interruptions in your creations.
Search Results
40 models found
Open source text-to-audio model.
Kling LipSync is an audio-to-video model that generates realistic lip movements from audio input.
Clone any person's voice and speak anything in it using Zonos voice cloning.
Clone dialogue voices from a sample audio clip and generate dialogues from text prompts using Dia TTS, which leverages advanced AI techniques to create high-quality text-to-speech.
Generate realistic lipsync animations from audio using advanced algorithms for high-quality synchronization with the Sync Lipsync 2.0 model.
Clone a voice from a sample audio clip and generate speech from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality text-to-speech.
Blazing-fast text-to-speech. Generate audio with improved emotional tones and extensive multilingual support. Ideal for high-volume processing and efficient workflows.
Dia directly generates realistic dialogue from transcripts. Audio conditioning enables emotion control. Produces natural nonverbals like laughter and throat clearing.
An open-source, community-driven, native audio turn-detection model by Pipecat AI.
Generate high-speed text-to-speech audio using ElevenLabs TTS Turbo v2.5.
Generate multilingual text-to-speech audio using ElevenLabs TTS Multilingual v2.
Get encoding metadata from video and audio files using FFmpeg API.
Get waveform data from audio files using FFmpeg API.
Generate realistic lipsync animations from audio using advanced algorithms for high-quality synchronization.
Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
MuseTalk is a real-time, high-quality audio-driven lip-syncing model. Use MuseTalk to animate a face with your own audio.
LatentSync is a video-to-video model that generates lip sync animations from audio using advanced algorithms for high-quality synchronization.
Automatically generates text captions for your videos from the audio, with configurable text colour and font.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs.
Create stunningly realistic sound effects in seconds. CassetteAI's Sound Effects Model generates high-quality SFX up to 30 seconds long in just 1 second of processing time.
DiffRhythm is a blazing fast model for transforming lyrics into full songs. It boasts the capability to generate full songs in less than 30 seconds.
Generate sound effects using ElevenLabs' advanced sound effects model.
A highly efficient Mandarin Chinese text-to-speech model that captures natural tones and prosody.
A fast and expressive Hindi text-to-speech model with clear pronunciation and accurate intonation.
Kokoro is a lightweight text-to-speech model that delivers comparable quality to larger models while being significantly faster and more cost-efficient.
A natural and expressive Brazilian Portuguese text-to-speech model optimized for clarity and fluency.
A fast and natural-sounding Japanese text-to-speech model optimized for smooth pronunciation.
A high-quality Italian text-to-speech model delivering smooth and expressive speech synthesis.
An expressive and natural French text-to-speech model for both European and Canadian French.
A natural-sounding Spanish text-to-speech model optimized for Latin American and European Spanish.
A high-quality British English text-to-speech model offering natural and expressive voice synthesis.
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs.
Generate music from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality, diverse musical compositions.
F5 TTS