Model Gallery
Search Results
31 models found
LatentSync is a video-to-video model that generates lip sync animations from audio using advanced algorithms for high-quality synchronization.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs.
DiffRhythm is a blazing-fast model for transforming lyrics into full songs, generating a complete track in under 30 seconds.
Generate multilingual text-to-speech audio using ElevenLabs TTS Multilingual v2.
Generate sound effects using ElevenLabs advanced sound effects model.
Isolate audio tracks using ElevenLabs advanced audio isolation technology.
Generate high-speed text-to-speech audio using ElevenLabs TTS Turbo v2.5.
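The four ElevenLabs entries above follow the same call pattern: send text (or audio, for isolation) and receive a URL to the generated file. A minimal sketch, assuming these models are exposed through a fal.ai-style queue API via the fal_client Python package (the hosting platform is not named in this listing); the endpoint ID and argument names are illustrative assumptions.

```python
# Minimal sketch of calling a hosted TTS endpoint via fal_client.
# Endpoint ID and argument names are assumptions for illustration only.
import fal_client

result = fal_client.subscribe(
    "fal-ai/elevenlabs/tts/multilingual-v2",      # assumed endpoint ID
    arguments={
        "text": "Hello from the model gallery.",  # assumed parameter name
        "voice": "Rachel",                        # assumed parameter name
    },
)
print(result)  # typically a dict containing a URL to the generated audio
```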
An expressive and natural French text-to-speech model for both European and Canadian French.
A fast and expressive Hindi text-to-speech model with clear pronunciation and accurate intonation.
A highly efficient Mandarin Chinese text-to-speech model that captures natural tones and prosody.
A high-quality British English text-to-speech model offering natural and expressive voice synthesis.
A high-quality Italian text-to-speech model delivering smooth and expressive speech synthesis.
A natural-sounding Spanish text-to-speech model optimized for Latin American and European Spanish.
Kokoro is a lightweight text-to-speech model that delivers comparable quality to larger models while being significantly faster and more cost-efficient.
A natural and expressive Brazilian Portuguese text-to-speech model optimized for clarity and fluency.
Clone any person's voice and generate speech in that voice using Zonos voice cloning.
A fast and natural-sounding Japanese text-to-speech model optimized for smooth pronunciation.
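The language-specific TTS entries and the Zonos voice-cloning entry above differ mainly in their inputs: plain text for the former, text plus a short reference recording for the latter. A hedged sketch of the voice-cloning case, under the same fal_client assumption; the endpoint ID and argument names are guesses for illustration.

```python
# Sketch: voice cloning from a short reference recording.
# Endpoint ID and argument names are illustrative assumptions.
import fal_client

# Upload a local reference clip and get back a hosted URL for the API to read.
reference_url = fal_client.upload_file("my_voice_sample.wav")

result = fal_client.subscribe(
    "fal-ai/zonos",  # assumed endpoint ID
    arguments={
        "prompt": "Any text, spoken in the cloned voice.",  # assumed name
        "reference_audio_url": reference_url,               # assumed name
    },
)
print(result)
```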
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs.
Get waveform data from audio files using the FFmpeg API.
Get encoding metadata from video and audio files using the FFmpeg API.
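The two FFmpeg utility endpoints are handy as pre-processing steps for the generation models in this gallery, e.g. checking a file's duration and sample rate before sending it to a lip-sync or audio-to-audio model. A sketch under the same fal_client assumption; the endpoint IDs and the media_url argument name are illustrative.

```python
# Sketch: probing an audio file before passing it to a generation model.
# Endpoint IDs and argument names are illustrative assumptions.
import fal_client

AUDIO_URL = "https://example.com/speech.mp3"  # placeholder input

metadata = fal_client.subscribe(
    "fal-ai/ffmpeg-api/metadata",             # assumed endpoint ID
    arguments={"media_url": AUDIO_URL},       # assumed parameter name
)
waveform = fal_client.subscribe(
    "fal-ai/ffmpeg-api/waveform",             # assumed endpoint ID
    arguments={"media_url": AUDIO_URL},
)
print(metadata)  # codec, duration, sample rate, ...
print(waveform)  # amplitude samples suitable for drawing a waveform
```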
Generate realistic lip sync animations from audio using advanced algorithms for high-quality synchronization.
Automatically generates text captions for your videos from the audio, with configurable text colour and font.
MMAudio generates synchronized audio given text inputs. It can generate sounds described by a prompt.
Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Blazing-fast text-to-speech. Generate audio with improved emotional tones and extensive multilingual support. Ideal for high-volume processing and efficient workflows.
Generate music from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality, diverse musical compositions.
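The lyrics-to-song models in this listing (DiffRhythm, YuE, MiniMax Music) share a similar interface: lyrics plus an optional style prompt in, a finished song out. A hedged sketch, again assuming a fal_client-style API; the endpoint ID and argument names are assumptions.

```python
# Sketch: generating a song from lyrics and a style prompt.
# Endpoint ID and argument names are illustrative assumptions.
import fal_client

result = fal_client.subscribe(
    "fal-ai/diffrhythm",  # assumed endpoint ID
    arguments={
        "lyrics": "[verse]\nNeon rain on empty streets...",  # assumed name
        "style_prompt": "melancholic synth-pop, 90 bpm",     # assumed name
    },
    with_logs=True,
    # Optionally watch queue progress while the song renders.
    on_queue_update=lambda update: print("queue update:", update),
)
print(result)  # typically a URL to the rendered song
```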
MMAudio generates synchronized audio given video and/or text inputs. It can be combined with video models to get videos with audio.
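Both MMAudio entries take a text prompt; this second one also accepts a video so the generated audio is synchronized to it. A sketch with the endpoint ID and argument names assumed for illustration.

```python
# Sketch: adding synchronized audio to a silent video clip with MMAudio.
# Endpoint ID and argument names are illustrative assumptions.
import fal_client

result = fal_client.subscribe(
    "fal-ai/mmaudio-v2",  # assumed endpoint ID
    arguments={
        "video_url": "https://example.com/silent_clip.mp4",  # assumed name
        "prompt": "waves crashing on a rocky shore",          # assumed name
    },
)
print(result)  # typically a URL to the video with generated audio
```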
F5 TTS is a text-to-speech model that generates natural-sounding speech from text and can clone a voice from a short reference audio sample.
Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
MuseTalk is a real-time high quality audio-driven lip-syncing model. Use MuseTalk to animate a face with your own audio.
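The lip-sync models in this listing (LatentSync, SadTalker, MuseTalk) follow the same pattern: a face video or portrait image plus a driving audio track in, an animated talking-head video out. A final sketch under the same fal_client assumption; the endpoint ID and argument names are illustrative.

```python
# Sketch: driving a face video with a speech track (lip sync).
# Endpoint ID and argument names are illustrative assumptions.
import fal_client

result = fal_client.subscribe(
    "fal-ai/musetalk",  # assumed endpoint ID
    arguments={
        "video_url": "https://example.com/face.mp4",    # assumed name
        "audio_url": "https://example.com/speech.wav",  # assumed name
    },
)
print(result)  # typically a URL to the lip-synced video
```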
Open source text-to-audio model.