Model Gallery
Featured Models
Check out some of our most popular models
Generate video clips from your images using Kling 2.0 Master
Wan Effects is a model that generates high-quality videos with popular effects from images
Veo 2 creates videos from images with realistic motion and very high-quality output.
Wan-2.1 Pro is a premium image-to-video model that generates high-quality 1080p videos at 30fps with up to 6 seconds duration, delivering exceptional visual quality and motion diversity from images
Wan-2.1 is an image-to-video model that generates videos with high visual quality and motion diversity from images
Generate video clips from your images using Kling 1.6 (pro)
Train styles, people and other subjects at blazing speeds.
Generate natural-sounding multi-speaker dialogue and audio. Perfect for expressive outputs, storytelling, games, animations, and interactive media.
FLUX1.1 [pro] ultra is the newest version of FLUX1.1 [pro], maintaining professional-grade image quality while delivering up to 2K resolution with improved photo realism.
All Models
Explore all available models provided by fal.ai
Recraft V3 is a text-to-image model with the ability to generate long texts, vector art, images in brand style, and much more. As of this writing, it is state of the art (SOTA) in image generation according to the industry-leading Text-to-Image Benchmark by Artificial Analysis on Hugging Face.
Generate video clips from your images using MiniMax Video model
Generate video clips from your prompts using Kling 2.0 Master
HiDream-I1 full is a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds.
HiDream-I1 dev is a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds.
HiDream-I1 fast is a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within 16 steps.
Add sound effects to your videos
FLUX.1 [dev] is a 12 billion parameter flow transformer that generates high-quality images from text. It is suitable for personal and commercial use.
CassetteAI’s model generates a 30-second sample in under 2 seconds and a full 3-minute track in under 10 seconds. At 44.1 kHz stereo audio, expect a level of professional consistency with no breaks, no squeaks, and no random interruptions in your creations.
MMAudio generates synchronized audio given video and/or text inputs. It can be combined with video models to get videos with audio.
Generate high-quality images, posters, and logos with Ideogram V2. Features exceptional typography handling and realistic outputs optimized for commercial and creative use.
FLUX LoRA training optimized for portrait generation, with bright highlights, excellent prompt following and highly detailed results.
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.
Super fast endpoint for the FLUX.1 [dev] inpainting model with LoRA support, enabling rapid and high-quality image inpainting using pre-trained LoRA adaptations for personalization, specific styles, brand identities, and product-specific outputs.
A versatile endpoint for the FLUX.1 [dev] model that supports multiple AI extensions including LoRA, ControlNet conditioning, and IP-Adapter integration, enabling comprehensive control over image generation through various guidance methods.
Super fast endpoint for the FLUX.1 [dev] model with LoRA support, enabling rapid and high-quality image generation using pre-trained LoRA adaptations for personalization, specific styles, brand identities, and product-specific outputs.
FLUX.1 Image-to-Image is a high-performance endpoint for the FLUX.1 [dev] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
Upscale your images with AuraSR.
Clarity upscaler for upscaling images with very high fidelity.
Framepack is an efficient image-to-video model that generates videos autoregressively.
Transform images into 3D cartoon artwork using an AI model that applies cartoon stylization while preserving the original image's composition and details.
Vace is a video generation model that uses a source image, mask, and video to create prompted videos with controllable sources.
Finegrain Eraser removes any object selected with a mask—along with its shadows, reflections, and lighting artifacts—seamlessly reconstructing the scene with contextually accurate content.
Finegrain Eraser removes any object selected with a bounding box—along with its shadows, reflections, and lighting artifacts—seamlessly reconstructing the scene with contextually accurate content.
Finegrain Eraser removes objects—along with their shadows, reflections, and lighting artifacts—using only natural language, seamlessly filling the scene with contextually accurate content.
Leverage the rapid processing capabilities of AI models to enable accurate and efficient real-time speech-to-text transcription.
Create stunningly realistic sound effects in seconds: CassetteAI's Sound Effects Model generates high-quality SFX up to 30 seconds long in just 1 second of processing time.
Generate realistic lipsync animations from audio with the Sync Lipsync 2.0 model, which uses advanced algorithms for high-quality synchronization.
AI vectorization model that transforms raster images into scalable SVG graphics, preserving visual details while enabling infinite scaling and easy editing capabilities.
Generate high-quality video clips quickly from text and image prompts using PixVerse v4
Generate high-quality video clips from text and image prompts using PixVerse v4
Generate high-quality video clips with different effects using PixVerse v3.5
Create seamless transitions between images using PixVerse v3.5
Generate high-quality video clips quickly from text and image prompts using PixVerse v4 Fast
Reimagine and transform your ordinary photos into enchanting Studio Ghibli-style artwork
Clone a voice from a sample audio and generate speech from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality text-to-speech.
Quickly generate speech from text prompts in a variety of voices using the MiniMax Speech-02 Turbo model, which leverages advanced AI techniques to create high-quality text-to-speech.
Generate speech from text prompts in a variety of voices using the MiniMax Speech-02 HD model, which leverages advanced AI techniques to create high-quality text-to-speech.
Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been fine-tuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time performance.
Sana Sprint is a text-to-image model capable of generating 4K images with exceptional speed.
Sana v1.5 1.6B is a lightweight text-to-image model that delivers 4K image generation with impressive efficiency.
Sana v1.5 4.8B is a powerful text-to-image model that generates ultra-high quality 4K images with remarkable detail.
Kling LipSync is an audio-to-video model that generates realistic lip movements from audio input.
Kling LipSync is a text-to-video model that generates realistic lip movements from text input.
LatentSync is a video-to-video model that generates lip sync animations from audio using advanced algorithms for high-quality synchronization.
Add custom LoRAs to Wan-2.1, a video generation model that delivers high visual quality and motion diversity.
Train custom LoRAs for Wan-2.1
Fix low-resolution images quickly and with high quality using Thera.
An advanced dehaze model to remove atmospheric haze, restoring clarity and detail in images through intelligent neural network processing.
Gemini Flash Edit is a model that can edit a single image using a text prompt and a reference image.
Generate 3D models from your images using Hunyuan 3D. A native 3D generative model enabling versatile and high-quality 3D asset creation.
Gemini Flash Edit Multi Image is a model that can edit multiple images using a text prompt and a reference image.
Ray2 Flash is a fast video generative model capable of creating realistic visuals with natural, coherent motion.
Pika Effects are AI-powered video effects designed to modify objects, characters, and environments in a fun, engaging, and visually compelling manner.
Pika v2.1 creates videos from a text prompt with high-quality output.
Pika v2 Turbo creates videos from images with high-quality output.
Pika v2.1 creates videos from images with high-quality output.
Pika v2.2 creates videos from images with high-quality output.
Pika v2.2 creates videos from a text prompt with high-quality output.
Pika Scenes v2.2 creates videos from images with high-quality output.
Pika v2 Turbo creates videos from a text prompt with high-quality output.
Invisible Watermark is a model that can add an invisible watermark to an image.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs.
Vidu Reference to Video creates videos by combining reference images with a prompt.
Vidu Start-End to Video generates smooth transition videos between specified start and end images.
Vidu Template to Video lets you create different effects by applying motion templates to your images.
Vidu Image to Video generates high-quality videos with exceptional visual quality and motion diversity from a single image
Wan-2.1 Pro is a premium text-to-video model that generates high-quality 1080p videos at 30fps with up to 6 seconds duration, delivering exceptional visual quality and motion diversity from text prompts
Swap faces of one or two people at once, while preserving user and scene details!
Generate video clips from your prompts using Kling 1.5 (pro)
Generate video clips from your prompts using Kling 1.6 (pro)
Generate video clips from your prompts using Kling 1.6 (std)
Generate video clips from your prompts using Kling 1.0
Generate high-quality images from text prompts using MiniMax. Longer text prompts will result in better-quality images.
Image-to-video endpoint for the high-quality Hunyuan Video I2V model.
Juggernaut Base Flux LoRA by RunDiffusion is a drop-in replacement for Flux [Dev] that delivers sharper details, richer colors, and enhanced realism to all your LoRAs and LyCORIS with full compatibility.
Juggernaut Pro Flux by RunDiffusion is the flagship Juggernaut model rivaling some of the most advanced image models available, often surpassing them in realism. It combines Juggernaut Base with RunDiffusion Photo and features enhancements like reduced background blurriness.
Generate videos from prompts and images using LTX Video-0.9.5
Juggernaut Lightning Flux by RunDiffusion provides blazing-fast, high-quality images rendered at five times the speed of Flux. Perfect for mood boards and mass ideation, this model excels in both realism and prompt adherence.
Generate videos from prompts using LTX Video-0.9.5
RunDiffusion Photo Flux provides striking realism. With this enhancer, textures and skin details burst to life, turning your favorite prompts into vivid, lifelike creations. A weight of 0.65 to 0.80 is recommended. Supports resolutions up to 1536x1536.
Generate videos from prompts, images, and videos using LTX Video-0.9.5
Juggernaut Base Flux by RunDiffusion is a drop-in replacement for Flux [Dev] that delivers sharper details, richer colors, and enhanced realism, while instantly boosting LoRAs and LyCORIS with full compatibility.
Generate videos from prompts and videos using LTX Video-0.9.5
DiffRhythm is a blazing-fast model for transforming lyrics into full songs. It can generate a full song in under 30 seconds.
Generate high-quality images from text prompts using CogView4. Longer text prompts will result in better-quality images.
Professional-grade video upscaling using Topaz technology. Enhance your videos with high-quality upscaling.
Eye Correct is a video-to-video model that corrects eye (gaze) direction in videos.
Enhance low-resolution, blurred, or shadowed documents with the superior quality of DocRes for sharper, clearer results.
Enhance warped or folded documents with the superior quality of DocRes for sharper, clearer results.
Enhance low-resolution images with the superior quality of Swin2SR for sharper, clearer results.
Rapidly create image variations with Ideogram V2A Turbo Remix. Fast and efficient reimagining of existing images while maintaining creative control through prompt guidance.
Wan-2.1 1.3B is a text-to-video model that generates videos with high visual quality and motion diversity from text prompts at faster speeds.
Generate high-quality images, posters, and logos with Ideogram V2A. Features exceptional typography handling and realistic outputs optimized for commercial and creative use.
Generate high-speed text-to-speech audio using ElevenLabs TTS Turbo v2.5.
Generate multilingual text-to-speech audio using ElevenLabs TTS Multilingual v2.
Generate sound effects using ElevenLabs advanced sound effects model.
Accelerated image generation with Ideogram V2A Turbo. Create high-quality visuals, posters, and logos with enhanced speed while maintaining Ideogram's signature quality.
Isolate audio tracks using ElevenLabs advanced audio isolation technology.
Generate text from speech using ElevenLabs advanced speech-to-text model.
Create variations of existing images with Ideogram V2A Remix while maintaining creative control through prompt guidance.
Bring colors into old or new black and white photos with DDColor.
EVF-SAM2 combines natural language understanding with advanced segmentation capabilities, allowing you to precisely mask image regions using intuitive positive and negative text prompts.
SAM 2 is a model for segmenting images automatically. It can return individual masks or a single mask for the entire image.
Wan-2.1 is a text-to-video model that generates videos with high visual quality and motion diversity from text prompts
Generate video prompts using a variety of techniques including camera direction, style, pacing, special effects and more.
Upscale your images with DRCT-Super-Resolution.
Generate video clips that more accurately follow the initial image and natural language descriptions, using camera movement instructions for shot control.
Veo 2 creates videos with realistic motion and high quality output. Explore different styles and find your own with extensive camera controls.
Use NAFNet to fix issues like blurriness and noise in your images. This model specializes in image restoration and can help enhance the overall quality of your photography.
SkyReels V1 is the first and most advanced open-source human-centric video foundation model, built by fine-tuning HunyuanVideo on O(10M) high-quality film and television clips.
Post Processing is an endpoint that can enhance images using a variety of techniques including grain, blur, sharpen, and more.
Step-Video is a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
A high-quality British English text-to-speech model offering natural and expressive voice synthesis.
A natural and expressive Brazilian Portuguese text-to-speech model optimized for clarity and fluency.
A high-quality Italian text-to-speech model delivering smooth and expressive speech synthesis.
This model provides high-quality image editing capabilities.
Kokoro is a lightweight text-to-speech model that delivers comparable quality to larger models while being significantly faster and more cost-efficient.
A natural-sounding Spanish text-to-speech model optimized for Latin American and European Spanish.
Ray2 is a large-scale video generative model capable of creating realistic visuals with natural, coherent motion.
A fast and expressive Hindi text-to-speech model with clear pronunciation and accurate intonation.
Clone any person's voice and make it say anything using Zonos voice cloning.
A highly efficient Mandarin Chinese text-to-speech model that captures natural tones and prosody.
An expressive and natural French text-to-speech model for both European and Canadian French.
A fast and natural-sounding Japanese text-to-speech model optimized for smooth pronunciation.
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music.
FLUX Control LoRA Canny is a high-performance endpoint that uses a Canny edge map as a control image to transfer structure to the generated image, and an initial image to guide color.
FLUX Control LoRA Depth is a high-performance endpoint that uses a control image to transfer structure to the generated image, using a depth map.
FLUX Control LoRA Canny is a high-performance endpoint that uses a control image to transfer structure to the generated image, using a Canny edge map.
A fast, high-quality model for image background removal.
FLUX Control LoRA Depth is a high-performance endpoint that uses a depth map as a control image to transfer structure to the generated image, and an initial image to guide color.
A model for high-quality, smooth background removal in videos.
Generate video clips that more accurately follow natural language descriptions, using camera movement instructions for shot control.
Imagen3 is a high-quality text-to-image model that generates realistic images from text prompts.
Imagen3 Fast is a high-quality text-to-image model that generates realistic images from text prompts.
Ideogram Upscale increases the resolution of the reference image by up to 2x and may also enhance its overall quality. Optionally refine outputs with a prompt for guided improvements.
Image to Video for the Hunyuan Video model using a custom trained LoRA.
Lumina-Image-2.0 is a 2 billion parameter flow-based diffusion transformer which features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.
Fix distorted or blurred photos of people with CodeFormer.
Hunyuan Video is an open video generation model with high visual quality, motion diversity, text-video alignment, and generation stability. Use this endpoint to generate videos from videos.
Generate high-quality video clips from text and image prompts using PixVerse v3.5
Generate high-quality video clips quickly from text prompts using PixVerse v3.5 Fast
Generate high-quality video clips quickly from text and image prompts using PixVerse v3.5 Fast
Generate high-quality video clips from text prompts using PixVerse v3.5
DeepSeek Janus-Pro is a novel text-to-image model that unifies multimodal understanding and generation through an autoregressive framework
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs.
Kling Kolors Virtual TryOn v1.5 is a high-quality, image-based virtual try-on endpoint suitable for commercial use.
Compose videos from multiple media sources using FFmpeg API.
Get waveform data from audio files using FFmpeg API.
Get encoding metadata from video and audio files using FFmpeg API.
Generate video clips maintaining consistent, realistic facial features and identity across dynamic video content
MoonDreamNext Batch is a multimodal vision-language model for batch captioning.
Utilize Flux.1 [pro] Controlnet with a fine-tuned LoRA to generate high-quality images with precise control over composition, style, and structure through advanced edge detection and guidance mechanisms.
Utilize Flux.1 [dev] Controlnet to generate high-quality images with precise control over composition, style, and structure through advanced edge detection and guidance mechanisms.
FLUX LoRA for Pro endpoints.
Utilize Flux.1 [pro] Controlnet to generate high-quality images with precise control over composition, style, and structure through advanced edge detection and guidance mechanisms.
FLUX1.1 [pro] ultra fine-tuned is the newest version of FLUX1.1 [pro] with a fine-tuned LoRA, maintaining professional-grade image quality while delivering up to 2K resolution with improved photo realism.
FLUX1.1 [pro] is an enhanced version of FLUX.1 [pro] with improved image generation capabilities, delivering superior composition, detail, and artistic fidelity compared to its predecessor.
FLUX.1 [pro] Fill Fine-tuned is a high-performance endpoint for the FLUX.1 [pro] model with a fine-tuned LoRA that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
Generate high-quality images from depth maps using Flux.1 [pro] depth estimation model with a fine-tuned LoRA. The model produces accurate depth representations for scene understanding and 3D visualization.
Generate high-quality images from depth maps using Flux.1 [pro] depth estimation model. The model produces accurate depth representations for scene understanding and 3D visualization.
Hunyuan Video is an open video generation model with high visual quality, motion diversity, text-video alignment, and generation stability
Transform text into stunning videos with TransPixar - an AI model that generates both RGB footage and alpha channels, enabling seamless compositing and creative video effects.
Train Hunyuan Video LoRAs on people, objects, characters, and more!
Generate videos from prompts using CogVideoX-5B
Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels
Generate realistic lipsync animations from audio using advanced algorithms for high-quality synchronization.
Multimodal vision-language model for single/multi image understanding
Enhances a given raster image using 'crisp upscale' tool, boosting resolution with a focus on refining small details and faces.
MoonDreamNext Detection is a multimodal vision-language model for gaze detection, bbox detection, point detection, and more.
MoonDreamNext is a multimodal vision-language model for captioning, gaze detection, bbox detection, point detection, and more.
Generate video clips from your prompts using Kling 1.6 (std)
Generate video clips from your images using Kling 1.6 (std)
Automatically generates text captions for your videos from the audio, following your text colour and font specifications
Switti is a scale-wise transformer for fast text-to-image generation that outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being faster than distilled diffusion models.
MMAudio generates synchronized audio given text inputs. It can generate sounds described by a prompt.
Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
This endpoint delivers seamlessly localized videos by generating lip-synced dubs in multiple languages, ensuring natural and immersive multilingual experiences
Bria RMBG 2.0 enables seamless removal of backgrounds from images, ideal for professional editing tasks. Trained exclusively on licensed data for safe and risk-free commercial use. Model weights for commercial use are available here: https://share-eu1.hsforms.com/2GLpEVQqJTI2Lj7AMYwgfIwf4e04
Bria's Text-to-Image model for HD images. Trained exclusively on licensed data for safe and risk-free commercial use. Available also as source code and weights. For access to weights: https://bria.ai/contact-us
Bria GenFill enables high-quality object addition or visual transformation. Trained exclusively on licensed data for safe and risk-free commercial use. Access the model's source code and weights: https://bria.ai/contact-us
Blazing-fast text-to-speech. Generate audio with improved emotional tones and extensive multilingual support. Ideal for high-volume processing and efficient workflows.
Bria's Text-to-Image model, trained exclusively on licensed data for safe and risk-free commercial use. Available also as source code and weights. For access to weights: https://bria.ai/contact-us
Bria's Text-to-Image model with perfect harmony of latency and quality. Trained exclusively on licensed data for safe and risk-free commercial use. Available also as source code and weights. For access to weights: https://bria.ai/contact-us
Bria Expand expands images beyond their borders in high quality. Trained exclusively on licensed data for safe and risk-free commercial use. Access the model's source code and weights: https://bria.ai/contact-us
Bria Eraser enables precise removal of unwanted objects from images while maintaining high-quality outputs. Trained exclusively on licensed data for safe and risk-free commercial use. Access the model's source code and weights: https://bria.ai/contact-us
Bria Background Replace allows for efficient swapping of backgrounds in images via text prompts or reference image, delivering realistic and polished results. Trained exclusively on licensed data for safe and risk-free commercial use
FLUX.1 [dev] Fill is a high-performance endpoint for the FLUX.1 [dev] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
Place any product in any scenery with just a prompt or reference image while maintaining high integrity of the product. Trained exclusively on licensed data for safe and risk-free commercial use and optimized for eCommerce.
Image-based, high-quality virtual try-on
Leffa Virtual TryOn is a high-quality, image-based virtual try-on endpoint suitable for commercial use.
Leffa Pose Transfer is an endpoint for changing the pose in an image to match a reference image.
Generate music from text prompts using the MiniMax model, which leverages advanced AI techniques to create high-quality, diverse musical compositions.
Recraft 20b is a new and affordable text-to-image model.
Generate video clips from your prompts using MiniMax model
Rodin by Hyper3D generates realistic, production-ready 3D models from text or images.
Transform existing images with Ideogram V2's editing capabilities. Modify, adjust, and refine images while maintaining high fidelity and realistic outputs with precise prompt control.
Generate 3D models from your images using Trellis. A native 3D generative model enabling versatile and high-quality 3D asset creation.
Generate video clips from your prompts using Luma Dream Machine v1.5
FASHN delivers precise virtual try-on capabilities, accurately rendering garment details like text and patterns at 576x864 resolution from both on-model and flat-lay photo references.
Accelerated image generation with Ideogram V2 Turbo. Create high-quality visuals, posters, and logos with enhanced speed while maintaining Ideogram's signature quality.
Reimagine existing images with Ideogram V2's remix feature. Create variations and adaptations while preserving core elements and adding new creative directions through prompt guidance.
Rapidly create image variations with Ideogram V2 Turbo Remix. Fast and efficient reimagining of existing images while maintaining creative control through prompt guidance.
The video upscaler endpoint uses RealESRGAN on each frame of the input video to upscale the video to a higher resolution.
Edit images faster with Ideogram V2 Turbo. Quick modifications and adjustments while preserving the high-quality standards and realistic outputs of Ideogram.
Generate images from your prompts using Luma Photon Flash. Photon Flash is the most creative, personalizable, and intelligent visual model for creatives, bringing a step-function change in the cost of high-quality image generation.
AuraFlow v0.3 is an open-source flow-based text-to-image generation model that achieves state-of-the-art results on GenEval. The model is currently in beta.
OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts. It can be used for various tasks such as Image Editing, Personalized Image Generation, Virtual Try-On, Multi Person Generation and more!
FLUX.1 [schnell] Redux is a high-performance endpoint for the FLUX.1 [schnell] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
FLUX.1 [schnell] is a 12 billion parameter flow transformer that generates high-quality images from text in 1 to 4 steps, suitable for personal and commercial use.
Generate videos from images using LTX Video
FLUX1.1 [pro] ultra Redux is a high-performance endpoint for the FLUX1.1 [pro] ultra model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
Generate high-quality images from depth maps using the FLUX.1 [dev] depth model. The input depth map guides the spatial layout of the generated scene, which is useful for structure-preserving generation and 3D visualization.
FLUX.1 [dev] Redux is a high-performance endpoint for the FLUX.1 [dev] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
FLUX1.1 [pro] Redux is a high-performance endpoint for the FLUX1.1 [pro] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
FLUX.1 [pro] Fill is a high-performance endpoint for the FLUX.1 [pro] model that enables inpainting and outpainting of existing images, filling masked regions with high-quality content using the core FLUX capabilities.
FLUX.1 [pro] Redux is a high-performance endpoint for the FLUX.1 [pro] model that enables rapid transformation of existing images, delivering high-quality style transfers and image modifications with the core FLUX capabilities.
Photorealistic Image-to-Image
An endpoint for relighting photos and changing their backgrounds according to a given description.
Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation.
FLUX.1 Differential Diffusion is a rapid endpoint that enables swift, granular control over image transformations through change maps, delivering fast and precise region-specific modifications while maintaining FLUX.1 [dev]'s high-quality output.
Recraft V3 Create Style is capable of creating unique styles for Recraft V3 based on your images.
An endpoint for personalized image generation using FLUX, based on a given description.
Bilateral Reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS).
Stable Diffusion 3.5 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.
Hunyuan Video is an open video generation model with high visual quality, motion diversity, text-video alignment, and generation stability. This endpoint generates videos from text descriptions.
Generate videos from videos and prompts using CogVideoX-5B
F5 TTS
Generate videos from images and prompts using CogVideoX-5B
Use any vision language model from our selected catalogue (powered by OpenRouter)
Generate video clips from your images using Kling 1.5 (pro)
Generate video clips from your images using Kling 1.0 (pro)
Generate video clips from your images using Kling 1.0
Generate video clips from your prompts using Kling 1.0 (pro)
Generate videos from prompts using LTX Video
FLUX.1 [pro] new is an accelerated version of FLUX.1 [pro], maintaining professional-grade image quality while delivering significantly faster generation speeds.
Transfer expression from a video to a portrait.
A general purpose endpoint for the FLUX.1 [dev] model, implementing the RF-Inversion pipeline. This can be used to edit a reference image based on a prompt.
Generate short video clips from your images using SVD v1.1
MiDaS depth estimation preprocessor.
Line art preprocessor.
TEED (Tiny and Efficient Edge Detector) preprocessor.
Holistically-Nested Edge Detection (HED) preprocessor.
Generate short video clips from your prompts using SVD v1.1
PIDI (PiDiNet) preprocessor.
Scribble preprocessor.
M-LSD line segment detection preprocessor.
ZoeDepth preprocessor.
Segment Anything Model (SAM) preprocessor.
Depth Anything v2 preprocessor.
Animate a reference image with a driving video using ControlNeXt.
Multimodal vision-language model for video understanding
Stable Diffusion 3 Medium (Text to Image) is a Multimodal Diffusion Transformer (MMDiT) model that improves image quality, typography, prompt understanding, and efficiency.
SAM 2 is a model for segmenting images and videos in real time.
FLUX General Image-to-Image is a versatile endpoint that transforms existing images with support for LoRA, ControlNet, and IP-Adapter extensions, enabling precise control over style transfer, modifications, and artistic variations through multiple guidance methods.
FLUX General Inpainting is a versatile endpoint that enables precise image editing and completion, supporting multiple AI extensions including LoRA, ControlNet, and IP-Adapter for enhanced control over inpainting results and sophisticated image modifications.
A specialized FLUX endpoint combining differential diffusion control with LoRA, ControlNet, and IP-Adapter support, enabling precise, region-specific image transformations through customizable change maps.
FLUX LoRA Image-to-Image is a high-performance endpoint that transforms existing images using FLUX models, leveraging LoRA adaptations to enable rapid and precise image style transfer, modifications, and artistic variations.
Default parameters with automated optimizations and quality improvements.
Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Super fast endpoint for the FLUX.1 [schnell] model with subject input capabilities, enabling rapid and high-quality image generation for personalization, specific styles, brand identities, and product-specific outputs.
Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, with the ability to generate 4K images in less than a second.
An efficient SDXL multi-ControlNet text-to-image model.
An efficient SDXL multi-ControlNet inpainting model.
An efficient SDXL multi-ControlNet image-to-image model.
Photorealistic Text-to-Image
Interpolate between image frames
Transfer expression from a video to a portrait.
A powerful image-to-multiview model that generates novel views, along with normal maps, from an input image.
Stable Cascade: Image generation on a smaller & cheaper latent space.
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks
Run SDXL at the speed of light
Stable Diffusion 3 Medium (Image to Image) is a Multimodal Diffusion Transformer (MMDiT) model that improves image quality, typography, prompt understanding, and efficiency.
Anime finetune of Würstchen V3.
Generate short video clips from your images using SVD v1.1 at lightning speed.
Generate video clips from your images using Luma Dream Machine v1.5
Generate images from your prompts using Luma Photon. Photon is the most creative, personalizable, and intelligent visual model for creatives, bringing a step-function change in the cost of high-quality image generation.
Predict poses.
SD 1.5 ControlNet
SOTA Image Upscaler
State-of-the-art open-source model in aesthetic quality
Hyper-charge SDXL's performance and creativity.
Collection of SDXL Lightning models.
Generate realistic images.
Hyper-charge SDXL's performance and creativity.
Any pose, any style, any identity
Dreamshaper model.
High quality zero-shot personalization
Run any Stable Diffusion model with customizable LoRA weights.
Run SDXL at the speed of light
Stable Diffusion v1.5
Run SDXL at the speed of light
SDXL with an alpha channel.
Run SDXL at the speed of light
MuseTalk is a real-time high quality audio-driven lip-syncing model. Use MuseTalk to animate a face with your own audio.
Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
[Experimental] Whisper v3 Large, optimized by our inference wizards: the same WER with double the performance!
Predict the probability of an image being NSFW.
Answer questions about images.
Fooocus extreme speed mode as a standalone app.
Create stickers from faces.
Customizing Realistic Human Photos via Stacked ID Embedding
Generate short video clips from your prompts
Generate images with ControlNet.
Create creative upscaled images.
Re-animate your videos with evolved consistency!
Bilateral Reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS).
State-of-the-art open-source model in aesthetic quality
Interpolate between video frames
Run SDXL at the speed of light
State-of-the-art open-source model in aesthetic quality
Animate your ideas!
Hyper-charge SDXL's performance and creativity.
Whisper is a model for speech transcription and translation.
Run SDXL at the speed of light
Fooocus extreme speed mode as a standalone app.
Use any large language model from our selected catalogue (powered by OpenRouter)
Vision
Animate your ideas at lightning speed!
Default parameters with automated optimizations and quality improvements.
Re-animate your videos!
Create depth maps using MiDaS depth estimation.
Generate short video clips from your images using SVD v1.1 at lightning speed.
Create illusions conditioned on image.
Re-animate your videos at lightning speed!
Generate video clips from your prompts using MiniMax model
Re-animate your videos with evolved consistency!
Automatically retouches faces to smooth skin and remove blemishes.
Run SDXL at the speed of light
Produce high-quality images with minimal inference steps.
State-of-the-art image-to-3D object generation.
Diffusion-based, high-quality edge detection.
Open-source text-to-audio model.
Create depth maps using Marigold depth estimation.
Tuning-free ID customization.
Generate images with ControlNet.
Default parameters with automated optimizations and quality improvements.
Animate Your Drawings with Latent Consistency Models!
Produce high-quality images with minimal inference steps. Optimized for 512x512 input image size.
Inpaint images with SD and SDXL
Upscale images by a given factor.
Remove the background from an image.
Run any Stable Diffusion model with customizable LoRA weights.