Model Gallery

See all available model APIs provided by fal.ai
Available now

AuraFlow is here!

Discover the latest in text-to-image technology with enhanced multi-subject capabilities, improved image quality, and better spelling accuracy.

Fal.ai demos with unmatched AI speed


AuraFlow

Fully open, flow-based text-to-image model

text-to-image
inference
optimized
Stable Diffusion V3

Run SD3 at the speed of light

text-to-image
inference
optimized
Stable Diffusion XL

Run SDXL at the speed of light

text-to-image
inference
loras
Stable Diffusion with LoRAs

Run any Stable Diffusion model with customizable LoRA weights.

text-to-image
inference
loras
AuraSR

Upscale your images with AuraSR.

image-to-image
inference
upscaler
Stable Cascade

Stable Cascade: image generation in a smaller, cheaper latent space.

text-to-image
inference
lcm
High Quality Stable Video Diffusion

Generate short video clips from your images using SVD v1.1

image-to-video
inference
video
BiRefNet Background Removal

Bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS).

image-to-image
background
utility
Creative Upscaler

Create creatively upscaled images.

image-to-image
inference
upscaler
Clarity Upscaler

Clarity upscaler for images with high fidelity.

image-to-image
inference
upscaler
CCSR Upscaler

State-of-the-art image upscaler.

image-to-image
inference
upscaler
Stable Diffusion Turbo (v1.5/XL)

Run SDXL at the speed of light

text-to-image
real-time
Latent Consistency Models (v1.5/XL)

Run SDXL at the speed of light

text-to-image
real-time
Whisper

Whisper is a model for speech transcription and translation.

speech-to-text
inference
speech
Wizper (Whisper v3 -- fal.ai edition)

[Experimental] Whisper v3 Large -- but optimized by our inference wizards. Same WER, double the performance!

speech-to-text
inference
speech
Stable Diffusion XL Lightning

Run SDXL at the speed of light

text-to-image
real-time
Hyper SDXL

Hyper-charge SDXL's performance and creativity.

text-to-image
real-time
Playground v2.5

State-of-the-art open-source model in aesthetic quality

text-to-image
inference
artistic
Japanese Stable Diffusion XL

Japanese-specific SDXL model that accepts prompts in Japanese and generates Japanese-style images.

text-to-image
inference
localized
AMT Interpolation

Interpolate between video frames

video-to-video
inference
video
T2V Turbo - Video Crafter

Generate short video clips from your prompts

text-to-video
inference
video
SD 1.5 Depth ControlNet

SD 1.5 ControlNet

image-to-image
inference
depth
ControlNet Tile Upscaler

ControlNet Tile Upscaler

image-to-image
inference
controlnet
PhotoMaker

Customizing Realistic Human Photos via Stacked ID Embedding

image-to-image
inference
realistic
Latent Consistency (SDXL & SDv1.5)

Produce high-quality images with minimal inference steps.

text-to-image
real-time
Optimized Latent Consistency (SDv1.5)

Produce high-quality images with minimal inference steps. Optimized for 512x512 input image size.

image-to-image
real-time
Fooocus

Default parameters with automated optimizations and quality improvements.

text-to-image
inference
stylized
InstantID

Zero-shot Identity-Preserving Generation in Seconds

image-to-image
inference
AnimateDiff Video-to-Video Evolved

Re-animate your videos with evolved consistency!

video-to-video
inference
video
AnimateDiff

Animate your ideas!

text-to-video
inference
video
AnimateDiff Turbo

Animate your ideas at lightning speed!

text-to-video
inference
video
MetaVoice

MetaVoice-1B is a 1.2B-parameter base model for TTS (text-to-speech), trained on 100K hours of speech.

text-to-speech
inference
speech
Illusion Diffusion

Create illusions conditioned on an image.

text-to-image
inference
stylized
Segment Anything Model

SAM.

image-to-image
inference
mask
TinySAM Distilled Segment Anything Model

TinySAM.

image-to-image
inference
mask
MiDaS Depth Estimation

Create depth maps using MiDaS depth estimation.

image-to-image
inference
utility
Remove Background

Remove the background from an image.

image-to-image
background
utility
Upscale Images

Upscale images by a given factor.

image-to-image
inference
upscaler
ControlNet SDXL

Generate images with ControlNet.

text-to-image
inference
controlnet
Inpainting SDXL and SD

Inpaint images with SD and SDXL.

image-to-image
inference
inpainting
AnimateDiff SparseCtrl LCM

Animate your drawings with Latent Consistency Models!

text-to-video
inference
lcm
Swap Face

Swap a face between two images.

image-to-image
inference
utility
PuLID

Tuning-free ID customization.

image-to-image
inference
utility
IP Adapter Face ID

High-quality zero-shot personalization

image-to-image
inference
personalization
Marigold Depth Estimation

Create depth maps using Marigold depth estimation.

image-to-image
inference
depth
XTTS

text-to-audio
inference
utility
Stable Audio Open

Open-source text-to-audio model.

text-to-audio
inference
audio
DiffusionEdge

Diffusion-based, high-quality edge detection

text-to-image
inference
Stable Diffusion XL Image to Image with LoRAs

Run Stable Diffusion XL with customizable LoRA weights.

image-to-image
inference
stylized
TripoSR

State-of-the-art image-to-3D object generation

image-to-3d
inference
stylized
Remeshing

Remesh an existing 3D object

3d-to-3d
inference
Face Retoucher

Automatically retouches faces to smooth skin and remove blemishes.

image-to-image
inference
utility
LLaVA v1.5 13B

Vision

vision
inference
streaming
LLaVA v1.6 34B

Vision

vision
inference
NSFW Filter

Predict the probability of an image being NSFW.

vision
inference
utility
SUPIR Upscaler

A Powerful Image Upscaler

image-to-image
inference
upscaler
Face to Sticker

Create stickers from faces.

image-to-image
inference
utility
ControlNet Scribble

Generate images conditioned on scribbles.

image-to-image
inference
utility
Moondream

Answer questions about images.

vision
inference
utility
Sad Talker

Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

image-to-video
inference
Stable Diffusion with LoRAs

Run any Stable Diffusion model with customizable LoRA weights.

image-to-image
inference
loras
Stable Diffusion XL

Run SDXL at the speed of light

image-to-image
inference
loras
Stable Diffusion XL

Run SDXL at the speed of light

image-to-image
inference
inpainting
PixArt-Σ

Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

text-to-image
inference
realistic
Dreamshaper

Dreamshaper model.

text-to-image
inference
stylized
Realistic Vision

Generate realistic images.

text-to-image
inference
stylized
Lightning Models

Collection of SDXL Lightning models.

text-to-image
inference
stylized
Omni Zero

Any pose, any style, any identity

image-to-image
inference
stylized
Idefics2 8B

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs.

vision
inference
InternLM XComposer 2 7B

A general vision-language large model (VLLM) based on InternLM2, capable of understanding 4K-resolution images.

vision
inference
LLaVA Phi 3 Mini

A LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

vision
inference
Mantis LLava 7B v1.1

A multimodal conversational AI model that can chat with users about images and text. It's optimized for multi-image reasoning, where interleaved text and images can be fed as input to generate responses.

vision
inference
Lipsync

A lipsync model that synchronizes speech to face movements.

video
inference
Qwen VL Chat 7B Int4

A visual multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input and outputs text and bounding boxes.

vision
inference
Virtual Try-On

Image-based virtual try-on

image-to-image
inference
stylized
LLaVA Llama3 8B

A model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with LLaVA-Pretrain and LLaVA-Instruct by XTuner.

vision
inference
ToonCrafter

Create animated videos from keyframe images.

image-to-video
inference
stylized
DWPose Pose Prediction

Predict poses.

image-to-image
inference
utility
SoteDiffusion

Anime finetune of Würstchen V3.

text-to-image
inference
lcm
Florence-2 Large

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.

image-to-text
inference
optimized
Live Portrait

Transfer expression from a video to a portrait.

image-to-video
inference