
Turn simple sketches into detailed, fully-textured 3D models. Instantly convert your concept designs into formats ready for Unity, Unreal, and Blender.

Generate seamlessly tiling photorealistic images from text using Z-Image Turbo

Generate video clips from multiple reference images using Vidu Q1

The OpenRouter Responses API with fal provides unified access to a wide range of large language models, including GPT, Claude, Gemini, and many others, through a single API interface.

Use the Hunyuan Foley model to bring your videos to life by adding sound effects.

OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts. It can be used for various tasks such as Image Editing, Personalized Image Generation, Virtual Try-On, Multi Person Generation and more!

Transfer expression from a video to a portrait.

![FLUX.1 [dev] is a 12 billion parameter flow transformer that generates high-quality images from text. It is suitable for personal and commercial use.](https://refinery.fal.media/url/https%3A%2F%2Fstorage.googleapis.com%2Ffal_cdn%2Ffal%2FUpscale-2.jpg/tr:w-1920,q-80/Upscale-2.webp)
FLUX.1 [dev] is a 12 billion parameter flow transformer that generates high-quality images from text. It is suitable for personal and commercial use.

SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Transform your consistent character into different art styles, settings, or scenarios while maintaining their distinctive appearance and identity

Create depth maps using Marigold depth estimation.

Moondream2 is a highly efficient open-source vision language model that combines powerful image understanding capabilities with a remarkably small footprint.

Automatically generate text captions for your videos from the audio, styled according to your text colour and font specifications.

Create custom voices using the Qwen3-TTS Voice Design model, then use the Clone Voice model to create your own voices.

Generates depth maps from video using Video Depth Anything (CVPR 2025). Produces per-frame depth estimation with temporal consistency across frames. Supports 3 model sizes (Small, Base, Large), 5 colormaps including grayscale, side-by-side comparison with the original video, and raw depth export as .npz. Useful for 3D reconstruction, video effects, compositing, and scene understanding.
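As a rough illustration of working with a raw depth export, the sketch below loads a `.npz` archive and scales the depth values to an 8-bit grayscale range using a single min/max for the whole clip, so brightness stays comparable across frames. The listing does not document the archive layout, so the array key `depth`, the `(frames, height, width)` shape, and the helper names are all assumptions made for this example; the archive here is synthetic rather than real model output.

```python
import io

import numpy as np


def load_depth(npz_source, key: str = "depth") -> np.ndarray:
    """Load a depth array from an .npz file or file-like object.

    The key name "depth" is an assumption; inspect `archive.files`
    on a real export to find the actual array name.
    """
    with np.load(npz_source) as archive:
        return archive[key].astype(np.float32)


def depth_to_grayscale(depth: np.ndarray) -> np.ndarray:
    """Scale depth values to uint8 using one global range for all frames."""
    d_min = float(depth.min())
    d_max = float(depth.max())
    if d_max == d_min:
        # A constant depth map has no range to stretch; return black frames.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - d_min) / (d_max - d_min)
    return np.round(scaled * 255.0).astype(np.uint8)


def make_example_npz() -> io.BytesIO:
    """Build a small synthetic archive standing in for a real export."""
    buffer = io.BytesIO()
    frames = np.linspace(0.5, 4.5, num=2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
    np.savez(buffer, depth=frames)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    gray = depth_to_grayscale(load_depth(make_example_npz()))
    print(gray.shape, gray.dtype, gray.min(), gray.max())
```

Normalizing per clip rather than per frame is a design choice: per-frame scaling would maximize contrast in each image but make brightness flicker between frames.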

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. Use the first TTS model from Resemble AI.

Vidu's Q3 Turbo Model.

![Utilize Flux.1 [dev] Controlnet to generate high-quality images with precise control over composition, style, and structure through advanced edge detection and guidance mechanisms.](https://refinery.fal.media/url/https%3A%2F%2Fstorage.googleapis.com%2Ffalserverless%2Fgallery%2Fflux_lora.jpg/tr:w-1920,q-80/flux_lora.webp)
Use FLUX.1 [dev] ControlNet to generate high-quality images with precise control over composition, style, and structure through advanced edge detection and guidance mechanisms.

Generate high-quality videos with UGC-like avatars from text

Maya1 is a state-of-the-art speech model by Maya Research for expressive voice generation, built to capture real human emotion and precise voice design.

Bria GenFill enables high-quality object addition or visual transformation. Trained exclusively on licensed data for safe and risk-free commercial use. Access the model's source code and weights: https://bria.ai/contact-us

Use Hunyuan Image 2.1 to generate images that express the feeling of your text.

High-quality voice cloning TTS model that generates 48kHz speech from text and a reference audio clip. Distilled to 4 steps for fast inference.

NVIDIA's Logically Consistent and Physics-Aware Image Editing Model

VOID removes objects from videos along with all interactions they induce on the scene

Generate 3D models from a single image using Tripo P1.

Stable Diffusion 3 Medium (Image to Image) is a Multimodal Diffusion Transformer (MMDiT) model that improves image quality, typography, prompt understanding, and efficiency.