The inference platform for genmedia

fal Serverless powers 1,300+ production endpoints across thousands of GPUs. Deploy custom workloads with autoscaling, observability, and performance controls built for inference at scale.

Dashboard snapshot

GPU usage over time: 895 GPUs in use

Requests by status code: Success (2XX), Warning (4XX), Error (5XX)

Request traffic: 190.41 req/s processed, 192.92 req/s received

Concurrent requests: last 3,053, max 4,815, mean 4,005

Requests by app

Try live endpoints, built on Serverless

Serverless powers private APIs, fal models, and public marketplace endpoints. Try live endpoints across every modality, then deploy your own privately or publish to the marketplace.

Explore all

Video

Aurora

Generate high-fidelity, studio-quality videos of your avatar speaking or singing using Aurora from the Creatify team.

Video

Fabric 1.0

VEED Fabric 1.0 is an image-to-video API that turns any image into a talking video.

Image

Flux 2

Text-to-image generation with FLUX.2 [dev] from Black Forest Labs. Enhanced realism, crisper text generation, and native editing capabilities.

Realtime

Flux 2 Klein

Realtime generation with FLUX.2 [klein] from Black Forest Labs.

Trellis 2

Generate 3D models from your images using Trellis 2. A native 3D generative model enabling versatile and high-quality 3D asset creation.

Video

Avatar 4

Turn a single photo or video into a lifelike talking avatar with natural expressions, head tilts, and gestures using HeyGen Avatar 4.

Everything you need to operate at scale

Built-in deployment, observability, fault tolerance, and autoscaling in one unified platform.

Watch now:

What’s new in Serverless?

Deploy to production without managing infrastructure

Migrate quickly by bringing your own custom container image.
Host model weights on fal’s distributed /data volume with lightning-fast reads.
fal deploy builds, pushes, warms, and serves your model behind a stable endpoint.

app.py

import fal
from pydantic import BaseModel, Field
from fal.toolkit import Image


class Input(BaseModel):
    prompt: str = Field(
        description="The prompt to generate an image from",
        examples=["A professional image of a cat"],
    )


class Output(BaseModel):
    image: Image


class ImageGenerator(fal.App):
    app_name = "image-generator"
    machine_type = "GPU-H100"
    min_concurrency = 0
    max_concurrency = 20
    requirements = [
        "hf-transfer==0.1.9",
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
        "accelerate==1.6.0",
    ]

    def setup(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")

        # Warmup the model before the first request
        self.warmup()

    def warmup(self):
        self.pipe("A professional image of a cat")

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        result = self.pipe(request.prompt)
        image = Image.from_pil(result.images[0])
        return Output(image=image)

fal CLI

$ fal deploy app.py
✓ registering ImageGenerator · GPU-H100
✓ building image · 2.4 GB (11s)
✓ pushing to fal-registry (3.1s)
✓ deployed → fal.run/acme/image-generator
scaling 0 → 20 · cold-start 0.41s
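
Once deployed, the endpoint can be called over HTTP. The sketch below is a minimal illustration, not an official client: it builds a POST request whose JSON body matches the Input model in app.py. The host and path come from the CLI output above; the https scheme, the "Key" authorization format, and the helper names build_request and generate are assumptions made for this example.

```python
import json
import urllib.request

# Host and path are taken from the deploy output; the https scheme is assumed.
ENDPOINT_URL = "https://fal.run/acme/image-generator"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the Input model."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check your account settings for the real one.
            "Authorization": f"Key {api_key}",
        },
    )


def generate(prompt: str, api_key: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(prompt, api_key)) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before any GPU time is spent.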

Full observability from alert to resolution

See what’s happening across every deployment with logs, request traces, analytics, app events, and latency breakdowns in one dashboard.

Diagnose bottlenecks fast across cold starts, execution time, errors, and request behavior.
Configure alerts with custom thresholds and receive them in the app, in Slack, or by email.
Integrate metrics into your existing observability stack through the Platform API.

Tune every workload for speed and efficiency

Minimize cold start times with FlashPack for fast model loading and persistent /data storage for model weights and cached compilation artifacts.
Tune autoscaling live by adjusting concurrency buffers, scaling delay, and warm capacity from the dashboard.
Get expert optimization support from fal engineers with the same performance expertise behind some of the world’s fastest diffusion inference pipelines.

Scale to thousands of GPUs instantly

Run large-scale production workloads across H100s, H200s, B200s, B300s, RTX PRO 6000s, and more.

Eliminate cold starts by setting min_concurrency to keep a baseline of warm runners.
Control burst capacity with max_concurrency and concurrency_buffer to absorb demand spikes.
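
To make the interaction between these three settings concrete, here is a simplified mental model. It is not fal's actual scheduler: it assumes each runner handles one request at a time and that the buffer is a fixed number of spare runners added on top of current demand. The function name desired_runners is hypothetical.

```python
def desired_runners(
    in_flight: int,
    min_concurrency: int,
    max_concurrency: int,
    concurrency_buffer: int = 0,
) -> int:
    """Illustrative runner target: demand plus buffer, clamped to [min, max].

    Assumes one request per runner; the real autoscaler may differ.
    """
    wanted = in_flight + concurrency_buffer
    return max(min_concurrency, min(wanted, max_concurrency))


# Quiet period: the warm baseline keeps two runners alive.
print(desired_runners(in_flight=0, min_concurrency=2, max_concurrency=20))
# Moderate load: five busy runners plus three spare.
print(desired_runners(in_flight=5, min_concurrency=2, max_concurrency=20, concurrency_buffer=3))
# Spike: demand exceeds the ceiling, so the target is capped.
print(desired_runners(in_flight=50, min_concurrency=2, max_concurrency=20, concurrency_buffer=3))
```

In this model, min_concurrency sets the floor that avoids cold starts, the buffer absorbs the first requests of a spike, and max_concurrency bounds spend.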

Pay for what runs

Our Serverless and compute pricing. Find the right plan for your workload.

Explore Pricing

GPU / Hardware          List Price    As low as
B300 (288GB)            $8.50         $4.49
B200 (180GB)            $6.25         $3.49
H200 (141GB)            $4.50         $2.10
H100 (80GB)             $3.99         $1.89
RTX PRO 6000 (96GB)     $2.99         $1.10
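
As a quick worked comparison, the snippet below computes how far the "as low as" price sits below the list price for each GPU. The figures are copied from the pricing above; the function name discount_percent is just for illustration, and no billing unit is assumed because only the ratio matters.

```python
# (list price, "as low as" price) copied from the pricing section.
PRICES = {
    "B300 (288GB)": (8.50, 4.49),
    "B200 (180GB)": (6.25, 3.49),
    "H200 (141GB)": (4.50, 2.10),
    "H100 (80GB)": (3.99, 1.89),
    "RTX PRO 6000 (96GB)": (2.99, 1.10),
}


def discount_percent(list_price: float, low_price: float) -> float:
    """Percentage reduction from the list price to the lowest price."""
    return (list_price - low_price) / list_price * 100


for gpu, (list_price, low_price) in PRICES.items():
    print(f"{gpu}: {discount_percent(list_price, low_price):.1f}% below list")
```

The lowest advertised rates work out to roughly 44% to 63% below list, with the largest reduction on the RTX PRO 6000.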

Inference infrastructure that keeps pace with AI

Contact Sales

Documentation

Battle-tested

Optimized in production, every day

Every optimization, every reliability improvement, every performance gain gets stress-tested against our own production workloads before it ever reaches yours. We ship to ourselves first.

Elastic

Evergreen by necessity

The AI model landscape moves fast. Because we're continuously onboarding new models and architectures to our own platform, fal Serverless is constantly updated to support new model formats, serving patterns, and hardware optimizations.

Dedicated eng

Trusted by teams building what's next

We've served billions of inference requests across thousands of models. That scale is why Canva, HeyGen, Krea, and many more chose fal when it mattered most.

1,300+ endpoints in production

Scale to 1000s of GPUs

99.99% uptime SLA

Billions of requests served a year

Purpose-built for every media modality

fal supports every workflow in your stack, from fine-tuning to realtime inference, across every major modality.

Video

Realtime

ComfyUI

World-Models

3D

LoRA Training

Built by fal to run fal

2022: Inference runtime is born

We created fal Serverless to run our own inference workloads.

2023: 100+ models deployed

Queues, webhooks, caching, logging, and analytics infrastructure turned the runtime into a production platform used to serve models at scale.

2024: Expanded to realtime and multimodal

As AI expanded beyond images, fal Serverless evolved to support realtime apps, audio, video, 3D, containers, and complex workflows.

2025: Multi-GPU and next-gen hardware

Multi-GPU execution and support for H200/B200-class infrastructure enabled larger models, faster video generation, and higher-throughput inference.

2026: World Model Accelerator

A new interface to fal's core primitives, purpose-built for world models.

FAQ

What is serverless inference?

Serverless inference lets you run AI models without managing GPU infrastructure. Traditional serverless platforms focus on general cloud functions, while fal is purpose-built for AI inference with lightning-fast execution, scalability, and enterprise reliability. fal handles GPU provisioning, autoscaling, cold starts, observability, and production deployment so teams can run custom image, video, audio, 3D, and world models with low latency and usage-based pricing. It is ideal for workloads with bursty demand, fast iteration cycles, strict latency targets, and production-scale AI inference. fal's serverless infrastructure doesn't just power our customers' applications; it powers fal itself: every model, every inference call, every workload running on our platform. So the reliability bar here isn't theoretical.

What’s the best platform to deploy custom AI models?

fal is purpose-built for deploying custom AI models in production, especially generative media workloads like image, video, audio, 3D, and world models. Teams use fal Serverless because of its low GPU pricing, fast cold starts, low latency, high throughput, enterprise reliability, and hands-on support from AI infrastructure experts. Today, over 2.5 million developers build on fal, and companies like Canva, HeyGen, Krea, Veed, Creatify, and Fashn deploy custom AI models on fal Serverless. fal processes millions of daily inference calls with 99.99% uptime, and demand continues to accelerate as more developers integrate generative media capabilities into their applications. fal also provides access to more than 1,000 production-ready image, video, audio, and 3D models through a unified API, enabling developers to build and scale generative media applications with enterprise-grade reliability.

Which platform has the best GPU pricing for H100, B200, and B300 inference?

fal offers highly competitive serverless GPU pricing for modern AI inference, including H100-, B200-, and B300-class workloads. fal supports state-of-the-art hardware for inference and compute, including B300, B200, H200, H100, H100 MIG, RTX PRO 6000, A100 80GB, and L40S/L40. For the latest pricing, contact sales for a quote tailored to your model, traffic pattern, latency target, and business needs.

How am I billed for Serverless?

You are billed per second for the total time your runners are alive, at the rate for your chosen machine type. This includes setup(), idle time (including keep_alive), active request processing, draining, and teardown. You are not billed for pending time or container image pulls. See Serverless Pricing for the full breakdown by runner state.
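
The billing rule above can be sketched as a small calculation. This is an illustration under stated assumptions, not fal's billing code: the state labels and the function name billed_cost are hypothetical, and the machine rate is assumed to be quoted per hour, which the answer above does not state.

```python
# Runner states that count toward the bill, per the answer above.
BILLABLE_STATES = {"setup", "idle", "active", "draining", "teardown"}
# States that are explicitly not billed.
UNBILLED_STATES = {"pending", "image_pull"}


def billed_cost(seconds_by_state: dict[str, float], rate_per_hour: float) -> float:
    """Sum billable seconds and price them at an assumed hourly rate."""
    unknown = set(seconds_by_state) - BILLABLE_STATES - UNBILLED_STATES
    if unknown:
        raise ValueError(f"unknown runner states: {sorted(unknown)}")
    billable_seconds = sum(
        seconds
        for state, seconds in seconds_by_state.items()
        if state in BILLABLE_STATES
    )
    return billable_seconds * rate_per_hour / 3600


# Hypothetical lifetime of one runner, in seconds per state.
runner = {
    "pending": 12,
    "image_pull": 30,
    "setup": 45,
    "active": 600,
    "idle": 300,
    "draining": 10,
    "teardown": 5,
}
# 960 billable seconds at an assumed $3.99 per hour.
print(f"${billed_cost(runner, 3.99):.3f}")
```

Note that idle time contributes a third of the billable seconds in this example, which is why keep_alive and warm capacity settings matter for cost as well as latency.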

How easy is it to migrate from another platform?

If you already have a working Docker container or a Python inference server, migrating to fal is straightforward. You can bring your existing Dockerfile directly with custom container images, or wrap your model in a fal.App class with setup() and endpoint methods. fal has step-by-step migration guides for Replicate, Modal, RunPod, and generic Docker servers.

Full FAQ

Every model on fal runs on fal Serverless. Three years of production. Billions of inferences. The runtime you'd deploy is the runtime we bet our company on every day.
