The inference platform for genmedia

fal Serverless powers 1,300+ production endpoints across thousands of GPUs. Deploy custom workloads with autoscaling, observability, and performance controls built for inference at scale.
Endpoints POST /run Scheduler Routing GPU Pool
GPU usage over time: 895 GPUs in use
Requests by status code: Success (2XX), Warning (4XX), Error (5XX)
Request traffic: Processed 190.41 req/s, Received 192.92 req/s
Concurrent requests: Last 3,053, Max 4,815, Mean 4,005
Requests by app

Everything you need to operate at scale

Built-in deployment, observability, fault tolerance, and autoscaling in one unified platform.
Deploy to production without managing infrastructure
  • Migrate quickly by bringing your own custom container image.
  • Host model weights on fal’s distributed /data volume with lightning-fast reads.
  • fal deploy builds, pushes, warms, and serves your model behind a stable endpoint.
app.py
import fal
from pydantic import BaseModel, Field
from fal.toolkit import Image


class Input(BaseModel):
    prompt: str = Field(
        description="The prompt to generate an image from",
        examples=["A professional image of a cat"],
    )


class Output(BaseModel):
    image: Image


class ImageGenerator(fal.App):
    app_name = "image-generator"
    machine_type = "GPU-H100"
    min_concurrency = 0
    max_concurrency = 20
    requirements = [
        "hf-transfer==0.1.9",
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
        "accelerate==1.6.0",
    ]

    def setup(self):
        # Imported here so the heavy dependencies load on the runner
        import torch
        from diffusers import StableDiffusionXLPipeline

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")
        # Warm up the model before the first request
        self.warmup()

    def warmup(self):
        self.pipe("A professional image of a cat")

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        result = self.pipe(request.prompt)
        image = Image.from_pil(result.images[0])
        return Output(image=image)
fal CLI
fal deploy app.py
✓ registering ImageGenerator · GPU-H100
✓ building image · 2.4 GB (11s)
✓ pushing to fal-registry (3.1s)
✓ deployed → fal.run/acme/image-generator
  scaling 0 → 20 · cold-start 0.41s
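Once the deploy step prints an endpoint such as fal.run/acme/image-generator, a client can call it over HTTP. The sketch below uses only the Python standard library. It assumes the endpoint accepts a JSON POST whose body matches the Input model, and that the API key is sent as an "Authorization: Key ..." header read from a FAL_KEY environment variable. The helper names build_request and generate are illustrative, not part of any fal SDK.

```python
import json
import os
import urllib.request

# Endpoint taken from the example deploy output; "acme" is the sample account.
ENDPOINT_URL = "https://fal.run/acme/image-generator"


def build_request(prompt: str, api_key: str, url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request whose body matches the Input model."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumed auth scheme: API key passed with the "Key" prefix.
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
    )


def generate(prompt: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(prompt, os.environ["FAL_KEY"])
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Calling generate("A professional image of a cat") with FAL_KEY set would return the decoded response; its exact shape depends on how the Output model is serialized, so inspect it before relying on specific fields.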
Full observability from alert to resolution

See what’s happening across every deployment with logs, request traces, analytics, app events, and latency breakdowns in one dashboard.

Tune every workload for speed and efficiency
Scale to thousands of GPUs instantly

Run large-scale production workloads across H100s, H200s, B200s, B300s, RTX PRO 6000s, and more.

  • Eliminate cold starts by setting min_concurrency to keep a baseline of warm runners.
  • Control burst capacity with max_concurrency and concurrency_buffer to absorb demand spikes.
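To make the interaction between the three settings concrete, here is a minimal sketch of one plausible scaling rule. It is an assumption for illustration, not fal's actual scheduler: it treats each runner as handling one request at a time, keeps concurrency_buffer spare runners above current load, never drops below min_concurrency, and never exceeds max_concurrency. ScalingConfig and target_runners are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    min_concurrency: int = 0      # baseline of warm runners
    max_concurrency: int = 20     # hard ceiling on runners
    concurrency_buffer: int = 0   # spare runners kept above current load


def target_runners(in_flight_requests: int, config: ScalingConfig) -> int:
    """Hypothetical scaling rule; assumes one request per runner."""
    # Keep headroom above current demand, but never fall below the warm baseline.
    wanted = max(config.min_concurrency, in_flight_requests + config.concurrency_buffer)
    # Cap the result at the configured ceiling.
    return min(wanted, config.max_concurrency)
```

Under this rule, a config with min_concurrency=2, concurrency_buffer=3, and max_concurrency=20 would keep 3 runners warm at zero traffic, 13 runners at 10 in-flight requests, and 20 runners once demand exceeds the cap.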

Pay for what runs

Our Serverless and compute pricing. Find the right plan for your workload.
GPU / Hardware         List Price   As low as
B300 (288GB)           $8.50        $4.49
B200 (180GB)           $6.25        $3.49
H200 (141GB)           $4.50        $2.10
H100 (80GB)            $3.99        $1.89
RTX PRO 6000 (96GB)    $2.99        $1.10

Inference infrastructure that keeps pace with AI

Battle-tested

Optimized in production, every day

Every optimization, every reliability improvement, every performance gain gets stress-tested against our own production workloads before it ever reaches yours. We ship to ourselves first.
Elastic

Evergreen by necessity

The AI model landscape moves fast. Because we're continuously onboarding new models and architectures to our own platform, fal Serverless is constantly updated to support new model formats, serving patterns, and hardware optimizations.
Dedicated eng

Trusted by teams building what's next

We've served billions of inference requests across thousands of models. That scale is why Canva, HeyGen, Krea, and many more chose fal when it mattered most.
1,300+ endpoints in production
Scale to 1000s of GPUs
99.99% uptime SLA
Billions of requests served a year

Video

Realtime

Comfy UI

World-Models

3D

LoRA Training

Built by fal to run fal

2022
2023
2024
2025
2026

Inference runtime is born

We created fal Serverless to run our own inference workloads.

100+ models deployed

Queues, webhooks, caching, logging, and analytics infrastructure turned the runtime into a production platform used to serve models at scale.

Expanded to realtime and multimodal

As AI expanded beyond images, fal Serverless evolved to support realtime apps, audio, video, 3D, containers, and complex workflows.

Multi-GPU and next-gen hardware

Multi-GPU execution and support for H200/B200-class infrastructure enabled larger models, faster video generation, and higher-throughput inference.

World Model Accelerator

A new interface to fal's core primitives, purpose-built for world models.

FAQ

What is serverless inference?

Serverless inference lets you run AI models without managing GPU infrastructure. Traditional serverless platforms focus on general cloud functions, while fal is purpose-built for AI inference, with lightning-fast execution, scalability, and enterprise reliability. fal handles GPU provisioning, autoscaling, cold starts, observability, and production deployment so teams can run custom image, video, audio, 3D, and world models with low latency and usage-based pricing. It is well suited to workloads with bursty demand, fast iteration cycles, strict latency targets, and production-scale AI inference. fal's serverless infrastructure doesn't just power our customers' applications; it powers fal itself: every model, every inference call, every workload running on our platform. So the reliability bar here isn't theoretical.

What’s the best platform to deploy custom AI models?

fal is purpose-built for deploying custom AI models in production, especially generative media workloads like image, video, audio, 3D, and world models. Teams use fal Serverless for its low GPU pricing, fast cold starts, low latency, high throughput, enterprise reliability, and hands-on support from AI infrastructure experts. Today over 2.5 million developers build on fal, and companies like Canva, HeyGen, Krea, Veed, Creatify, and Fashn deploy custom AI models on fal Serverless. fal processes millions of daily inference calls with 99.99% uptime, and demand continues to accelerate as more developers integrate generative media capabilities into their applications. fal also provides access to more than 1,000 production-ready image, video, audio, and 3D models through a unified API, enabling developers to build and scale generative media applications with enterprise-grade reliability.

Which platform has the best GPU pricing for H100, B200, and B300 inference?

fal offers highly competitive serverless GPU pricing for modern AI inference, including H100-, B200-, and B300-class workloads. fal supports state-of-the-art hardware for inference and compute, including B300, B200, H200, H100, H100 MIG, RTX PRO 6000, A100 80GB, and L40S/L40. For the latest pricing, contact sales for a quote tailored to your model, traffic pattern, latency target, and business needs.

How am I billed for Serverless?

You are billed per-second for the total time your runners are alive, at the rate for your chosen machine type. This includes setup(), idle time (including keep_alive), active request processing, draining, and teardown. You are not billed for pending time or container image pulls. See Serverless Pricing for the full breakdown by runner state.
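As a worked example of the billing rule above, the sketch below sums only the billed runner states and converts the total to a cost. Two things are assumptions: the state keys are illustrative labels for the states named in the answer, and the rate is treated as an hourly price (the pricing list does not state its unit). billable_seconds and runner_cost are hypothetical helpers, not part of fal's tooling.

```python
# Runner states that count toward the bill, per the answer above.
BILLED_STATES = ("setup", "idle", "active", "draining", "teardown")
# States that are explicitly not billed.
UNBILLED_STATES = ("pending", "image_pull")


def billable_seconds(seconds_by_state: dict[str, float]) -> float:
    """Sum the time spent in billed states; unbilled states are ignored."""
    return sum(seconds_by_state.get(state, 0.0) for state in BILLED_STATES)


def runner_cost(seconds_by_state: dict[str, float], hourly_rate_usd: float) -> float:
    """Per-second billing, assuming the listed rate is an hourly price."""
    return billable_seconds(seconds_by_state) * hourly_rate_usd / 3600.0
```

For instance, a runner that spends 30 s in setup, 200 s processing requests, 60 s idle, and 5 s each draining and tearing down accrues 300 billable seconds, regardless of any pending or image-pull time before it. At an assumed $3.99 per hour, that works out to 300 × 3.99 / 3600 ≈ $0.33.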

How easy is it to migrate from another platform?

If you already have a working Docker container or a Python inference server, migrating to fal is straightforward. You can bring your existing Dockerfile directly with custom container images, or wrap your model in a fal.App class with setup() and endpoint methods. fal has step-by-step migration guides for Replicate, Modal, RunPod, and generic Docker servers.
