WebSocket-based inference for ultra-low latency applications
Real-time inference uses WebSockets for persistent connections, enabling sub-100ms image generation. This is ideal for interactive applications like real-time creativity tools and camera-based inputs.

Unlike queue-based inference, real-time connections bypass the queue entirely and route inputs directly to a runner. This eliminates queue wait time, and because the WebSocket maintains a persistent connection, the runner stays warm for all subsequent messages after the initial connection. The first connection may still incur a cold start if no runner is already available.
Only models that explicitly support real-time inference can be used with the realtime client. Standard queue-based models do not have a realtime endpoint.
WebSocket connections from browsers cannot safely embed API keys. There are two approaches for client-side authentication: a proxy URL or a token provider.
For more control, use a tokenProvider function that fetches short-lived JWT tokens from your backend. This is useful when you need per-user authentication or want to restrict which apps a token can access.
Protect your token endpoint with authentication. The endpoint that generates fal tokens should verify that the request comes from an authenticated user in your application. Without proper authentication, anyone could use your endpoint to generate tokens and consume your fal credits.
Client-side example:
```typescript
import { fal, type TokenProvider } from "@fal-ai/client";

// app includes the full endpoint path, e.g. "fal-ai/fast-lcm-diffusion/realtime"
const myTokenProvider: TokenProvider = async (app) => {
  const response = await fetch(`/api/fal/token?app=${encodeURIComponent(app)}`);
  const { token } = await response.json();
  return token;
};

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  tokenProvider: myTokenProvider,
  tokenExpirationSeconds: 120, // match the duration from your backend
  onResult: (result) => {
    console.log(result);
  },
});

connection.send({
  prompt: "a cat",
  sync_mode: true,
});
```
Pass tokenExpirationSeconds to enable automatic token refresh before expiry. Set it to the same value as the duration in your backend's token request. If omitted, auto-refresh is disabled and your tokenProvider is called once at connection time.

Next.js API Route example (app/api/fal/token/route.ts):
```typescript
import { NextRequest, NextResponse } from "next/server";

export async function GET(request: NextRequest) {
  // IMPORTANT: Add your own authentication logic here
  // const session = await getServerSession();
  // if (!session) {
  //   return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  // }

  const { searchParams } = new URL(request.url);
  const app = searchParams.get("app");
  if (!app) {
    return NextResponse.json({ error: "Missing app parameter" }, { status: 400 });
  }

  const response = await fetch("https://rest.fal.ai/tokens/realtime", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Key ${process.env.FAL_KEY}`,
    },
    // app includes the full path (e.g. "fal-ai/fast-lcm-diffusion/realtime")
    body: JSON.stringify({
      allowed_apps: [app],
      duration: 120,
    }),
  });

  const data = await response.json();
  return NextResponse.json({ token: data.token });
}
```
The tokenProvider also works for streaming with connectionMode: "client".
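As an illustrative sketch, the same token-provider shape from the realtime example can be reused here. The fal.stream call in the trailing comment is an assumption about the streaming API, not a verified signature; the fetch dependency is injectable so the provider can be exercised without a backend.

```typescript
// Minimal fetch-like shape so the provider can run without a real backend.
type FetchLike = (url: string) => Promise<{ json(): Promise<any> }>;

// Same provider logic as the realtime example above, with fetch injectable.
const makeTokenProvider =
  (fetchFn: FetchLike = fetch) =>
  async (app: string): Promise<string> => {
    const response = await fetchFn(`/api/fal/token?app=${encodeURIComponent(app)}`);
    const { token } = await response.json();
    return token;
  };

// Assumed usage with the streaming client (signature not verified):
// await fal.stream("fal-ai/fast-lcm-diffusion", {
//   connectionMode: "client",
//   input: { prompt: "a cat" },
// });
```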
Real-time WebSocket connections bypass the queue and connect directly to a runner. Several request parameters that work with queue-based inference do not apply:
| Parameter | Behavior with real-time |
| --- | --- |
| start_timeout | No effect; there is no queue wait |
| priority | No effect; there is no queue ordering |
| webhook_url | Not supported; results stream back over the WebSocket |
| Automatic retries | Not available; failed messages return errors on the connection |
By default, the realtime client connects to the /realtime path on the app (e.g., wss://fal.run/fal-ai/my-app/realtime). If your app exposes a realtime endpoint at a different path, use the path option.
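A minimal sketch of how the WebSocket URL is derived from the app id and path, assuming the wss://fal.run host from the example above; the custom path "/ws" in the comment is a hypothetical placeholder.

```typescript
// Sketch: the realtime URL is the app id joined with the (default or custom) path.
const realtimeUrl = (app: string, path: string = "/realtime"): string =>
  `wss://fal.run/${app}${path}`;

// Default path:
// realtimeUrl("fal-ai/my-app") === "wss://fal.run/fal-ai/my-app/realtime"

// Assumed usage of the path option in the client ("/ws" is hypothetical):
// fal.realtime.connect("fal-ai/my-app", {
//   path: "/ws",
//   onResult: (result) => console.log(result),
// });
```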
Both realtime and streaming give you faster feedback than polling, but they serve different use cases.
| Feature | Realtime (WebSocket) | Streaming (SSE) |
| --- | --- | --- |
| Direction | Bidirectional (client and server) | One-way (server to client) |
| Connection | Persistent, reusable | New connection per request |
| Latency | Lower (connection reuse) | Higher (new connection each time) |
| Best for | Interactive apps, back-to-back requests | Progressive output, previews |
| Protocol | Binary msgpack (default, customizable) | JSON over SSE |
Use realtime when clients send multiple requests in quick succession over a persistent connection, like interactive image editing or camera-based inputs. Use streaming when you want to show progressive output from a single request, like image generation previews or LLM tokens.
The realtime client uses msgpack for binary serialization by default across all SDKs, which is more efficient than JSON for transmitting image data. In Python, realtime() and realtime_async() provide a RealtimeConnection with send() and recv() methods. In JavaScript, fal.realtime.connect() uses callback-based onResult and onError handlers.

In the JavaScript client, you can customize the message encoding by passing encodeMessage and decodeMessage options, for example to use JSON instead of msgpack.
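A minimal sketch of such a JSON codec, assuming encodeMessage receives the outgoing message object and decodeMessage receives the incoming binary payload (the Uint8Array buffer type is an assumption, not a documented signature):

```typescript
// Hypothetical JSON codec: encode outgoing messages as UTF-8 JSON bytes,
// decode incoming bytes back into objects.
const encodeMessage = (message: unknown): Uint8Array =>
  new TextEncoder().encode(JSON.stringify(message));

const decodeMessage = (buffer: Uint8Array): unknown =>
  JSON.parse(new TextDecoder().decode(buffer));

// Assumed usage:
// const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
//   encodeMessage,
//   decodeMessage,
//   onResult: (result) => console.log(result),
// });
```

Swapping the codec this way trades msgpack's compactness for human-readable payloads, which can simplify debugging.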