
Serving a function

This section shows how to deploy a function as a web endpoint, so that you can call it from any language that supports HTTP requests.

Python functions to Web Endpoints

Any fal function can be turned into a production-ready web endpoint with a single configuration change. When the serve=True option is added to the @fal.function decorator, fal wraps the function in a FastAPI web server. Like any other fal function, this web server runs serverlessly and scales down to zero when it is not in use.

import fal
 
@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
    serve=True,
)
def tell_a_joke() -> str:
    import pyjokes
 
    joke = pyjokes.get_joke()
    return joke

Deploying a basic app through the CLI

This function can be deployed as a serverless web endpoint by running the following command:

fal deploy ./path/to/tell_joke.py::tell_a_joke --app-name joke

You'll receive a revision ID in the following format: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX. This is the revision ID of your deployed serverless app. Every time you run the fal deploy command, a new revision ID is generated. Old revisions are kept, so you can still access them.

Deploying a function with the --app-name option creates a URL that includes the application name you specified instead of the revision ID. If you deploy a new revision with the same application name, the URL will point to the most recent revision of the function.

>> Registered a new revision for function 'joke'  (revision='21847a72-93e6-4227-ae6f-56bf3a90142d').
>> URL: https://fal.run/000000/joke

Passing arguments and leveraging Pydantic

fal functions and FastAPI are fully compatible with Pydantic. Any Pydantic feature used in function arguments will also work.

Pydantic features can be used for data validation in your function. The example below marks one parameter as optional, sets default values, and applies constraints such as types and numeric bounds.

inference_image_file_serve.py
import fal
 
from pydantic import BaseModel, Field
from fal.toolkit import Image
 
MODEL_NAME = "google/ddpm-cat-256"
 
 
class ImageModelInput(BaseModel):
    seed: int | None = Field(
        default=None,
        description="""
            The same seed and the same prompt given to the same version of Stable Diffusion
            will output the same image every time.
        """,
        examples=[176400],
    )
    num_inference_steps: int = Field(
        default=25,
        description="""
            Increasing the number of steps tells the model to take more steps
            to generate the final result, which can increase the amount of detail in your image.
        """,
        gt=0,
        le=100,
    )
 
 
@fal.function(
    requirements=[
        "diffusers[torch]",
        "transformers",
        "pydantic<2",
    ],
    machine_type="GPU-A100",
    keep_alive=60,
    serve=True,
)
def generate_image(input: ImageModelInput) -> Image:
    import torch
    from diffusers import DDPMPipeline
 
    pipe = DDPMPipeline.from_pretrained(MODEL_NAME, use_safetensors=True)
    pipe = pipe.to("cuda")
    result = pipe(
        num_inference_steps=input.num_inference_steps,
        # Compare against None explicitly so that a seed of 0 is honored.
        generator=torch.manual_seed(
            input.seed if input.seed is not None else torch.seed()
        ),
    )
    return Image.from_pil(result.images[0])
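The constraints declared on ImageModelInput are checked by Pydantic on the server, and FastAPI normally rejects a request body that violates them before the function body runs. Purely as an illustration, the sketch below mirrors those constraints in plain Python to show which request bodies would pass. check_payload is a hypothetical helper, not part of fal or Pydantic, and it does not reproduce Pydantic's type coercion.

```python
from typing import Any


def check_payload(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the body looks valid.

    Hypothetical client-side mirror of the ImageModelInput constraints.
    """
    problems = []

    # seed: optional integer, defaults to None
    seed = payload.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("seed must be an integer or null")

    # num_inference_steps: integer, default 25, gt=0 and le=100
    steps = payload.get("num_inference_steps", 25)
    if not isinstance(steps, int):
        problems.append("num_inference_steps must be an integer")
    elif not 0 < steps <= 100:
        problems.append("num_inference_steps must be > 0 and <= 100")

    return problems
```

An empty body passes because both fields have defaults, while a body such as {"num_inference_steps": 0} is reported as out of range.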

Running functions

A deployed function becomes a web endpoint that expects a POST request with a JSON body and returns the JSON representation of the function's result. Any language that supports HTTP requests can therefore call a deployed function, using its standard or community-provided libraries.

The example below calls a deployed endpoint with curl; substitute the URL and JSON body of your own app:

curl --request POST \
  --url https://fal.run/fal-ai/fast-lightning-sdxl \
  --header "Authorization: Key $FAL_KEY" \
  --header 'Content-Type: application/json' \
  --data '{ "prompt": "a cute puppy" }'
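The same request can be made from Python. The sketch below uses only the standard library and takes the URL, header format, and body from the curl command above. The helper names build_request and call_endpoint are hypothetical, and the response is assumed to be a JSON document.

```python
import json
import urllib.request


def build_request(url: str, key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request for a deployed endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_endpoint(url: str, key: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(url, key, payload)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling call_endpoint("https://fal.run/fal-ai/fast-lightning-sdxl", os.environ["FAL_KEY"], {"prompt": "a cute puppy"}) after importing os would send the same request as the curl command.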

Running a Web Endpoint Function through SDK

For fast iteration during development, it is advisable to invoke the fal functions directly from your local Python environment. You can do this by passing the serve=False option to the on method of the fal function. That will return a new function reference that you can call directly without the need to publish the function as an endpoint.

Using the previous generate_image example, add this to the end of the file:

inference_image_file_serve.py
if __name__ == "__main__":
    generate_image_through_sdk = generate_image.on(
        serve=False,
        keep_alive=None,
    )
    cat_image = generate_image_through_sdk(input=ImageModelInput(seed=176400))
    print(f"Here is your cat: {cat_image.url}")

Now you can execute it like any other Python file in your local environment. Note that this example not only changes serve to False but also sets keep_alive to None, because the keep_alive option here is only relevant when the function is deployed as an endpoint. You may have use cases where you want to run subsequent tests locally and keep the function alive, so adjust it as needed.

Authentication

By default, each registered function is private. To access the web endpoint, all requests need to be authenticated.

The simplest way to authenticate is to create a key and send it in the Authorization header of each request. If this is your first time accessing a web endpoint, navigate to [Key management](https://fal.ai/dashboard/keys) to create keys.

curl -X POST https://fal.run/1714827/joke \
 -H "Authorization: Key $FAL_KEY"
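In Python, the header can be built from the environment in the same way. This is a minimal sketch, assuming the key is stored in the FAL_KEY environment variable as in the curl command above; auth_headers is a hypothetical helper name, not part of the fal SDK.

```python
import os
from typing import Mapping, Optional


def auth_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the Authorization header from FAL_KEY, failing early if unset."""
    env = os.environ if env is None else env
    key = env.get("FAL_KEY", "").strip()
    if not key:
        raise RuntimeError("FAL_KEY is not set; create a key in the dashboard first")
    # Header format taken from the curl example: "Authorization: Key <key>"
    return {"Authorization": f"Key {key}"}
```

Failing early on a missing key gives a clearer error than an authentication failure returned by the endpoint.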

Public Web Endpoints

Alternatively, you can mark your web endpoint as public. When an endpoint is marked as public, the authentication step provided by fal is skipped, and your endpoint is publicly accessible on the internet.

fal deploy ./path/to/server.py::tell_a_joke --app-name joke --auth public

Checking Logs

Web endpoint logs can be accessed via the fal CLI and the Logs tab of the dashboard. Visit the Logs Viewer on the fal dashboard.

Scaling Web Endpoints

You can configure the maximum number of instances of your app that can run at the same time using the max_concurrency property of the fal.function decorator. By default, this property is set to 2.

import fal
 
@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
    serve=True,
    max_concurrency=5,
)
def tell_a_joke() -> str:
    import pyjokes
 
    joke = pyjokes.get_joke()
    return joke

2023 © Features and Labels Inc.