Private Serverless Models on GPUs

Introduction to Private Serverless Models

In addition to using our blazing-fast public API endpoints, you can also take advantage of fal's infrastructure for your own private AI models. This section explains how to deploy a custom private AI model to fal's infrastructure.

Defining a Function: Leveraging Python's Rich Ecosystem

A function represents the simplest unit of code that can be deployed to the fal runtime.

Let's start with a fun example: using pyjokes inside one of our serverless functions. fal functions can leverage Python's powerful package ecosystem.

import fal
 
 
@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    import pyjokes
 
    joke = pyjokes.get_joke()
    return joke
 
 
print("Joke from the clouds: ", tell_joke())

As soon as this function is called, fal creates a new virtual environment in the cloud and installs the set of requirements that we passed. From that point on, our code is executed as if it were running locally, and the joke prepared by the pyjokes library is returned.

One thing you might have noticed is that we imported the pyjokes library inside the function body rather than at the top of the file.

Because the import happens inside the function, the snippet above works even if pyjokes is not installed in your local Python environment; the dependency only needs to exist in the remote function's environment, where fal installs it for you. This is particularly important when dealing with complex dependency chains.
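
For contrast, here is a sketch of what a module-level import would look like. This version fails locally with a ModuleNotFoundError unless pyjokes happens to be installed on your machine, even though the function itself only ever runs in the cloud:

import fal
import pyjokes  # module-level import: evaluated locally when the file is
                # loaded, so it fails if pyjokes is not installed here


@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    # The remote environment also has pyjokes, so this call itself would work,
    # but the local import above already prevents the script from starting.
    return pyjokes.get_joke()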

Running Basic AI Models

Now that we know how to leverage an existing Python package, we can use transformers to run a simple ML workflow such as text classification, all in the cloud and with no infrastructure to manage!

import fal
 
# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
 
 
@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline
 
    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
 
    return result["label"], result["score"]
 
 
if __name__ == "__main__":
    sentiment, confidence = classify_text("I like apples.")
    print(
        f"Sentiment of the subject prompt: {sentiment!r} "
        f"with a confidence of {confidence}"
    )

One new concept you might have noticed is the machine_type annotation, which denotes what kind of machine your workflow runs on. Since this is an ML inference workload, we choose an M tier machine (which offers considerably more compute than the default XS machines). Running our workflow on one of these machines takes roughly 15 seconds on the initial invocation (after the environment has been built).
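
For GPU-bound models, the same machine_type knob is where you would select a GPU-backed machine. The sketch below is illustrative only: the "GPU" tier name is a placeholder (check fal's machine type reference for the identifiers available to your account), and device=0 simply asks transformers to place the model on the first GPU.

import fal

TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    # Placeholder tier name: replace with an actual GPU machine type
    # from fal's machine type list.
    machine_type="GPU",
)
def classify_text_on_gpu(text: str) -> tuple[str, float]:
    from transformers import pipeline

    # device=0 places the model on the first GPU of the remote machine.
    pipe = pipeline(
        "text-classification", model=TEXT_CLASSIFICATION_MODEL, device=0
    )
    [result] = pipe(text)
    return result["label"], result["score"]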

Faster subsequent invocations

Each time you invoke a fal function, the fal runtime automatically provisions a new machine for you in the cloud (based on the function's properties, such as its environment and machine type), runs your function, returns the result, and then releases that machine.

This is great for cost savings when usage is sparse. If your traffic pattern includes bursts of subsequent invocations, however, it can be a good idea to keep the machine around a little longer. This reduces the impact of cold starts: you avoid paying the machine provisioning time again and, more importantly, you keep the same Python process (and everything it has already loaded) around.

import fal
 
# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
 
 
@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
    keep_alive=60,
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline
 
    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
    return result["label"], result["score"]
 
 
if __name__ == "__main__":
    for prompt in [
        "I like apples.",
        "I hate oranges.",
        "I have mixed feelings about pineapples.",
    ]:
        sentiment, confidence = classify_text(prompt)
        print(
            f"Sentiment of the subject prompt: {sentiment!r} "
            f"with a confidence of {confidence}"
        )

After setting keep_alive on the function, our invocations go from ~15 seconds to ~3 seconds each: almost a 5x speed-up, simply from being able to keep the same Python process and its imported modules in memory.
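
If you want to observe the effect yourself, you can time the calls from the client side. A minimal sketch, assuming the keep_alive-enabled classify_text from the example above is defined in the same module (absolute numbers will vary with machine type, model size, and whether the environment has already been built):

import time

if __name__ == "__main__":
    prompts = [
        "I like apples.",
        "I hate oranges.",
        "I have mixed feelings about pineapples.",
    ]
    for prompt in prompts:
        start = time.perf_counter()
        sentiment, confidence = classify_text(prompt)
        elapsed = time.perf_counter() - start
        print(f"{prompt!r} -> {sentiment!r} ({confidence:.2f}) in {elapsed:.1f}s")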

