Private Serverless Models on GPUs

Introduction to Private Serverless Models

In addition to using our blazing-fast public API endpoints, you can also take advantage of fal's infrastructure for your own private AI models. This section explains how to deploy a custom private AI model to fal's infrastructure.

Defining a Function: Leveraging Python's Rich Ecosystem

A function represents the simplest unit of code that can be deployed to the fal runtime.

Let's start with a fun example: using pyjokes inside one of our serverless functions. fal functions can leverage Python's powerful package ecosystem.

import fal
 
 
@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    import pyjokes
 
    joke = pyjokes.get_joke()
    return joke
 
 
print("Joke from the clouds: ", tell_joke())

As soon as this function is called, fal creates a new virtual environment in the cloud and installs the set of requirements that we passed. From that point on, our code is executed as if it were running locally, and the joke prepared by the pyjokes library is returned.

One thing you might have noticed is that we imported the pyjokes library inside the function body rather than at the top of the file.

Because the import happens inside the function, the snippet above works even if pyjokes is not installed in your local Python environment; the dependency only needs to exist in the remote function's environment, where fal installs it for you. This is particularly important when dealing with complex dependency chains.
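
For contrast, here is a sketch of what a module-level import would look like. This version fails locally with a ModuleNotFoundError unless pyjokes happens to be installed on your machine, even though the function itself only ever runs in the cloud:

import fal
import pyjokes  # module-level import: evaluated locally when the file is
                # loaded, so it fails if pyjokes is not installed here


@fal.function(
    "virtualenv",
    requirements=["pyjokes"],
)
def tell_joke() -> str:
    # The remote environment also has pyjokes, so this call itself would work,
    # but the local import above already prevents the script from starting.
    return pyjokes.get_joke()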

Running Basic AI Models

Now that we know how to leverage an existing Python package, we can use transformers to run a simple ML workflow such as text classification, all in the cloud and with no infrastructure to manage!

import fal
 
# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
 
 
@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline
 
    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
 
    return result["label"], result["score"]
 
 
if __name__ == "__main__":
    sentiment, confidence = classify_text("I like apples.")
    print(
        f"Sentiment of the subject prompt: {sentiment!r} "
        f"with a confidence of {confidence}"
    )

One new concept you might have noticed is the machine_type annotation, which denotes what kind of machine your workflow runs on. Since this is an ML inference workload, we choose an M tier machine (which offers considerably more compute than the default XS machines). Running our workflow on one of these machines takes roughly 15 seconds on the initial invocation (after the environment has been built).
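
For GPU-bound models, the same machine_type knob is where you would select a GPU-backed machine. The sketch below is illustrative only: the "GPU" tier name is a placeholder (check fal's machine type reference for the identifiers available to your account), and device=0 simply asks transformers to place the model on the first GPU.

import fal

TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    # Placeholder tier name: replace with an actual GPU machine type
    # from fal's machine type list.
    machine_type="GPU",
)
def classify_text_on_gpu(text: str) -> tuple[str, float]:
    from transformers import pipeline

    # device=0 places the model on the first GPU of the remote machine.
    pipe = pipeline(
        "text-classification", model=TEXT_CLASSIFICATION_MODEL, device=0
    )
    [result] = pipe(text)
    return result["label"], result["score"]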

Faster subsequent invocations

Each time you invoke a fal function, the fal runtime automatically provisions a new machine for you in the cloud (based on the function's properties, such as its environment and machine type), runs your function, returns the result, and then releases that machine.

This is great for cost savings when usage is sparse. If your traffic pattern includes bursts of subsequent invocations, however, it can be a good idea to keep the machine around a little longer. This reduces the impact of cold starts: you avoid paying the machine provisioning time again and, more importantly, you keep the same Python process (and everything it has already loaded) around.

import fal
 
# Can be any model from HF's model hub, see https://huggingface.co/models
TEXT_CLASSIFICATION_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
 
 
@fal.function(
    "virtualenv",
    requirements=["transformers", "datasets", "torch"],
    machine_type="M",
    keep_alive=60,
)
def classify_text(text: str) -> tuple[str, float]:
    from transformers import pipeline
 
    pipe = pipeline("text-classification", model=TEXT_CLASSIFICATION_MODEL)
    [result] = pipe(text)
    return result["label"], result["score"]
 
 
if __name__ == "__main__":
    for prompt in [
        "I like apples.",
        "I hate oranges.",
        "I have mixed feelings about pineapples.",
    ]:
        sentiment, confidence = classify_text(prompt)
        print(
            f"Sentiment of the subject prompt: {sentiment!r} "
            f"with a confidence of {confidence}"
        )

After setting keep_alive on the function, our invocations go from ~15 seconds to ~3 seconds each: almost a 5x speed-up, simply from being able to keep the same Python process and its imported modules in memory.
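
If you want to observe the effect yourself, you can time the calls from the client side. A minimal sketch, assuming the keep_alive-enabled classify_text from the example above is defined in the same module (absolute numbers will vary with machine type, model size, and whether the environment has already been built):

import time

if __name__ == "__main__":
    prompts = [
        "I like apples.",
        "I hate oranges.",
        "I have mixed feelings about pineapples.",
    ]
    for prompt in prompts:
        start = time.perf_counter()
        sentiment, confidence = classify_text(prompt)
        elapsed = time.perf_counter() - start
        print(f"{prompt!r} -> {sentiment!r} ({confidence:.2f}) in {elapsed:.1f}s")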

