Create and Deploy a VllmPipe

A C3 AI cluster running applications that use large language models (LLMs), vision language models (VLMs), embedding models, or other large models may require a Model Inference Service to host and serve those models.

The C3 AI Model Inference Service (MIS) is a C3 Agentic AI Platform Microservice for low latency serving of machine learning (ML) models, including LLMs. With C3 AI MIS, you can host any MlAtomicPipe from the C3 AI Model Registry for a "warm" deployment and manage routing of all inference requests.

This topic addresses how to create a VllmPipe, register it to the C3 AI Model Registry service, create an App.NodePool, and deploy the VllmPipe to the specified NodePool. These steps are required to serve LLMs for text generation on the C3 Agentic AI Platform.

See also Use C3 AI MIS for LLM Text Generation and Inference Requests for next steps after setting up the model serving in the C3 Agentic AI Platform.

Overview of model serving on the C3 Agentic AI Platform

To serve an LLM for text generation on the C3 Agentic AI Platform, you need to:

  • Create a VllmPipe
  • Register the VllmPipe to the C3 AI Model Registry Service
  • Deploy the VllmPipe
  • Set an access route

This topic addresses the first three. See Manage Routes to Change or Upgrade LLMs for details about setting an access route, as well as managing and deploying routes.

Create a VllmPipe

A VllmPipe can be created from model files you have downloaded or by loading them from Hugging Face Hub.

See the vLLM Supported Models website for information about supported models.

In Jupyter, create a VllmPipe using any of the following sources:

  • A model ID
  • A local path
  • A remote URL

With any of these methods, pass in the tensorParallelSize argument. Ensure that this parameter matches the number of GPUs in the App.NodePool. Using more GPUs lets you fit larger models on smaller GPUs (for example, running Falcon-40b on 4x L4) and supports more concurrent users by increasing overall tokens per second. However, it may slightly increase latency or reduce tokens per second for a single request.
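As a back-of-envelope sanity check, you can verify that a given tensorParallelSize provides enough aggregate vRAM for the model before deploying. The sketch below is plain Python, not a C3 API; the helper name, parameter counts, vRAM sizes, and overhead factor are illustrative assumptions:

```python
# Illustrative sanity check (plain Python, not a C3 API): do the model weights,
# plus a rough allowance for activations and KV cache, fit in aggregate vRAM?
# All numbers below are assumptions for the example.

def fits_in_vram(param_count, bytes_per_param, gpu_vram_gb, tensor_parallel_size,
                 overhead_factor=1.25):
    """Return True if the model fits across tensor_parallel_size GPUs."""
    weights_gb = param_count * bytes_per_param / 1e9
    required_gb = weights_gb * overhead_factor
    return required_gb <= gpu_vram_gb * tensor_parallel_size

# Falcon-40B in float16: ~40e9 params * 2 bytes ~= 80 GB of weights
print(fits_in_vram(40e9, 2, 40, 4))  # 4x A100-40GB (160 GB total) -> True
print(fits_in_vram(40e9, 2, 16, 4))  # 4x 16 GB GPUs (64 GB total) -> False
```

If the check fails, increase tensorParallelSize (and the node pool's GPU count to match) or choose GPUs with more vRAM.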

For testing purposes, you may want to deploy a model from Hugging Face Hub by creating a VllmPipe from the Hugging Face model ID. This is not recommended in production, both because it depends on the availability of Hugging Face Hub servers and because it is a potential security vulnerability:

Python
modelId = "tiiuae/falcon-40b"
pipe = c3.VllmPipe.fromModelId(modelId, tensorParallelSize=4, trustRemoteCode=True)
pipe

Create a VllmPipe from a local path or remote URL

If you have model files locally, you can create the VllmPipe by specifying the local_path for the directory containing the model files and running the following:

Python
local_path = 'path/to/model/'
pipe = c3.VllmPipe.fromLocalPath(local_path, tensorParallelSize=4)

If the model files are stored remotely, you can create the VllmPipe by specifying the remote_url for the location containing the model files and running the following:

Python
remote_url = 'gcs://c3--datasets/genai/models/c3-penguin/'  # Replace this with the URL to your model files
pipe = c3.VllmPipe.fromRemoteUrl(remote_url, tensorParallelSize=4, trustRemoteCode=True)

Register the VllmPipe to the C3 AI Model Registry

To serve an LLM on the C3 Agentic AI Platform, a VllmPipe associated with an LLM must be registered to the C3 AI Model Registry Service.

If the Model Registry is not yet configured, run the following code snippet from Jupyter:

Python
registryServiceApp = c3.App.forName("registryservice")  # Replace "registryservice" if using a different model registry service than the default
c3.ModelRegistry.setServiceAppId(registryServiceApp.id)

or from the console of the app:

JavaScript
// JavaScript
registryServiceApp = App.forName("registryservice")  // Replace "registryservice" if using a different model registry service than the default
ModelRegistry.setServiceAppId(registryServiceApp.id)

For more information, see Create and Configure the C3 AI Model Inference Service.

This pipe registration must be done from one of the following applications depending on the choice of architecture:

  • The pipe registration application (Architecture 1)
  • The Model Inference service application (Architecture 2)

Then register the pipe under a URI ("falcon40b") with a description:

Python
c3.ModelRegistry.registerMlPipe(pipe, "falcon40b", "My Falcon40b Pipe!")

See Overview of C3 AI MIS Administration for details regarding architectural design recommendations and package dependencies.

Deploy the VllmPipe

Create App.NodePool for VllmPipe

To deploy the pipe, first create an App.NodePool to host it.

This should be done from the service application. Nodes of this App.NodePool will be used to keep models warm.

It's recommended to have one App.NodePool per deployed pipe; otherwise, warm models will compete for resources, such as GPU and memory.

Note: Depending on your cluster, you might have access to different hardware profiles. This is just an example. Check with your cluster's administrator to see which hardware is available. You can run HardwareProfile.listConfigs().collect() to list all uploaded hardware profiles.

Python
# Python
hwProfile = c3.HardwareProfile.upsertProfile({
    "name": '4x40a100_40cpu_600mem',
    "cpu": 40,
    "memoryMb": 600_000,
    "gpu": 4,
    "gpuKind": 'nvidia-a100-40gb-8',
    "gpuVendor": 'nvidia',
    "diskGb": 500  # Recommended to replace with 2 * <MODEL FILE SIZE>
})
    
name = '4xa100falcon80g' # name must be lowercase alphanumeric

JVM_MIN_MEM = 16_000  # MB; JVM needs 16 GB RAM for c3server
TOTAL_vRAM_GB = 40 * 4  # Total vRAM across all GPUs (4x A100-40GB, matching the profile above)
assert hwProfile.memoryMb >= JVM_MIN_MEM + TOTAL_vRAM_GB * 1_000  # RAM must cover the JVM plus total vRAM; see RAM Requirements section

c3.app().configureNodePool(
    name,                             # name of the node pool to configure 
    1,                                # sets the target node count
    1,                                # sets the minimum node count
    1,                                # sets the maximum node count
    hwProfile,                        # sets the hardware profile
    [c3.Server.Role.SERVICE],         # sets the server role that this node pool will function as                  
    False,                            # optional - specifies whether autoscaling should be enabled
    JVM_MIN_MEM / hwProfile.memoryMb  # optional - percentage of RAM to reserve for JVM
).update()
JavaScript
// JavaScript
hwProfile = HardwareProfile.upsertProfile({
  "name": '4x40a100_40cpu_600mem',
  "cpu": 40,
  "memoryMb": 600_000,
  "gpu": 4,
  "gpuKind": 'nvidia-a100-40gb-8',
  "gpuVendor": 'nvidia',
  "diskGb": 500  // Recommended to replace with 2 * <MODEL FILE SIZE>
})

name = '4xa100falcon80g'  // name must be lowercase alphanumeric

JVM_MIN_MEM = 16_000  // MB; JVM needs 16 GB RAM for c3server
TOTAL_vRAM_GB = 40 * 4  // Total vRAM across all GPUs (4x A100-40GB, matching the profile above)

console.assert(hwProfile.memoryMb >= JVM_MIN_MEM + TOTAL_vRAM_GB * 1_000)  // RAM must cover the JVM plus total vRAM; see RAM Requirements section

C3.app().configureNodePool(
    name,                             // name of the node pool to configure
    1,                                // sets the target node count
    1,                                // sets the minimum node count
    1,                                // sets the maximum node count
    hwProfile,                        // sets the hardware profile
    [Server.Role.SERVICE],            // sets the server role that this node pool will function as
    false,                            // optional - specifies whether autoscaling should be enabled
    JVM_MIN_MEM / hwProfile.memoryMb  // optional - percentage of RAM to reserve for JVM
).update()
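The diskGb field in the hardware profile above is annotated with a recommendation of roughly twice the model file size. As a rough plain-Python estimate (the helper name, parameter count, and bytes-per-parameter are illustrative assumptions, not a C3 API):

```python
# Rough disk sizing following the "2 * <MODEL FILE SIZE>" guidance above.
# Parameter count and dtype width are assumptions for the example.

def recommended_disk_gb(param_count, bytes_per_param, safety_factor=2):
    """Estimate diskGb as safety_factor times the size of the model files."""
    model_file_gb = param_count * bytes_per_param / 1e9
    return int(safety_factor * model_file_gb)

# Falcon-40B stored in float16: ~40e9 params * 2 bytes ~= 80 GB of files
print(recommended_disk_gb(40e9, 2))  # -> 160
```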

See also Configure and Manage Node Pools for more information.

Retrieve and deploy the VllmPipe

We retrieve the latest entry for the "falcon40b" URI from the C3 AI Model Registry Service. To deploy the pipe successfully, this must be performed in the C3 AI MIS application.

Pipe deployment and route management should be executed from the service app.

Python
# Python
vers = c3.ModelRegistry.listVersions(None, filter="contains(uri, 'falcon40b/1')").objs
entry = vers[0]
entry

or

JavaScript
// JavaScript
vers = ModelRegistry.listVersions(null, {filter: "contains(uri, 'falcon40b/1')"}).objs
entry = vers[0]
entry

We use the deploy() API to deploy the entry to the App.NodePool we created.

Python
# Python
c3.ModelInference.deploy(entry, nodePools=["4xa100falcon80g"])

or

JavaScript
// JavaScript
ModelInference.deploy(entry, {nodePools: ["4xa100falcon80g"]})

NOTE: Re-deploying the same pipe with a different configuration is not recommended, as it may cause inconsistencies within the deployment. To avoid this issue, re-register the pipe in the C3 AI Model Registry Service or use a fresh App.NodePool.

Finally, we set a route that can be used to access this deployment.

Python
# Python
c3.ModelInference.setRoute(entry, "qa-model-falcon40b")

or

JavaScript
// JavaScript
ModelInference.setRoute(entry, "qa-model-falcon40b")

Now the client application can use the ModelInference.completion() API with this route to request text generation from this Falcon-40B LLM deployment. To test this, run the following line from the client application in Jupyter:

Python
# Python
c3.ModelInference.completion(route="qa-model-falcon40b", prompts=["hello"], params={'max_tokens': 128})

Or from the console of the client application:

JavaScript
// JavaScript
ModelInference.completion("qa-model-falcon40b", ["hello"], {max_tokens: 128})

Note that for testing, the c3.ModelInferenceService.completion() API can be called from the Model Inference service application. This is different from the c3.ModelInference.completion() API, which is called from the client. As a best practice, run c3.ModelInferenceService.completion() from the Model Inference service application immediately after deploying a model so that the model loading process begins right away.
