The C3 AI Model Inference Service (MIS) is a C3 Agentic AI Platform Microservice for low-latency serving of machine learning (ML) models, including LLMs. With C3 Agentic AI MIS, you can serve any MlAtomicPipe from the C3 Agentic AI Model Registry for a "warm" deployment and manage routing of all inference requests. The MIS UI allows users to serve VllmPipe models from the Studio Managed Model Registry of an application. StudioAdmin users can also manage deployed instances from the Model Inference Microservices section of the Admin Services tab.

For more information about MIS, see Overview of C3 AI Model Inference Service Administration.

Serving an LLM

The entry point for the deployment process is the Model Registry of an app. Select the VllmPipe you wish to serve, then click the Serve Model button (the lightning bolt symbol) in that row. If you try to serve a model that is not a VllmPipe, a modal tells you that the model type is not supported via the UI.

Model Registry Page

Configure MIS Hardware Constraints

The first page you see after clicking the Serve Model button is the Hardware Configuration page. This page contains options for selecting the hardware that hosts the model, with the model's Model Card shown to the right of the options. The specific hardware options are:

GPU Type - the GPU model to use (e.g., A100, T4, L4). The dropdown menu only displays GPUs available to the app.
GPU - the number of GPUs of the selected GPU type to use. If you try to select more GPUs than the cluster has access to, you will see an error banner.
CPU - the number of CPUs to use for the deployment.
Memory (GB) - the amount of RAM to use for the deployment. It is recommended to allocate at least 2x the model size (in GB) of RAM for a deployment.
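
The memory guideline above reduces to simple arithmetic. The following is a minimal sketch, assuming you already know the model size in GB; the function name `recommended_memory_gb` is hypothetical and is not part of any C3 AI API.

```python
import math


def recommended_memory_gb(model_size_gb: float, factor: float = 2.0) -> int:
    """Return the minimum RAM (in GB) suggested by the 2x-model-size guideline.

    Hypothetical helper for illustration only; the result is rounded up to a
    whole number of GB.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return math.ceil(model_size_gb * factor)


# A 13.5 GB model would call for at least 27 GB of RAM under this guideline.
print(recommended_memory_gb(13.5))
```

Treat the result as a lower bound for the Memory (GB) field rather than an exact requirement.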

Additional validation prevents you from advancing if there are errors. The current validation checks the following information for your hardware request:

Cluster availability
Compatibility with the desired vLLM Deployment args
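
To make the two checks concrete, here is a rough sketch of how such validation might be reasoned about. It is not the MIS implementation: the `HardwareRequest` class, the `validate_request` function, and the `available_gpus` mapping are all hypothetical, and the compatibility rule shown (vLLM's `tensor_parallel_size` must not exceed the requested GPU count) is only one plausible example of an incompatibility.

```python
from dataclasses import dataclass


@dataclass
class HardwareRequest:
    """Hypothetical container mirroring the fields on the Hardware Configuration page."""
    gpu_type: str
    gpus: int
    cpus: int
    memory_gb: int


def validate_request(req: HardwareRequest, available_gpus: dict, vllm_args: dict) -> list:
    """Return a list of human-readable errors; an empty list means the request passes."""
    errors = []

    # Check 1: cluster availability (assumed to be a per-GPU-type count).
    free = available_gpus.get(req.gpu_type, 0)
    if req.gpus > free:
        errors.append(
            f"Requested {req.gpus} {req.gpu_type} GPU(s), but only {free} are available."
        )

    # Check 2: compatibility with the vLLM deployment args (illustrative rule only).
    tensor_parallel = vllm_args.get("tensor_parallel_size", 1)
    if tensor_parallel > req.gpus:
        errors.append(
            f"tensor_parallel_size={tensor_parallel} exceeds the {req.gpus} requested GPU(s)."
        )

    return errors


request = HardwareRequest(gpu_type="A100", gpus=2, cpus=8, memory_gb=64)
for problem in validate_request(request, {"A100": 1}, {"tensor_parallel_size": 4}):
    print(problem)
```

Collecting every error before returning, instead of stopping at the first one, matches the idea of showing the user all blocking problems at once.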

Deployment Validation Modals

After you select your hardware configuration, select Continue to move to the Model Configuration page.

Model Hardware Config

Configure MIS Model Configurations

The only required input for the Model Configuration page is the Deployment Name. Enter the name your model will have on the Model Inference Microservices page.

If you need to update the vLLM Deployment Args, you can configure these in the Advanced section.

Advanced settings can impact the performance of your deployment. Only configure these settings if you know what the parameters do.

The modal expects JSON-formatted data and will not let you move forward if you enter improperly formatted JSON. To edit the contents of the modal, select the pen and paper icon. To revert to the default configuration, select the refresh icon.
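
If you want to check your arguments before pasting them into the modal, a quick local JSON check can help. The sketch below uses only Python's standard `json` module; the helper name `parse_deployment_args` is hypothetical. The keys in the example (`tensor_parallel_size`, `max_model_len`, `gpu_memory_utilization`) are standard vLLM engine arguments, but whether the MIS modal accepts these exact key names is an assumption.

```python
import json


def parse_deployment_args(text: str) -> dict:
    """Parse deployment args and raise ValueError with a location hint on bad JSON."""
    try:
        args = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    # Assumption: the args are expected to be a JSON object, not a list or scalar.
    if not isinstance(args, dict):
        raise ValueError("Deployment args must be a JSON object.")
    return args


# Example payload; key names are assumed, see the note above.
example = '{"tensor_parallel_size": 2, "max_model_len": 8192, "gpu_memory_utilization": 0.9}'
print(parse_deployment_args(example))
```

Common mistakes this catches include trailing commas, single-quoted strings, and unquoted keys, none of which are valid JSON.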

Once you are done with this page, you can select the Back button to return to the Hardware Configuration page or the Submit button to initiate the deployment. Any changes to the Deployment Args will be saved if you select the Back button.

Model Config

Managing a Deployed Model

After you deploy a model, a modal appears to notify you of the status of the MIS microservice.

Model Deployment Modals

Studio Admin Permissions

If you are a ClusterAdmin, you can view all MIS instances deployed via the UI in the Services tab of the Admin page. If any models have been deployed via the MIS UI, a Model Inference Microservices tab is also available. This tab allows you to do the following:

Inspect the status of the microservice
Terminate the microservice
View instructions for the microservice

MIS Services

Select the service name to open a markdown file that covers the following:

How users can access the model
How to validate the model runs as expected
How to whitelist different clusters so they can access the model

Model Information

Deploying a Model from the UI vs Code

When serving a model via the Studio UI, keep the following considerations in mind:

Only VllmPipes can be served via the Studio UI.
Each deployment initiated from the UI creates its own app to host the MIS instance, so there is only one route per MIS app. When serving models via MIS from code, the standard approach is to deploy multiple models (routes) on one app.
When selecting GPUs in the UI, the number you specify is the total number of GPUs. When selecting GPUs via code, the number you specify is the number of node pools you are requesting.
Only deployments initiated via the Studio UI appear in the Model Inference Microservices tab.
