
Monitor and Scale the C3 AI Model Inference Service

A C3 AI cluster with applications that use large language models (LLMs), vision language models (VLMs), embedding models, or other large models may require a Model Inference Service to host and serve those models.

The C3 AI Model Inference Service (MIS) is a C3 Agentic AI Platform Microservice for low latency serving of machine learning (ML) models, including LLMs. With C3 AI MIS, you can host any MlAtomicPipe from the C3 AI Model Registry for a "warm" deployment and manage routing of all inference requests.

This topic addresses how to monitor the C3 AI MIS, as well as how to scale nodes using App.NodePools as the inference request volume changes.

Monitor the C3 AI MIS

Run the ModelInference.summary() API to monitor the underlying engines, threadpools, and threads on the C3 AI Model Inference Service.

Python
# Python
summary = c3.ModelInference.summary()
summary

or

JavaScript
// JavaScript
var summary = ModelInference.summary()
c3Grid(summary)

You can get a more detailed description of a deployment using the following example code snippet.

Python
# Python
# Getting name of the first deployment from the table above.
deployment_name = summary['deployment'][0]

deployment_summary = c3.ModelInference.deploymentSummary(deployment_name)
deployment_summary

or

JavaScript
// JavaScript
// Getting name of the first deployment from the table above.
var deploymentName = summary['deployment'].get(0)
var deploymentSummary = ModelInference.deploymentSummary(deploymentName)
c3Grid(deploymentSummary)

If more detail on the state of a node is required, you can check the state of individual threads using the following example code snippet.

Python
# Python
# Getting ID of the first node from the table above.
node_id = deployment_summary['node'][0]
threads_summary = c3.ModelInference.threadsSummary(deployment_name, node_id)
threads_summary

or

JavaScript
// JavaScript
// Getting ID of the first node from the table above.
var nodeId = deploymentSummary['node'].get(0)
var threadsSummary = ModelInference.threadsSummary(deploymentName, nodeId)
c3Grid(threadsSummary)
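The three monitoring calls above can be chained into a single report that covers every deployment and node. The sketch below is illustrative, not part of the C3 AI API: it assumes `c3` is an authenticated connection and that summary() and deploymentSummary() return table-like objects indexable by the 'deployment' and 'node' column names, exactly as in the preceding examples.

```python
# Illustrative sketch: walk summary() -> deploymentSummary() -> threadsSummary()
# for every deployment and node. Column names ('deployment', 'node') are taken
# from the examples above; the shape of the returned tables is an assumption.

def collect_thread_summaries(c3):
    """Return a dict mapping (deployment, node) -> that node's threads summary."""
    report = {}
    summary = c3.ModelInference.summary()
    for deployment_name in summary['deployment']:
        deployment_summary = c3.ModelInference.deploymentSummary(deployment_name)
        for node_id in deployment_summary['node']:
            report[(deployment_name, node_id)] = c3.ModelInference.threadsSummary(
                deployment_name, node_id
            )
    return report
```

In a notebook connected to the cluster, you would call `collect_thread_summaries(c3)` and inspect the resulting dictionary, or print each entry.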

Scale the C3 AI MIS

It may become necessary to scale the service when request volume changes. Currently, this is done manually by changing the number of nodes in the App.NodePool corresponding to the pipe deployment. In the world of LLMs, adding a node to this App.NodePool is sometimes referred to as "creating a replica of the model," because the new node will load its own copy of the model into GPU memory.

NOTE: Scaling operations must be performed from the C3 AI Model Inference service application.

Scaling up

To "create a replica," or scale up a deployment, run the following to increase the number of nodes in the App.NodePool from one to two.

Python
# Python
nodepool_name = "4xa100falcon80g"
num_nodes = 2

c3.app().nodePool(nodepool_name).setNodeCount(num_nodes, num_nodes, num_nodes).update()

or

JavaScript
// JavaScript
var nodepoolName = "4xa100falcon80g"
var numNodes = 2

C3.app().nodePool(nodepoolName).setNodeCount(numNodes, numNodes, numNodes).update()

You can monitor progress using ModelInference.summary(), ModelInference.deploymentSummary(name), or the App.NodePool APIs.
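Scaling takes time while new nodes provision and load model weights into GPU memory, so monitoring is often a matter of polling until the deployment reports the expected node count. The helper below is a generic sketch, not a C3 AI API: the `get_node_count` callable is an assumption that you would wire to one of the monitoring calls above, for example `lambda: len(c3.ModelInference.deploymentSummary(name)['node'])`.

```python
import time

def wait_for_node_count(get_node_count, target, timeout_s=1800, poll_s=30):
    """Poll get_node_count() until it returns target or timeout_s elapses.

    get_node_count: zero-argument callable returning the current node count
    (hypothetical wiring, e.g. backed by ModelInference.deploymentSummary()).
    Returns True if the target count was reached, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_node_count() == target:
            return True
        time.sleep(poll_s)
    return False
```

Checking for equality rather than a minimum makes the same helper usable when scaling down as well as up.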

Scaling down

Scaling down is similar to scaling up. To scale the deployment down to one node, run the following:

Python
# Python
nodepool_name = "4xa100falcon80g"
num_nodes = 1

c3.app().nodePool(nodepool_name).setNodeCount(num_nodes, num_nodes, num_nodes).update()

or

JavaScript
// JavaScript
var nodepoolName = "4xa100falcon80g"
var numNodes = 1

C3.app().nodePool(nodepoolName).setNodeCount(numNodes, numNodes, numNodes).update()
