
Use C3 AI Model Inference Service for LLM Text Generation and Inference Requests

A C3 AI cluster whose applications use large language models (LLMs), vision language models (VLMs), embedding models, or other large models may require a Model Inference Service to host and serve those models.

The C3 AI Model Inference Service (MIS) is a C3 Agentic AI Platform Microservice for low latency serving of machine learning (ML) models, including LLMs. With C3 AI MIS, you can host any MlAtomicPipe from the C3 AI Model Registry for a "warm" deployment and manage routing of all inference requests.

This topic addresses how to use the C3 AI MIS for LLM text generation or for generic inference requests.

See also Create and Deploy a VllmPipe for details about setting up model serving of LLMs for text generation on the C3 Agentic AI Platform.

NOTE: This guide assumes the commands are executed from the C3 AI JupyterLab.

Use C3 AI MIS for LLM text generation

The c3.ModelInference.completion() API is the primary way to perform LLM text generation on the C3 Agentic AI Platform.

This section provides details for using the c3.ModelInference.completion() API, including setting up the inputs to generate text responses.

The inputs to the completion API are as follows:

  • route - string that determines which LLM will be used to generate the responses for your prompts. It refers to a specific route that the Model Inference Service application administrator has set up for you to use, and is often a model name like opt-125m, which would refer to the opt-125m model from Facebook.

  • prompts - list of strings for which the LLM will generate responses.

  • params - map of string to any containing the model-specific parameters used to generate the responses for your prompts.

See the following for an example code snippet of a completion request:

Python
# Define route
route = 'opt-125m'

# Define prompts
prompt_1 = "Respond to this question as if you were a computer scientist: What is the difference between interpreted and compiled programming languages?"
prompt_2 = "Respond to this question as if you were a data scientist: What is the difference between classification and regression?"
prompts = [prompt_1, prompt_2]

# Define params
params = {
    'max_tokens' : 128,
    'temperature' : 0.5,
    'n' : 2
}

# Request LLM responses
res = c3.ModelInference.completion(route=route, prompts=prompts, params=params)
print(res[0]['outputs'][0]['text'])

Set completion routes

The route determines which LLM is used to generate the responses for your prompts. You can check which routes the C3 AI MIS application administrator has made available to you by running the following code snippet.

Python
list(c3.ModelInference.listRoutes())

An example of the output is as follows.

Python
['opt-125m']

We can see that the route opt-125m is available to generate responses. Thus, we will set our route.

Python
route = 'opt-125m'

See also Deploy and Manage Routes.

Define the prompts

Since the completion API accepts a list of strings as the prompts for which the LLM will generate responses, we assemble our list of prompts.

NOTE: It is not recommended to provide more than 10 prompts in one call to the completion API.

Python
prompt_1 = "Respond to this question as if you were a computer scientist: What is the difference between interpreted and compiled programming languages?"
prompt_2 = "Respond to this question as if you were a data scientist: What is the difference between classification and regression?"
prompts = [prompt_1, prompt_2]

Define the completion parameters

Each model supports an assortment of parameters for generating text. For our completions using opt-125m, we will use three text generation parameters that are common across many LLMs: max_tokens, temperature, and n.

  • max_tokens - specifies the maximum number of tokens that the LLM will generate. You can think of tokens roughly as words. In our case, we would like the generated text to have an approximate maximum length of 128 words.

    Python
    max_tokens = 128
  • temperature - specifies how random or creative the text responses from the LLM should be.

    A temperature of 0 makes generation effectively deterministic, as it collapses the sampling to simply choosing the token with the highest probability. Higher temperatures allow the sampling to use tokens with lower probabilities, thus generating more "creative" responses.

    For this tutorial, we will set the temperature to 0.5, which should generate reasonable but different responses.

    Python
    temperature = 0.5
  • n - specifies the number of responses to generate per prompt. For example, if two prompts are included and n is set to 3, six (6) responses are generated. For this tutorial, we will set n to 2.

    Python
    n = 2
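The arithmetic behind n can be verified in plain Python. The snippet below is a hypothetical sanity check (the prompt strings are placeholders, not real inputs): with n responses per prompt, a completion call returns len(prompts) * n responses in total.

```python
# Two prompts, n = 2 responses per prompt, as in this tutorial.
prompts = ["prompt about programming languages", "prompt about ML tasks"]

params = {"max_tokens": 128, "temperature": 0.5, "n": 2}

# Total responses generated = number of prompts x responses per prompt.
total_responses = len(prompts) * params["n"]
print(total_responses)  # 4: 2 prompts x 2 responses each
```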

Tokens in LLMs

Technically speaking, LLMs do not generate tokens directly. Rather, to generate a token, the LLM first produces a vector of scores, called the logit vector, where each entry corresponds to one token in the vocabulary; applying a softmax to the logits yields the probability that each token ought to appear next.

Once this probability distribution is available, a single token is randomly sampled from it. Temperature allows you to control the randomness/creativity of the responses by scaling the logits before the softmax, which adjusts how that random sampling behaves.
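Temperature-based sampling can be sketched in plain Python. This is an illustration of the general technique, not C3 AI MIS internals: dividing the logits by the temperature before applying a softmax sharpens the distribution at low temperatures and flattens it at high ones.

```python
import math
import random

def sample_token(logits, temperature):
    """Sample one token index from a logit vector at the given temperature.

    A temperature near 0 is treated as greedy decoding (argmax), since
    dividing by zero is undefined; higher temperatures flatten the
    distribution and make low-probability tokens more likely.
    """
    if temperature < 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax over scaled logits
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # toy logit vector for a 3-token vocabulary
print(sample_token(logits, 0.0))  # 0: greedy decoding picks the highest logit
```

At temperature 0.5, index 0 is still chosen most of the time; at temperature 5.0, all three indices appear with similar frequency.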

Assemble completion parameters

The completion parameters are passed to the c3.ModelInference.completion() API through the params input. See the following example code snippet:

Python
params = {
    'max_tokens' : 128,
    'temperature' : 0.5,
    'n' : 2
}

Generate text responses

Once the inputs are defined, they can be passed to the c3.ModelInference.completion() API to generate text responses. See the following example code snippet.

Python
res = c3.ModelInference.completion(route=route, prompts=prompts, params=params)
print(res[0]['outputs'][0]['text'])
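The indexing in the print statement above follows the response layout: one entry per prompt, each holding an outputs list with n generated completions. The mocked response below is purely illustrative (the text values are placeholders, not real model output), but it has the same shape the snippet indexes into.

```python
# Mocked completion response with the shape the snippet above indexes into:
# res[prompt_index]['outputs'][response_index]['text'].
res = [
    {"outputs": [{"text": "response 1 to prompt 1"},
                 {"text": "response 2 to prompt 1"}]},
    {"outputs": [{"text": "response 1 to prompt 2"},
                 {"text": "response 2 to prompt 2"}]},
]

# Iterate over every generated response for every prompt.
for i, prompt_result in enumerate(res):
    for j, output in enumerate(prompt_result["outputs"]):
        print(f"prompt {i}, response {j}: {output['text']}")
```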

For streamed responses, pass the inputs to the c3.ModelInference.streamCompletion() API. Note that the c3.ModelInference.streamCompletion() API can only accept one prompt at a time, unlike the c3.ModelInference.completion() API, which accepts an array of prompts.

Use C3 AI MIS for generic inference requests

The c3.ModelInference.process() API is the primary way to request inference from an MlAtomicPipe that has been deployed to a Model Inference Service.

The inputs to the process API are as follows:

  • route - string that determines which deployment will be used for inference. It refers to a specific route that the Model Inference Service application administrator has set up for you to use.

  • Any other inputs needed for the MlAtomicPipe's doProcess() method.
