
Configure and Manage Node Pools

This documentation provides an in-depth guide to configuring all aspects of App.NodePools. It is structured into the following parts:

  • Part 1: Configure and manage node pools
  • Part 2: Start and terminate node pools
  • Part 3: Configure hardware profiles
  • Part 4: Start an application with specified node counts
  • Part 5: Configure ephemeral volume size for a node pool
  • Part 6: Monitor status of node pools
  • Part 7: Run jobs on node pools
  • Part 8: Current limitations of node pools and workarounds

Readers may skip to the sections relevant to their needs.

Background

Node pools can be created and modified as needed to support various application use cases. Node pools are composed of Server instances, and each pool is assigned a Server.Role that describes which actions the Servers in the pool process. Actions generally fall into two categories: synchronous actions, such as user interactions in the console or Jupyter, UI page loads, and data fetches; and asynchronous actions, such as distributed jobs and queue tasks.

C3 AI provides three predefined node pools as specified by App.NodePool.Defaults: singlenode, leader, and task. The default node pools are described in more detail below:

Node Pool Name | Role   | Description
singlenode     | *      | Default node pool for all Single Node Environments. All actions are processed by a single node.
leader         | leader | Default node pool for leader nodes, which handle synchronous actions in a Shared Environment.
task           | task   | Default node pool for task nodes, which handle all asynchronous actions in a Shared Environment.

Node pools are also assigned a HardwareProfile, which defines the resources allocated to the pool, such as CPU count, memory allocation, or GPU product type.

Part 1: Configure and manage node pools

This section details how you can configure a predefined or custom node pool to scale horizontally with node counts, or vertically with hardware profiles. For most projects, the commands in this section are sufficient to perform all relevant node management tasks.

Inspect node pools

Each of the default node pools has an associated configuration (Hardware Profile, Autoscale Spec, and Role), enabled by the Configuration Framework, which can be viewed with the following APIs.

JavaScript
// Run in <env>/<app> console
var app = C3.app();

// This can be used to examine all node pools configured for this App.
app.nodePools();

// This can be used to examine the config for a particular node pool in this application.
// This provides information such as hardware profile, autoscale specifications, and server roles for the node pool.
app.nodePool(<node-pool-name>).config();

For example, the configuration for the singlenode pool in any Single Node Environment can be viewed with the following command:

JavaScript
var app = C3.app();
app.nodePool('singlenode').config();

The following commands can be used to view nodes (count, state, and other details) in a given node pool:

JavaScript
var app = C3.app();
app.nodePool(<node-pool-name>).nodes();

Note: The C3.app() command uses the current application context to return an instance of an App. Using this command assumes that you are in the console of the application for which you are configuring the node pools, which is the most common scenario. If you wish to modify the configuration of a node pool outside of the current app, you should use App.forId() instead:

JavaScript
// Can be run in c3/c3 for any app, or in <env>/c3 for an application in <env>
var app = App.forId(<app-name>);

Along with the default node pools, you can configure new custom node pools. Refer to the "Run Jobs on Node Pools" section for use cases.

Configure node count

You can manually control the number of nodes for non-singlenode pools. The example below shows how to set different node counts. Note that a manually set target node count must fall within the configured minimum and maximum node counts; otherwise, an error results. These boundaries act as a safeguard against entering an incorrect node count.

JavaScript
var app = C3.app();

// Can be 'leader', 'task', or a custom-defined node pool
var nodePool = app.nodePool('<node-pool-name>'); 

// Sets the target node count
nodePool.setNodeCount(2);

// From left to right, sets the target node count, min node count, and max node count
nodePool.setNodeCount(2, 1, 3); 

// Once the user is satisfied with the newly set node count configuration, trigger an update
nodePool.update();

An update triggers a rollout of new nodes, which become ready 1 to 30 minutes after the command is run, depending on various factors (for example, Kubernetes node autoscaling and image pulling). Existing nodes continue to serve requests until the new nodes are ready. Run C3.app().nodePool(<node-pool-name>).isReady() to verify whether the update is complete.
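The boundary check described above can be sketched as a small stand-alone function. This is an illustration only; validateNodeCount is a hypothetical helper, not a C3 API:

```javascript
// Hypothetical sketch of the node-count boundary rule: a manually set
// target node count must fall within [min, max], otherwise an error results.
function validateNodeCount(target, min, max) {
  if (min > max) {
    throw new Error(`min node count (${min}) must not exceed max (${max})`);
  }
  if (target < min || target > max) {
    throw new Error(`target node count (${target}) must be within [${min}, ${max}]`);
  }
  return { target, min, max };
}
```

For example, validateNodeCount(2, 1, 3) succeeds, while validateNodeCount(5, 1, 3) throws because the target exceeds the maximum boundary.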

Configure node pool AutoScaler

To configure autoscaling, use the setAutoScaleSpec API on a node pool. If enabled, the autoscaler scales the pool between the configured minimum and maximum node counts based on the InvalidationQueueStats fields computingEntries and awaitingCompute.

In the below example, the API is used to configure autoscaling for the task node pool:

JavaScript
// Run commands in <env>/<app> console
var app = C3.app();
var nodePool = app.nodePool('task');

// sets the target, min, max node count
nodePool.setNodeCount(1, 1, 4);

// enables autoscaler
nodePool.setAutoScaleSpec(true);
   
// Once the user is satisfied with the newly set node count configuration, trigger an update
nodePool.update();

NOTE: If autoscaling is enabled, the autoscaler immediately adjusts the target node count within the configured minimum and maximum node counts for the node pool.

Node pool autoscaling is done through a C3 AI CronJob. The enabled field on the App.NodePool.AutoScaleSpec starts the C3 AI CronJob, which periodically invokes App.NodePool#scale. See also Create Long-Running Jobs.
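As an illustration of how the queue statistics named above could drive a target node count, the following stand-alone sketch clamps a queue-derived target to the configured bounds. This is not the C3 implementation; the per-node capacity is an assumed parameter:

```javascript
// Simplified model of queue-driven autoscaling (illustration only).
// Pending work is the sum of entries currently computing and entries
// awaiting compute; the target is the node count needed at an assumed
// per-node capacity, clamped to the configured [min, max] range.
function autoscaleTarget(stats, minNodes, maxNodes, entriesPerNode = 100) {
  const pending = stats.computingEntries + stats.awaitingCompute;
  const desired = Math.ceil(pending / entriesPerNode);
  return Math.min(maxNodes, Math.max(minNodes, desired));
}
```

With 150 computing entries, 200 awaiting entries, and bounds of [1, 4], the sketch returns 4; with an empty queue it falls back to the minimum of 1.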

Restart nodes in a node pool

Occasionally, restarting nodes becomes necessary: jobs can become stuck, thread starvation or deadlock can occur if competing jobs use all available threads, nodes can run out of memory (OOM), or an old configuration can be cached on the node. Nodes in a node pool are instances of Server, and the Server.restart() command can be used to restart them. When you restart a node, it is unresponsive (and returns an HTTP 503 error) for 1 to 5 minutes.

Single Node Environment

In a Single Node Environment, you are directly interacting with the only Server. You can run Server.restart() in the C3 AI console (<env>/c3 console, or any of the app consoles) to restart the node.

Shared Environment

In a Shared Environment, you must call Server.restart() on each node individually to restart it. For example, you can restart all task nodes with the following code:

JavaScript
// Run commands in <env>/<app> console
var app = C3.app();

// Get all nodes in the task node pool
var nodes = app.nodePool('task').nodes();

// Iterate through each node and restart it
nodes.forEach(node => node.restart());

Rolling Restart

For task nodes in production environments, or for leader nodes in any environment, it is more desirable to perform a "rolling restart" of all nodes in a pool to avoid downtime for an entire node pool. A rolling restart means restarting one node, waiting for it to become responsive again, restarting the next node, waiting for it to become responsive again, and repeating for any remaining nodes. An example script for a rolling restart on leader nodes is below:

JavaScript
// Run commands in <env>/<app> console
var app = C3.app();

// Get all nodes in the leader node pool
var nodes = app.nodePool('leader').nodes();

// Manually restart node i from the array, WAIT 5-10 minutes, then repeat for node i+1
var i = 0;
nodes[i].restart();
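The manual procedure above can also be expressed as a loop. The sketch below uses mock node objects rather than real C3 Server instances, and the wait step is an assumption of how you would poll readiness:

```javascript
// Rolling-restart sketch with mock nodes (not the real C3 Server API).
// At most one node is down at any time: restart a node, wait until it is
// responsive again, then move on to the next one.
function rollingRestart(nodes, waitUntilReady) {
  for (const node of nodes) {
    node.restart();
    waitUntilReady(node); // e.g. poll every few minutes until the node responds
  }
}

// Demonstration with mock nodes that record the order of operations.
const log = [];
const mockNodes = ['leader-0', 'leader-1'].map(name => ({
  name,
  restart() { log.push(`restart ${name}`); },
}));
rollingRestart(mockNodes, node => log.push(`ready ${node.name}`));
```

The recorded order (restart leader-0, ready leader-0, restart leader-1, ready leader-1) shows that the second node is only touched after the first is responsive again.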

Manage storage lifecycle

Single Node Environment

For Single Node Environments, the App.NodePool.Config.SharedStorage#enabled flag controls the lifecycle of the storage backing all directories of the c3 server container.

When enabled is set to true, the storage lifecycle is tied to the environment's lifecycle, meaning the storage is persisted across all restarts and reschedules. When enabled is set to false, the storage lifecycle is tied to the JVM lifecycle, meaning the storage is NOT persisted across restarts and reschedules.

Shared Environment

The App.NodePool.Config.SharedStorage#enabled flag is not applicable for Shared Environments. In Shared Environments, the storage lifecycle is always tied to the JVM lifecycle, meaning the storage is NOT persisted across restarts and reschedules.

Advanced: Use setter methods to update existing node pools

To update an existing node pool, invoke its setter methods to reconfigure it. Below is a complete list of the available setter methods. The commands below reference a HardwareProfile; for more information on HardwareProfiles, jump to the "Configure hardware profiles" section.

JavaScript
var app = C3.app();

app.nodePool(<node-pool-name>)
   .setNodeCount(3, 2, 4)                                                 // sets the target, min, max node count
   .setHardwareProfile(1, 2000, 1, GpuVendor.NVIDIA, "nvidia-tesla-t4")   // sets the cpu, memoryMb, gpu, gpu products
   .setAutoScaleSpec(true)                                                // enables autoscaler
   .setJvmSpec(0.8);                                                      // sets the max memory fraction to be reserved for the JVM
   
// Make sure to update the node pool after changing configurations
app.nodePool(<node-pool-name>).update();

Below are examples of common use cases.

JavaScript
var app = C3.app();

// Sets the target node count to 1, min node count to 1, max node count to 3, and enables autoscale.
app.nodePool(<node-pool-name>)
   .setNodeCount(1,1,3)
   .setAutoScaleSpec(true);

// Sets the target node count to 2, vCPU to 4, and memory to 4Gi.
app.nodePool(<node-pool-name>)
   .setNodeCount(2)
   .setHardwareProfile(4, 4000);
   
// Sets the target node count to 2, min node count to 1, max node count to 3, vCPU to 4,
// memory to be 4Gi, GPU to be 1, and the GPU product type to be Nvidia-Tesla-T4.
app.nodePool(<node-pool-name>)
   .setNodeCount(2, 1, 3)
   .setHardwareProfile(4, 4000, 1, GpuVendor.NVIDIA, "nvidia-tesla-t4");
   
// Sets the vCPU to 4, memory to 4Gi, and number of GPU cores to 0.
// Setting the number of GPU cores to be 0 will also clear existing GPU product configs
app.nodePool(<node-pool-name>)
   .setHardwareProfile(4, 4000, 0);

// Once the user is satisfied with the newly set configurations, trigger an update
app.nodePool(<node-pool-name>).update();

NOTE: An update triggers a rollout of new nodes, which become ready 1 to 30 minutes after the command is run, depending on various factors (for example, Kubernetes node autoscaling and image pulling). Existing nodes continue to serve requests until the new nodes are ready. Run C3.app().nodePool(<node-pool-name>).isReady() to verify whether the update is complete.
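The GPU-clearing behavior noted in the comments above can be modeled as a small stand-alone sketch. applyHardwareProfile is hypothetical, not the platform's setter; it only illustrates the described rule:

```javascript
// Models the documented behavior: setting the GPU count to 0 also clears
// any existing GPU product configuration, while a positive count records
// the vendor and product alongside it (illustration, not platform code).
function applyHardwareProfile(profile, { cpu, memoryMb, gpu, gpuVendor, gpuKind }) {
  const next = { ...profile, cpu, memoryMb };
  if (gpu === 0) {
    next.gpu = 0;
    delete next.gpuVendor; // gpu: 0 clears existing GPU product configs
    delete next.gpuKind;
  } else if (gpu !== undefined) {
    Object.assign(next, { gpu, gpuVendor, gpuKind });
  }
  return next;
}
```

Applying gpu: 0 to a profile that previously carried a vendor and product type leaves a profile with no GPU fields other than the zero count.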

Part 2: Start and terminate node pools

Start and update node pools

Once the application is up and running, you might want to start new node pools. Examples for how to do this are listed below based on the level of configurability that is desired.

NOTE: Node pools act differently for Single Node Environment (SNE) deployments. In the case of an SNE, a standalone server node handles all roles by design and additional nodes cannot be assigned. If additional hardware (for example, GPU product type) or resources are necessary, the standalone server itself needs to scale vertically.

NOTE: In an SNE, you cannot configure any new node pools. You can only update the singlenode node pool, which can only be done in the c3 application (<env>/c3).

NOTE: An update triggers a rollout of new nodes, which become ready 1 to 30 minutes after the command is run, depending on various factors (for example, Kubernetes node autoscaling and image pulling). Existing nodes continue to serve requests until the new nodes are ready. Run C3.app().nodePool(<node-pool-name>).isReady() to verify whether the update is complete.

Case 1: Start task node pools or restart an SNE with new CPU and memory configurations

Sometimes, you might need to use predefined node pools with just a few modifications (for example, CPU, memory, GPU).

JavaScript
// Configures and updates predefined task or singlenode node pool with target node count as 1, vCPU as 4, memory as 4Gi.
// An SNE supports only a target node count of 1 and does not support autoscaling, for the reasons stated above.
var app = C3.app();
app.updateTaskNodePool(1, 4, 4000, true); // with autoscaling
app.updateTaskNodePool(1, 4, 4000);       // without autoscaling

NOTE: The target node count is bound by the minimum and maximum node count. If the target count is beyond the configured boundaries, reset the minimum and maximum boundaries with setNodeCount. Refer to the "Configure node count" section for details.

Case 2: Reconfigure certain fields (for example, GPU and JVM heap size) and update node pools

In certain cases, you might need to modify additional fields, such as GPU profiles, which are not configurable through updateTaskNodePool. If this is the case, refer to setter methods as described in "Configure and manage node pools" section.

Case 3: Start nodes of a previously configured custom node pool

Refer to the custom node pool configuration sub-section in "Configure and manage node pools" section of this topic. This is not applicable for SNE.

Stop and terminate node pools

Once users are finished using nodes of a particular node pool, they can hibernate or terminate those nodes using the following APIs.

JavaScript
var app = C3.app();
// API for hibernation 
app.nodePool(<node-pool-name>).stop();

// API for termination
app.nodePool(<node-pool-name>).terminate();

It is also possible to terminate only the node pools that serve as C3 server task nodes. In this case, leader nodes remain active. Note that although SNEs also serve as task nodes, they are not terminated by the below API.

JavaScript
var app = C3.app();
// API for terminating all task nodes
app.terminateTaskNodePools();

Part 3: Configure hardware profiles

HardwareProfiles are used to define what resources are given to an App.NodePool. As an example, this could be CPU count, memory allocation, or GPU product type.

Below is an example of a way to configure a new hardware profile with two (2) virtual CPUs and 10 GB of memory.

JavaScript
var hardwareProfile = HardwareProfile.upsertProfile({
    name: '2vCpu10mem',
    cpu: 2,
    memoryMb: 10000
});

Below is an example of a way to configure a new hardware profile with two (2) Nvidia Tesla T4 GPUs, along with two vCPUs and 10 GB of memory.

JavaScript
var gpuHwProfile = HardwareProfile.upsertProfile({
  name: '2xNvidiaT4w2vcpu10mem',
  cpu: 2,
  memoryMb: 10000,
  gpu: 2,
  gpuKind: 'nvidia-tesla-t4',
  gpuVendor: 'nvidia'
});

Note: For GPUs to be usable by the C3 Agentic AI Platform, Nvidia drivers must first be installed. This must be done once per cluster. See Install Drivers.

Note: Due to limitations around GPU quotas and K8sNodePool discovery, it is recommended that all GPU-attached hardware profiles first be set up by the cluster admin for other users to consume. The same applies to hardware profiles with large CPU and memory configurations.
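Before upserting a profile, a simple consistency check can catch malformed specs early. The sketch below is hypothetical (checkProfileSpec is not a platform API), and the rule that a GPU-attached profile needs both gpuVendor and gpuKind is an assumption based on the fields shown in the examples above:

```javascript
// Hypothetical pre-flight check for a hardware profile spec, based on the
// fields used in the upsertProfile examples above (not a platform API).
function checkProfileSpec(spec) {
  const errors = [];
  if (!spec.name) errors.push('name is required');
  if (!(spec.cpu > 0)) errors.push('cpu must be positive');
  if (!(spec.memoryMb > 0)) errors.push('memoryMb must be positive');
  if (spec.gpu > 0 && (!spec.gpuVendor || !spec.gpuKind)) {
    errors.push('gpu profiles need gpuVendor and gpuKind'); // assumed rule
  }
  return errors;
}
```

An empty result means the spec passes the sketch's checks; each string in the result names a problem to fix before upserting.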

Below are ways to examine existing hardware profiles.

JavaScript
// List all available hardware profiles
HardwareProfile.listConfigs().collect();

// Fetch a given hardware profile
HardwareProfile.forName(<name-of-profile>);

See Use Hardware Profiles for more information.

Configure a new custom node pool

This section describes how to configure new custom node pools with full control over all fields. In most cases, you do not need to use this API.

JavaScript
// Step 1) Fetch the hardware profile, see "Configure hardware profiles" section for more
var hardwareProfile = HardwareProfile.forName(...);

// Step 2) Configure custom node pool
var app = C3.app();
app.configureNodePool(<node-pool-name>,                    // name of the node pool to configure 
                      1,                                   // sets the target node count
                      0,                                   // sets the minimum node count
                      3,                                   // sets the maximum node count
                      hardwareProfile,                     // sets the hardware profile
                      [Server.Role.TASK],                  // sets the server role that this node pool will function as
                      true,                                // optional - specifies whether autoscaling should be enabled
                      0.7,                                 // optional - specifies the JVM max memory fraction for c3server 
                      "App.NodePool.Membership.VertexAI"); // optional - specifies the membership for this node pool
                      
// Step 3) Trigger an update to bring up nodes for the custom node pool
app.nodePool(<node-pool-name>).update();

Once a node pool is configured, you can check that it is registered in the Configuration Framework through one of the following APIs.

JavaScript
var app = C3.app();

// This can be used to examine all node pools configured for this App
app.nodePools();

// This can be used to examine a particular node pool configured for this App
app.nodePool(<node-pool-name>);

// This can be used to examine the config for a particular node pool configured for this application
// This provides information such as hardware profile, autoscale specifications, and server roles for the node pool
app.nodePool(<node-pool-name>).config();

Part 4: Start an application with specified node counts

Start applications with specified node counts

NOTE: The below APIs can only be used to start predefined node pools with a specified count.

Below is a snippet for starting task node pools with a specified node count.

JavaScript
// This starts an application with the target node count for task nodes set to 2 and autoscaling disabled
// Note the 'autoScale' field is true by default 
Env.forName(...).startApp({'name': ..., 'taskNodeCount': 2, 'autoScale': false});

// This starts an application with the max node count for task nodes set to 2 and autoscaling enabled
// It is ineffective to set the target node count in this case as it will immediately be downscaled
Env.forName(...).startApp({'name': ..., 'taskNodeCount': 2, 'autoScale': true});

Part 5: Configure ephemeral volume size for a node pool

When you need to configure the size of a volume (for example, shared memory), the following snippet can be used to reconfigure it.

JavaScript
var app = C3.app();
var volumeMapping = {};
volumeMapping[App.NodePool.Config.Volumes.DEVSHM] = '5Gi';
app.nodePool(<node-pool-name>).config().withVolumeSizeMapping(volumeMapping).setConfig('APP');
app.nodePool(<node-pool-name>).update();

The c3-server container comes with an ephemeral-storage limit, which corresponds to the total ephemeral storage for the container. If the container's ephemeral volume usage exceeds this limit, the pod can crash and is eventually evicted. The limit can be adjusted through the diskGb field of the hardware profile:

JavaScript
app.nodePool(<node-pool-name>).config().withHardwareProfile({'diskGb': 50}).setConfig('APP');
app.nodePool(<node-pool-name>).update();
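Volume sizes such as '5Gi' above are Kubernetes quantity strings with binary suffixes. The stand-alone sketch below shows how such a quantity maps to mebibytes; sizeToMi is a hypothetical helper for illustration, and the platform performs its own parsing:

```javascript
// Parses Kubernetes-style binary size suffixes (Mi, Gi, Ti), as used in the
// volume mapping above, into mebibytes. Illustration only, not platform code.
function sizeToMi(quantity) {
  const m = /^(\d+)(Mi|Gi|Ti)$/.exec(quantity);
  if (!m) throw new Error(`unsupported quantity: ${quantity}`);
  const factor = { Mi: 1, Gi: 1024, Ti: 1024 * 1024 }[m[2]];
  return Number(m[1]) * factor;
}
```

For example, the '5Gi' shared-memory size corresponds to 5120 MiB.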

Part 6: Monitor status of node pools

Status report of node pools

When nodes of a node pool are started or restarted by using start or update, it takes time for the cloud provider to provision new nodes and for the C3 AI server to bootstrap.

Run C3.app().nodePool(<node-pool-name>).isReady() to see if the node pool is ready.

Run C3.app().nodePool(<node-pool-name>).nodeStates() for a detailed summary of all nodes that are in the node pool.

Statuses reported by this API are meant to be eventually consistent and highly reliable. However, it is also possible to get a real-time picture in case the node pool state captured by snapshots is outdated.

Below are examples of what you can expect from this API.

  • If the C3 server is running, the API reports that it is running and ready to serve requests.
  • If the C3 server is bootstrapping, the API reports that it is currently bootstrapping.
  • If the C3 server is failing to start, the API reports a clear reason why (for example, ContainerCreating or ImagePullBackOff).

Part 7: Run jobs on node pools

This section discusses how to schedule jobs on predefined or custom node pools.

Case 1: Assign jobs to a particular node pool

If a job must be assigned to specific nodes of a node pool (for example, high-memory or GPU nodes), do the following.

JavaScript
// Assuming a node pool called 'gpu' is up and running, this submits a job to nodes of the 'gpu' node pool
ActionQueue.submitAction(C3.type("User"), "myUser", null, true, 'gpu');

Case 2: Assign jobs without consideration of node pool

By default, jobs that do not specify a node pool are not scheduled to custom node pools. This prevents jobs from unintentionally running on nodes reserved for GPU or high-memory workloads. Jobs that do not specify a node pool are scheduled on the default (predefined) task nodes or the single node.

JavaScript
// Will be assigned to default 'task' node pool for multi-node deployments or 'singlenode' node pool for SNE 
ActionQueue.submitAction(C3.type("User"), "myUser", null, true, null);
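The scheduling rule above can be summarized as a small stand-alone sketch. resolveNodePool is hypothetical, not a platform API; it only restates the documented fallback behavior:

```javascript
// Models the documented scheduling rule (illustration, not platform code):
// a job pinned to a pool runs there; an unpinned job (null) falls back to
// the default pool and is never placed on a custom node pool.
function resolveNodePool(requestedPool, isSingleNode) {
  if (requestedPool) return requestedPool;
  return isSingleNode ? 'singlenode' : 'task';
}
```

A job pinned to 'gpu' lands on the 'gpu' pool; an unpinned job lands on 'task' in a Shared Environment or 'singlenode' in an SNE.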

Part 8: Current limitations of node pools and workarounds

This section describes current limitations of node pools and available workarounds.

Limitation 1: Node pool names can contain only lowercase alphanumeric characters.

Node pool names are limited to lowercase alphanumeric characters. As a best practice, make sure the name of the node pool reflects its purpose.
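The naming rule can be checked up front with a simple pattern. This is a sketch of the stated rule only; the platform performs its own validation:

```javascript
// Checks the documented naming restriction: lowercase letters and digits
// only (sketch of the rule, not the platform's validator).
function isValidNodePoolName(name) {
  return /^[a-z0-9]+$/.test(name);
}
```

For example, 'gpu2' is a valid name, while 'GPU', 'ml-pool', and the empty string are rejected.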

Limitation 2: Terminating pods are reported as 'running' rather than 'terminating' in Kubernetes. Sometimes, more nodes are shown than were configured for the node pool.

App.NodePool#isReady() and App.NodePool#nodeStates() are built for accuracy and designed to give the precise state shown by Kubernetes JSON payloads. As an example, during a server upgrade, the existing pod hosting the C3 Agentic AI Platform is scheduled for termination and, at the same time, a new pod starts bootstrapping.

Terminating pods continue to serve requests and remain 'running' until the new pods are fully ready. During this brief window, the Kubernetes JSON payloads report one 'running' pod and one 'starting' pod, even though the target number of pods for the C3 Agentic AI Platform is 1 and the existing pod is terminating.

Because App.NodePool#nodeStates() accurately reflects what the Kubernetes JSON payloads report, it is the user's responsibility to interpret such behaviors correctly to fully benefit from the information this API provides.
