Configure Alerts to Monitor System Health

Configure alerts to monitor application and environment health. You can create alerts for any environment or application, set alert rules using templates, and view alerts across applications to prioritize and investigate.

You can create alerts in C3 AI Studio to notify you when:

Prediction persistence drops in a production application.
Nodes serving your production application experience a catastrophic failure, such as an out-of-memory (OOM) error or a deadlock.
The number of requests increases by more than 20% in a specified period of time.
The error rate on the application servers increases significantly.
Invalidation queue throughput for a specific queue or set of queues drops.
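
To illustrate the request-volume condition in the list above, the following minimal Python sketch computes the percentage change between two periods of equal length and checks it against a 20% threshold. The function names and inputs are hypothetical and are not part of C3 AI Studio. You configure actual alert rules in the Alert Center, not in code.

```python
def percent_increase(previous_count: int, current_count: int) -> float:
    """Return the percentage change from previous_count to current_count."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) / previous_count * 100.0


def request_spike(previous_count: int, current_count: int,
                  threshold_pct: float = 20.0) -> bool:
    """Return True when requests grew by more than threshold_pct."""
    return percent_increase(previous_count, current_count) > threshold_pct


# Hypothetical request counts for two consecutive periods.
print(request_spike(previous_count=1000, current_count=1250))
```

In this sketch, a rise from 1000 to 1250 requests is a 25% increase, so the check reports a spike.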

Requirements

To configure and view alerts, you must have all of the following:

A running environment or application that you want to set up alerts for.
A user with the StudioAdmin role to enable application monitoring.
One of the following access controls within the application:
- The AppAdmin role to configure alerts.
- Access to an application to view alerts and alert rules for the application.

View alerts

To view alerts for an application, complete the following steps:

In C3 AI Studio, navigate to the Apps page.
Select the application whose alerts you want to view.
Select Alert Center.

To view alerts for an environment, complete the following steps:

In C3 AI Studio, navigate to the Envs page.
Select the environment whose alerts you want to view.
Select Alert Center.

Alerts display in the list under Alert Rules.

You can filter and aggregate alerts by the following fields:

Severity
Status
Status Since (Start Date)
Category
Template
Alert Rule

To adjust how often the alert list refreshes, select a value from the Auto Sync dropdown. Automatic synchronization defaults to Off.

The signal status of a firing alert automatically changes from Firing to Ok when the alert no longer meets the alert rule criteria.

To see all the alerts that are firing based on an alert rule, select an alert. The alert displays with cards showing each firing alert. Select a card to learn more about the alert.

Create an alert rule

In the Alert Center, select Add New Alert Rule or the plus sign (+).
Select a template. See the "Alert templates" section for details.
Select Configure.
Enter a Name for the alert rule.
Enter information for the target. See the "Set alert target" section for details.
Set alert conditions. See the "Set alert conditions" section for details.
Set the severity level. See the "Set severity level" section for details.
Select Save.

Edit an alert rule

In the Alert Center, select an alert.
Select Edit Alert Rule.
Make changes to the alert rule.
Select Save.

Disable or enable an alert rule

Disable an alert rule to turn it off. Enable an alert rule to turn it on.

In the Alert Center, select an alert.
Select the ellipsis button (...) next to the alert.
Select Disable Alert Rule. If the alert is already disabled, the ellipsis button displays Enable Alert Rule instead.
Select Disable or Enable.

Remove an alert rule

In the Alert Center, select an alert.
Select the trash can icon (Remove Alert Rule) next to the alert.
In the Remove Alert Rule window, select Remove.

Alert templates

Use any of the following templates to configure environment alerts:

CPU Usage: Triggers when CPU usage exceeds the specified threshold.
Ephemeral Storage Usage: Triggers when the percentage of ephemeral storage utilization meets the configured criteria.
Frequent container restarts: Triggers when any node has restarted more times than the configured threshold over the configured time range.
HTTP Response Error Count: Triggers when the count of 4xx and 5xx HTTP status codes returned by the server over a 5-minute period meets the configured criteria.
HTTP Response Error Trend: Triggers when the rate of 4xx and 5xx HTTP status codes returned by the server changes against a configured baseline.
JVM Memory Usage: Triggers when JVM memory usage exceeds the specified threshold.
JVM Thread State Percentage: Triggers when a percentage of JVM threads are in the specified states.

Use any of the following templates to configure application alerts:

Invalidation Queue Blocked: Triggers when the configured invalidation queue 'awaiting' count meets the configured criteria.
Invalidation Queue Errors: Triggers when the configured invalidation queue 'failed' count meets the configured criteria.
Invalidation Queue Stats: Triggers when the configured invalidation queue 'stat' meets the configured criteria.
Invalidation Queue Stats: Triggers when the configured invalidation queue stat exceeds the specified threshold.
ML Job error rate: Triggers when any ML job has errors that reach a certain threshold.
Average ML Job Prediction Throughput: Triggers when the average throughput of ML job predictions meets the configured criteria.
Studio GenAi Agent Health: Triggers when the Studio GenAi Agent is non-responsive or the response is not as expected over the configured time range.
Studio GenAi Agent Health Monitoring Template: Triggers when the Studio GenAi Agent is non-responsive or the response is not as expected over the configured time range.

Configure alert rules

You can configure target and alert conditions. Your exact options depend on the selected alert template and the alert target (an environment or application).

Set target conditions

For the Frequent container restarts template, set the Lookback interval to specify how far into the past the alert rule looks when it computes the alert statistic.
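
The following minimal Python sketch shows one way to think about a lookback interval: only restarts that fall inside the window ending now count toward the alert statistic. The function names, the timestamps, and the strict "more than" comparison are assumptions made for illustration. They do not describe how C3 AI Studio computes the statistic internally.

```python
from datetime import datetime, timedelta


def restarts_in_lookback(restart_times, now, lookback):
    """Count restarts whose timestamp falls within [now - lookback, now]."""
    window_start = now - lookback
    return sum(1 for t in restart_times if window_start <= t <= now)


def exceeds_restart_threshold(restart_times, now, lookback, threshold):
    """Return True when the restart count in the window exceeds threshold."""
    return restarts_in_lookback(restart_times, now, lookback) > threshold


# Hypothetical restart history: 5, 20, 50, and 90 minutes ago.
now = datetime(2024, 1, 1, 12, 0)
history = [now - timedelta(minutes=m) for m in (5, 20, 50, 90)]
print(restarts_in_lookback(history, now, timedelta(hours=1)))
```

With a one-hour lookback, the restart from 90 minutes ago falls outside the window, so only three restarts count. A longer lookback interval makes the rule more sensitive to restarts that are spread out over time.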

For the CPU Usage and JVM Memory Usage templates, set the match value to configure vector matching. To learn more about vector matching, see the Prometheus documentation.
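
As background on vector matching, Prometheus pairs the elements of two instant vectors by their labels before it applies a binary operator, and the on modifier restricts the comparison to a listed set of labels. The Python sketch below imitates one-to-one matching for a division. The data, label names, and function names are hypothetical, and the sketch does not show the format of the match value in C3 AI Studio.

```python
def match_key(labels, on):
    """Build the matching key from the labels listed in `on`.

    A missing label is treated as an empty string.
    """
    return tuple((name, labels.get(name, "")) for name in on)


def divide_on(left, right, on):
    """Divide left samples by right samples that share the same `on` labels.

    Each vector is a list of (labels, value) pairs. Samples without a
    partner are dropped. Simplification: zero denominators are skipped
    here, which is not how Prometheus itself handles them.
    """
    right_by_key = {match_key(labels, on): value for labels, value in right}
    result = []
    for labels, value in left:
        key = match_key(labels, on)
        denominator = right_by_key.get(key)
        if denominator:
            result.append((dict(key), value / denominator))
    return result


# Hypothetical CPU usage and CPU limit samples, matched on the "pod" label.
usage = [({"pod": "a", "container": "x"}, 0.5),
         ({"pod": "b", "container": "y"}, 1.5)]
limits = [({"pod": "a"}, 2.0), ({"pod": "c"}, 4.0)]
print(divide_on(usage, limits, on=("pod",)))
```

Only pod "a" appears in both vectors, so the result contains a single ratio for that pod. Pods "b" and "c" have no partner and drop out.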

For the Invalidation Queue Stats template, set the following target values:

Stat: The invalidation queue statistic to monitor.
Queue: The invalidation queue to monitor.

For the ML Job error rate template, set the following target values:

ML Job Type: The machine learning project.
Project Id: ID of the machine learning project.

Set alert conditions

When you create an environment or application alert, you can configure the following alert conditions:

Comparison Operator: The mathematical operator used to compare the alert statistic with the threshold.
Threshold: The value against which the alert statistic is compared.
Monitoring Interval: The amount of time that the condition must remain true before an alert is triggered.
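
The three conditions above can be sketched together in Python. This is a minimal illustration under stated assumptions: the AlertCondition class, the evaluate function, the set of operator symbols, and the sample format are all hypothetical, and the actual evaluation logic in C3 AI Studio may differ. The sketch reports Firing only after the comparison has held continuously for the monitoring interval, and returns to Ok as soon as the comparison stops holding.

```python
import operator
from dataclasses import dataclass

# Assumed operator symbols; the choices offered in the product may differ.
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


@dataclass
class AlertCondition:
    comparison_operator: str
    threshold: float
    monitoring_interval_s: float  # seconds the condition must stay true

    def breached(self, value: float) -> bool:
        """Compare one sample of the alert statistic with the threshold."""
        return OPERATORS[self.comparison_operator](value, self.threshold)


def evaluate(condition, samples):
    """Return "Firing" or "Ok" after the last (timestamp_s, value) sample.

    Samples must be in time order.
    """
    breach_start = None
    status = "Ok"
    for timestamp_s, value in samples:
        if condition.breached(value):
            if breach_start is None:
                breach_start = timestamp_s
            if timestamp_s - breach_start >= condition.monitoring_interval_s:
                status = "Firing"
        else:
            breach_start = None
            status = "Ok"
    return status


# Hypothetical rule: statistic above 80 for at least 120 seconds.
rule = AlertCondition(">", 80.0, 120)
print(evaluate(rule, [(0, 90.0), (60, 95.0), (120, 92.0)]))
```

In this example the value stays above 80 from second 0 through second 120, so the rule reaches the Firing status. A single sample at or below the threshold resets the timer, which is why a longer monitoring interval reduces alerts caused by short spikes.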

Set severity level

Set the default severity level that applies to an alert rule when you create the rule.
