C3 AI Documentation Home

Isolation Forest

Find anomalies using the isolation forest algorithm in Visual Notebooks. Isolation forest models isolate outliers using decision trees. These models are commonly used to identify fraud and mechanical failures.

Configuration

FieldDescription
Name default=noneName of the node A user-specified node name displayed in the canvas, both on the node and in the dataframe as a tab.
Select Feature Columns default=Select all columns as featuresThe columns to search for anomalies Use all columns as features or select specific columns from the dropdown menu.

Advanced Configuration

Optionally alter the advanced configuration fields to control the output of the node.

Visual Notebooks trains many different models with various hyperparameter combinations. The fields in this section determine the hyperparameter options used during training. Although you don't need to alter these fields to train a high-performing model, it can be interesting to explore different combinations.

Hyperparameters give you precise control over a model. In general, the goal of changing the hyperparameters is to make the best possible model while avoiding overfitting. A model is considered overfit when it is too closely aligned to the training data to produce accurate predictions on unseen data.

FieldDescription
Hyperparameters Search default=SearchTrain one model or multiple models Select Search to train multiple models with different hyperparameter combinations and then compare the models to find the best one. Select Fixed to train a single model with a fixed hyperparameter configuration.
Number of Trees in Forest default=20, 50, 100, 150, 200The number of trees to build Enter an integer between 1 and 10,000. More trees create a more accurate model, but can lead to overfitting. Values between 50 and 200 are common. If you define a fixed model, the default is 50.
Maximum Tree Depth default=4, 8, 12, 16, 20, 40The maximum number of levels in each tree Enter an integer between 0 and 100. Setting this value to 0 specifies no limit. Increasing the tree depth allows the model to fine-tune its performance, but may lead to overfitting. If you define a fixed model, the default is 20.
Sample Size default=256The number of samples to use to train each tree If Define Fixed is selected, specify the number of randomly sampled observations used to train each tree.

Contamination Property

The contamination property tells the isolation forest model what percentage of the data you expect to be anomalous. A contamination value is not required to train a model, but must be provided in order to generate Boolean predictions for anomalies.

FieldDescription
Contamination default=Don't Specify ContaminationThe proportion of anomalies The contamination ratio is the percentage of anomalies in the input dataset. If Don't Specify Contamination is selected, models do not calculate Boolean predictions for anomalous values.

Validation Settings

An isolation forest model attempts to find previously unidentified anomalies in data. Since the results are unknown before training, there isn't a way to validate the model's predictions.

Additional test and validation methods will be implemented at a later date. When that occurs, use this field to determine whether the model is consistent when given additional data.

FieldDescription
Select test and validation method default=No ValidationValidation method Additional test and validation methods are coming soon.

Initialization Seed

Random numbers are used throughout the training process for splitting the original dataset, splitting individual trees, and optimizing hyperparameters. Ex Machina uses one number, called a seed, to generate those random numbers. The field in this section allows you to enter a custom seed. If you enter a custom seed, you can enter that same custom seed at a later date to reproduce the results of training.

FieldDescription
Seed default=RandomThe number used throughout the AutoML process Select Random to use a random number, or select Custom to enter a specific integer.

Output Column Names

FieldDescription
Prediction default=predictionThe column with the model's predictions Enter a name for the column that contains the selected model's predictions. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. This column contains a value of True if the data point is considered an anomaly, and contains a value of False if the data point is normal. If Don't Specify Contamination is selected for the Contamination field, the prediction column will not be visible in the resulting dataframe.
Anomaly Score default=anomaly_scoreThe column with the anomaly score Enter a name for the column that contains the anomaly scores. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. The anomaly score is typically between 0 and 1. Data points with higher scores are more anomalous.
Mean Length default=mean_lengthThe column with the mean length Enter a name for the column that contains the mean length. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. The mean length is the average number of splits it takes to isolate a data point across all decision trees in the isolation forest. Values with low scores close to 1 are more likely to be anomalous since it takes fewer partitions of the data to isolate them.

Prediction

The output of this node is each model's predictions on the training data. This section determines how the predictions are portrayed in the resulting dataframe.

FieldDescription
Prediction Column Name default=predictionThe column name for the model's predictions Enter a name for the column that contains the selected model's predictions. Column names can contain alphanumeric characters and underscores, but cannot contain spaces.
Dataset Selection default=Train DatasetData used to display a model's predictions Visual Notebooks displays a selected model's predictions on the dataset selected with this field.
Include all columns default=offWhether to include all columns in the predictions table Toggle this field on to include all columns in the predictions table, including the columns that you did not use as features for the model. By default, only columns you selected as features will be included.

Node Inputs/Outputs

InputA Visual Notebooks dataframe
OutputTrained isolation forest models and a dataframe with predictions on the training data

Example models

Example model predictions

Figure 1: Example output

Examples

The dataframe shown in Figure 2 contains the job title and yearly salary of more than 600 data science professionals. Use the Isolation Forest node to find any unusually high or unusually low salaries.

Example input data

Was this page helpful?