Isolation Forest

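Find anomalies using the isolation forest algorithm in Visual Notebooks. Isolation forest models isolate outliers using decision trees that partition the data at random; anomalous data points are separated from the rest of the data in fewer splits. These models are commonly used to identify fraud and mechanical failures.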
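
To illustrate the idea of isolating outliers with trees, here is a minimal single-feature sketch in plain Python. It is not the Visual Notebooks implementation; the function names, the hypothetical salary values, and the choice of uniform random split points are assumptions made for the example.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth  # identical values cannot be split further
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [x for x in data if x < split]
    else:
        side = [x for x in data if x >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_length(point, data, n_trees=100, seed=0):
    """Average the path length over many random trees."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Hypothetical yearly salaries in thousands; 500 is the obvious outlier.
salaries = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 500]
outlier_len = mean_length(500, salaries)
typical_len = mean_length(55, salaries)
print(outlier_len, typical_len)
```

The outlier is usually cut off by the very first split, so its mean length stays close to 1, while a typical value needs several splits before it is alone.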

Configuration

Field	Description
Name default=none	Name of the node. A user-specified name displayed on the node in the canvas and as a tab in the dataframe.
Select Feature Columns default=`Select all columns as features`	The columns to search for anomalies. Use all columns as features or select specific columns from the dropdown menu.

Advanced Configuration

Optionally alter the advanced configuration fields to control the output of the node.

Hyperparameters Search

Visual Notebooks trains many different models with various hyperparameter combinations. The fields in this section determine the hyperparameter options used during training. Although you don't need to alter these fields to train a high-performing model, it can be interesting to explore different combinations.

Hyperparameters give you precise control over a model. In general, the goal of changing the hyperparameters is to make the best possible model while avoiding overfitting. A model is considered overfit when it is too closely aligned to the training data to produce accurate predictions on unseen data.

Field	Description
Hyperparameters Search default=`Search`	Train one model or multiple models. Select Search to train multiple models with different hyperparameter combinations and then compare the models to find the best one. Select Fixed to train a single model with a fixed hyperparameter configuration.
Number of Trees in Forest default=`20, 50, 100, 150, 200`	The number of trees to build. Enter an integer between 1 and 10,000. More trees typically create a more accurate model, but can lead to overfitting. Values between 50 and 200 are common. If you define a fixed model, the default is 50.
Maximum Tree Depth default=`4, 8, 12, 16, 20, 40`	The maximum number of levels in each tree. Enter an integer between 0 and 100. Setting this value to 0 specifies no limit. Increasing the tree depth allows the model to fine-tune its performance, but may lead to overfitting. If you define a fixed model, the default is 20.
Sample Size default=`256`	The number of samples used to train each tree. If Define Fixed is selected, specify the number of randomly sampled observations used to train each tree.
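
The difference between Search and Fixed can be sketched in a few lines of Python. The values come from the defaults in the table above, but the dictionary keys are hypothetical, and treating Search as an exhaustive grid over both fields is an assumption; the documentation does not say how Visual Notebooks picks combinations.

```python
from itertools import product

# Default search values listed in the table above.
num_trees_options = [20, 50, 100, 150, 200]
max_depth_options = [4, 8, 12, 16, 20, 40]

# Assumption: Search tries every pairing of the two lists.
search_grid = [
    {"num_trees": n, "max_depth": d}
    for n, d in product(num_trees_options, max_depth_options)
]

# Fixed: one configuration, using the documented fixed-model defaults.
fixed_config = {"num_trees": 50, "max_depth": 20, "sample_size": 256}

print(len(search_grid))  # 30 candidate configurations
```

Under that assumption, adding one more value to either field adds a full row or column of candidate models, which is why long option lists increase training time quickly.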

Contamination Property

The contamination property tells the isolation forest model what percentage of the data you expect to be anomalous. A contamination value is not required to train a model, but must be provided in order to generate Boolean predictions for anomalies.

Field	Description
Contamination default=`Don't Specify Contamination`	The proportion of anomalies. The contamination ratio is the percentage of anomalies in the input dataset. If Don't Specify Contamination is selected, models do not calculate Boolean predictions for anomalous values.
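
One common way to turn a contamination value into Boolean predictions is to flag the highest-scoring fraction of data points. The sketch below assumes that approach and assumes contamination is expressed as a fraction between 0 and 1; the function name and scores are hypothetical, and Visual Notebooks may compute its threshold differently.

```python
def flag_anomalies(scores, contamination):
    """Mark roughly the top `contamination` fraction of scores as anomalies."""
    if not 0 < contamination < 1:
        raise ValueError("contamination must be between 0 and 1 (exclusive)")
    if not scores:
        return []
    n_flagged = max(1, round(contamination * len(scores)))
    # The threshold is the n_flagged-th highest score.
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [score >= threshold for score in scores]

# Hypothetical anomaly scores; higher means more anomalous.
scores = [0.41, 0.39, 0.44, 0.87, 0.40, 0.42, 0.38, 0.91, 0.43, 0.40]
predictions = flag_anomalies(scores, contamination=0.2)
print(predictions)
```

Without a contamination value there is no threshold to compare against, which is consistent with the node producing scores but no Boolean column in that case. Note that tied scores at the threshold are all flagged in this sketch.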

Validation Settings

An isolation forest model attempts to find previously unidentified anomalies in data. Since the results are unknown before training, there isn't a way to validate the model's predictions.

Additional test and validation methods will be implemented at a later date. When that occurs, use this field to determine whether the model is consistent when given additional data.

Field	Description
Select test and validation method default=`No Validation`	Validation method. Additional test and validation methods are coming soon.

Initialization Seed

Random numbers are used throughout the training process for splitting the original dataset, splitting individual trees, and optimizing hyperparameters. Visual Notebooks uses one number, called a seed, to generate those random numbers. The field in this section allows you to enter a custom seed. If you enter a custom seed, you can enter the same seed at a later date to reproduce the results of training.

Field	Description
Seed default=`Random`	The number used throughout the AutoML process. Select Random to use a random number, or select Custom to enter a specific integer.
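
The effect of a custom seed can be shown with Python's standard `random` module. This is only an illustration of seeded reproducibility in general; the function name and row counts are hypothetical, and it says nothing about the random number generator Visual Notebooks uses internally.

```python
import random

def sample_rows(n_rows, sample_size, seed):
    """Draw row indices with a generator initialized from `seed`."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), sample_size)

run_a = sample_rows(600, 5, seed=42)
run_b = sample_rows(600, 5, seed=42)  # same seed: same rows
run_c = sample_rows(600, 5, seed=7)   # different seed: almost certainly different rows

print(run_a == run_b)
```

Because every random choice follows from the seed, rerunning with the same seed and the same data repeats the same samples and splits.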

Output Column Names

Field	Description
Prediction default=`prediction`	The column with the model's predictions. Enter a name for the column that contains the selected model's predictions. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. This column contains `True` if the data point is considered an anomaly and `False` if the data point is normal. If Don't Specify Contamination is selected for the Contamination field, the prediction column does not appear in the resulting dataframe.
Anomaly Score default=`anomaly_score`	The column with the anomaly score. Enter a name for the column that contains the anomaly scores. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. The anomaly score is typically between 0 and 1. Data points with higher scores are more anomalous.
Mean Length default=`mean_length`	The column with the mean length. Enter a name for the column that contains the mean length. Column names can contain alphanumeric characters and underscores, but cannot contain spaces. The mean length is the average number of splits it takes to isolate a data point across all decision trees in the isolation forest. Data points with a low mean length, close to 1, are more likely to be anomalous since it takes fewer partitions of the data to isolate them.
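
The table above does not state how the anomaly score is derived from the mean length. One standard formulation, from the original isolation forest algorithm, normalizes the mean length by the average path length expected for a given sample size. The sketch below implements that formulation as an assumption; Visual Notebooks may normalize its scores differently, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # used to approximate harmonic numbers

def average_path_length(n):
    """c(n): expected path length for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_length, sample_size):
    """Score in (0, 1]; shorter mean lengths give higher scores."""
    return 2.0 ** (-mean_length / average_path_length(sample_size))

sample_size = 256  # the documented default sample size
short_path = anomaly_score(1.0, sample_size)
long_path = anomaly_score(12.0, sample_size)
print(short_path, long_path)
```

In this formulation, a data point whose mean length equals the expected average scores exactly 0.5, and the score rises toward 1 as the mean length shrinks.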

Prediction

The output of this node is each model's predictions on the training data. This section determines how the predictions are displayed in the resulting dataframe.

Field	Description
Prediction Column Name default=`prediction`	The column name for the model's predictions. Enter a name for the column that contains the selected model's predictions. Column names can contain alphanumeric characters and underscores, but cannot contain spaces.
Dataset Selection default=`Train Dataset`	Data used to display a model's predictions. Visual Notebooks displays a selected model's predictions on the dataset selected with this field.
Include all columns default=off	Whether to include all columns in the predictions table. Toggle this field on to include all columns in the predictions table, including the columns that you did not use as features for the model. By default, only the columns you selected as features are included.

Node Inputs/Outputs

Input	A Visual Notebooks dataframe
Output	Trained isolation forest models and a dataframe with predictions on the training data

Figure 1: Example output, showing example models and example model predictions

Examples

The dataframe shown in Figure 2 contains the job title and yearly salary of more than 600 data science professionals. Use the Isolation Forest node to find any unusually high or unusually low salaries.

Figure 2: Example input data
