Random Forest Regression

Train a random forest regression model that can predict continuous values in C3 AI Visual Notebooks. Random forest models consist of many decision trees whose predictions are averaged to output a final prediction. To learn more about random forest models, see the C3 AI glossary.
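Conceptually, a random forest regressor trains each tree on a bootstrap sample of the data and averages the trees' outputs. Below is a minimal, illustrative Python sketch of that averaging idea using one-split "stump" trees on a single feature; it is not the node's actual implementation, which grows full multi-feature trees.

```python
import random

def train_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold that minimizes
    the squared error of the two leaf means."""
    best = None
    points = sorted(set(xs))
    if len(points) < 2:  # degenerate bootstrap sample: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    for i in range(len(points) - 1):
        t = (points[i] + points[i + 1]) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def random_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # bootstrap sample
        trees.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # The forest's prediction is the average of the individual trees' predictions.
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
model = random_forest(xs, ys)
```

Because each tree sees a slightly different sample, averaging their predictions smooths out the errors any single tree makes.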

Configuration

Name (default: none)
A user-specified node name displayed in the workspace, both on the node and in the dataframe as a tab.

Select Column with Labels (required)
The column that the random forest regression model should predict. Select a column from the dropdown menu. This column contains the values that the model should be able to predict after training.

Advanced Configuration

Optionally alter the advanced configuration fields to control the output of the node.

Select Features

Select Features (default: select all columns as features, excluding the selected label column)
Use all columns as features, or select specific columns from the dropdown menu. Columns selected as features are used to train the model.

Select optional timeseries column (default: off)
If there is a timeseries column in your data, check the box in this field and select the timeseries column from the auto-populated dropdown menu. Timeseries information is used when splitting the data into separate train, validation, and test datasets.

Test and Validation Settings

When training models, data is split into multiple components. The bulk of the data is used for training and validation, while a small portion is set aside for testing. The fields in this section determine what percentage of the data is used for training, how the data is used during the training process, and the strategy used to split the data.

Select test and validation method (default: Train-validation-split)
Select Train-validation-split to split the dataset into separate train, validation, and test datasets. Select Cross-validation to split the data into a specified number of subsets. During training, one subgroup is used for testing and validation while the other subgroups are used for training. The process is repeated so that each subgroup serves as the testing and validation group once.

Select percentage split (default: Train: 70%, Validation: 15%, Test: 15%)
Move the slider to split the data into train, validation, and test datasets. If Cross-validation is selected in the Select test and validation method field, move the slider to split the data into a train dataset that will be divided into subgroups and a separate test dataset. The default split when using the cross-validation method is 80% train and 20% test.

Select number of cross-validation folds (default: 6)
Enter a number between 2 and 20. The data allocated for training is divided into the specified number of subgroups.
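The two splitting strategies can be sketched in plain Python. This is an illustrative sketch, not the product's implementation; with a timeseries column, the rows would be ordered by time rather than shuffled before slicing.

```python
import random

def split_data(rows, train=0.7, val=0.15, seed=0):
    # Shuffle, then slice into train/validation/test portions (default 70/15/15).
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

def cv_folds(rows, k=6):
    # Cross-validation: divide the data into k subgroups; each fold holds out
    # one subgroup for validation and trains on the rest.
    folds = []
    for i in range(k):
        holdout = rows[i::k]
        rest = [r for j, r in enumerate(rows) if j % k != i]
        folds.append((rest, holdout))
    return folds

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
folds = cv_folds(train, k=6)
```

Note that across the six folds, every training row is held out exactly once, which is what lets cross-validation use all of the training data for both training and validation.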

Scorer and Stopping Conditions

By default, Visual Notebooks trains many models with different hyperparameter configurations, then ranks the models by performance. The fields in this section tell Visual Notebooks when to stop making new models. You can stop making models once the new models no longer substantially improve upon the existing models. Alternatively, you can stop making new models after a specified number of models have been trained or a certain amount of time has passed.

The performance metric (default: Mean Residual Deviance)
Select Mean Residual Deviance, MSE, RMSE, MAE, or RMSLE. When training multiple models with different hyperparameter combinations, stop creating models when the new models fail to improve the specified performance metric. Each of these metrics measures error: how far the model's predictions are from the actual values. When the model predicts a value close to the actual value, the error is small; when it predicts a value far from the actual value, the error is large. An overview of these error metrics:
  - MSE (Mean Squared Error): The average of the squared errors. MSE penalizes large errors harshly.
  - RMSE (Root Mean Squared Error): The square root of MSE. RMSE also penalizes large errors harshly, and it is more interpretable because it is measured in the same units as the labels and predictions.
  - MAE (Mean Absolute Error): The average of the absolute errors. MAE does not penalize large errors as harshly as MSE or RMSE because it does not square the errors.
  - RMSLE (Root Mean Squared Logarithmic Error): The square root of the average of the squared logarithmic errors. RMSLE is primarily used when the ratio of the true value to the predicted value is the priority. Whereas the other metrics penalize incorrect predictions on large values more harshly, RMSLE does not, so it is commonly used when there is a very large range of possible values. RMSLE also penalizes underestimates more harshly than overestimates and is therefore used to rank models in situations where an underestimate is worse than an overestimate.
This field is used in conjunction with the following two fields.

Does not improve by more than (default: 0.01%)
Select 0.1%, 0.01%, 0.001%, or 0.0001%. When training multiple models with different hyperparameter combinations, stop creating models when the new models fail to improve the specified performance metric by the given percentage. This field is used in conjunction with the fields directly above and below.

After the following number of consecutive training rounds (default: 5)
Select a number between 2 and 10. When training multiple models with different hyperparameter combinations, stop creating models when the new models fail to improve the specified performance metric after this number of consecutive training rounds. This field is used in conjunction with the two fields above.

A maximum # of models have been trialed (default: 10)
Select 3, 5, 10, 20, 50, 100, 200, or 500. When training multiple models with different hyperparameter combinations, stop creating new models after the specified number of models are created.

A specified amount of training time passes (default: 10 minutes)
Select 5 minutes, 10 minutes, 20 minutes, 30 minutes, 1 hour, 2 hours, 12 hours, or 24 hours. When training multiple models with different hyperparameter combinations, stop creating new models after the specified amount of time passes.
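These error metrics have standard formulas and are easy to compute directly from a set of predictions. A quick illustrative sketch in plain Python (not the product's scorer), showing how each metric is derived from the same prediction errors:

```python
import math

def metrics(y_true, y_pred):
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errs) / len(errs)          # squares penalize big misses
    mae = sum(abs(e) for e in errs) / len(errs)         # no squaring: gentler on big misses
    rmse = math.sqrt(mse)                               # back in the label's units
    # RMSLE compares log(1 + value), so it scores ratios rather than absolute
    # differences and penalizes underestimates more than overestimates.
    rmsle = math.sqrt(sum((math.log1p(p) - math.log1p(t)) ** 2
                          for t, p in zip(y_true, y_pred)) / len(errs))
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "RMSLE": rmsle}

m = metrics([100, 200, 300], [110, 190, 330])
```

Notice in the example that the single large error (30) dominates MSE and RMSE far more than it dominates MAE.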

Hyperparameters

As mentioned in the previous section, Visual Notebooks trains many different models with various hyperparameter combinations. The fields in this section determine the hyperparameter options used during training. Although you don't need to alter these fields to train a high-performing model, it can be interesting to explore different combinations.

Hyperparameters give you precise control over a model. You can use these to tell the model how quickly to learn, when to stop improving, and what to prioritize during the learning process. In general, the goal of changing the hyperparameters is to make the best possible model while avoiding overfitting. A model is considered overfit when it is too closely aligned to the training data to produce accurate predictions on unseen data.

Hyperparameters Search (default: Search)
Select Search to train multiple models with different hyperparameter combinations and then compare the models to find the best one. Select Fixed to train a single model with a fixed hyperparameter configuration.

Number of Trees in Forest (default: 20, 50, 100, 150, 200)
Enter an integer between 1 and 10,000. More trees create a more accurate model but can lead to overfitting. Values between 50 and 200 are common. If you define a fixed model, the default is 50.

Maximum Tree Depth (default: 4, 8, 12, 16, 20, 40)
Enter an integer between 0 and 100. Setting this value to 0 specifies no limit. Increasing the tree depth allows the model to fine-tune its performance but may lead to overfitting. If you define a fixed model, the default is 20.

Minimum Number of Bins (default: 16, 32, 64, 128, 256, 512)
Specify the minimum number of bins in the histogram used to choose split points. When deciding where to split a decision tree, Visual Notebooks considers ranges, or bins, of the features and splits on the bin that reduces the error of the tree the most. Increasing this value makes the model more accurate but may result in overfitting. If you define a fixed model, the default is 32.

Number of Bin Categories (default: 128, 256, 512, 1024, 2048, 4096)
Specify the number of bins used to split the trees for categorical data at the root level. This number is decreased by a factor of two at each new level until it reaches Minimum Number of Bins. Increasing this value makes the model more accurate but can result in overfitting and increased runtime. If you define a fixed model, the default is 1024.

Number of Bins Top Level (default: 128, 256, 512, 1024, 2048, 4096)
Specify the number of bins used to split the trees at the root level. This number is decreased by a factor of two at each new level until it reaches Minimum Number of Bins. Increasing this value makes the model more accurate but can result in overfitting and increased runtime. If you define a fixed model, the default is 1024.
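With Search selected, training effectively iterates over combinations drawn from these option lists and keeps the best-scoring model. A hypothetical sketch of that loop: the search space mirrors two of the defaults above, and `evaluate` is a stand-in for training and scoring a real forest.

```python
import itertools

# Hypothetical search space mirroring the defaults above.
search_space = {
    "n_trees": [20, 50, 100, 150, 200],
    "max_depth": [4, 8, 12, 16, 20, 40],
}

def evaluate(params):
    # Placeholder scorer: in a real search this would train a random forest
    # with `params` and return its validation error (lower is better).
    return abs(params["n_trees"] - 100) + abs(params["max_depth"] - 12)

best = None
for values in itertools.product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    score = evaluate(params)
    if best is None or score < best[0]:
        best = (score, params)

print(best[1])  # {'n_trees': 100, 'max_depth': 12}
```

The stopping conditions from the previous section simply cut this loop short: after enough models, enough time, or enough rounds without improvement.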

Repeatability Seed

Random numbers are used throughout the training process for splitting the original dataset, splitting individual trees, and optimizing hyperparameters. Visual Notebooks uses one number, called a seed, to generate those random numbers. The field in this section allows you to enter a custom seed. If you enter a custom seed, you can enter the same seed at a later date to reproduce the results of the training.

Seed (default: Random)
The number used throughout the AutoML process. Select Random to use a random number, or select Custom to enter a specific integer.
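The effect of a fixed seed is easy to demonstrate: seeding a random number generator with the same integer reproduces the same draws, and therefore the same splits and bootstrap samples. A minimal illustration in plain Python (not the product's internals):

```python
import random

def bootstrap_indices(n, seed):
    rng = random.Random(seed)  # one seed drives every random draw
    return [rng.randrange(n) for _ in range(n)]

# Re-running with the same custom seed reproduces the same "random" sample,
# which is what makes a training run repeatable at a later date.
first = bootstrap_indices(10, seed=42)
second = bootstrap_indices(10, seed=42)
print(first == second)  # True
```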

Prediction

The output of this node is each model's predictions on the training data. This section determines how the predictions are portrayed in the resulting dataframe.

Prediction Column Name (default: prediction)
Enter a name for the column that contains the selected model's predictions. Column names can contain alphanumeric characters and underscores, but cannot contain spaces.

Dataset Selection (default: Train Dataset)
Select All Data, Train Dataset, Validation Dataset, or Test Dataset. Visual Notebooks displays a selected model's predictions on the dataset selected with this field. If you select Cross-validation for the Select test and validation method field, the Validation Dataset option is unavailable.

Include all columns (default: off)
Toggle this field to include all columns in the predictions table, including columns that you did not use as features for the model. By default, only columns you selected as features are included.

Node Inputs/Outputs

Input: A Visual Notebooks dataframe
Output: Trained random forest regression models and a dataframe with predictions on the training data


Figure 1: Example output

Examples

The dataframe shown in Figure 2 contains identifying characteristics of over 300 penguins. This data is used to train a model that can predict a penguin's body mass given its bill length, bill depth, and flipper length. This is a regression problem because you are trying to predict a continuous, numeric value.

Figure 2: Example input data

Follow the steps below to train a model that can predict a penguin's body mass given the input data.

  1. Connect a Random Forest Regression node to an existing node.
  2. Select body_mass_g (Integer) for the Select Column with Labels field. The model predicts the values in this column after training.
  3. Select Train to train models with the default settings.

Notice that Visual Notebooks trains multiple models, each with different hyperparameter configurations. All trained models are displayed on a leaderboard and ranked by performance based on the mean residual deviance of each model. Models with lower mean residual deviance values offer better predictions.
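The leaderboard's ranking is an ascending sort on the chosen metric. A sketch with hypothetical model IDs and made-up mean residual deviance scores:

```python
models = [
    # Hypothetical leaderboard entries: (model id, mean residual deviance).
    ("rf_3", 61234.5),
    ("rf_1", 58310.2),
    ("rf_2", 90477.8),
]

# Lower mean residual deviance ranks first, so the best model leads.
leaderboard = sorted(models, key=lambda m: m[1])
print([name for name, _ in leaderboard])  # ['rf_1', 'rf_3', 'rf_2']
```

For a regression problem with a Gaussian response, mean residual deviance is typically equivalent to mean squared error, which is why lower values indicate better predictions.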

Figure 3: Model leaderboard

Follow the steps below to learn more about a specific model on the leaderboard.

  1. Select a model, then scroll down to view information about the model and a bar chart with the importance of each feature.
  2. Select Calculate Additional Details to view additional test metrics and a scalar regression chart. The button appears dimmed after it has been selected. For more information about test metrics, see the Visual Notebooks User Guide.

The scalar regression chart shows the model's predictions as a gray line. The actual values are displayed as blue dots. Although the model in Figure 4 does not accurately predict all values, it successfully captures the general trend of the data.

Figure 4: Model details

Follow the steps below to view the selected model's predictions.

  1. After you select a model, navigate to the Predictions tab.
  2. Select Calculate Predictions to view the selected model's predictions on the training data. The button appears dimmed after it has been selected.

If the leading model doesn't perform as well as you'd like it to, try altering the advanced configuration options and training new models.

Note that if your model correctly predicts all values, it might be overfit. In other words, the model may be too closely aligned to the training data to make accurate predictions on unseen data. Try altering the hyperparameters or using a different AutoML node.

Figure 5: The selected model's predictions
