Python Estimator

Configure and train a machine learning model using the scikit-learn Python library in Visual Notebooks. This node is an estimator node, meaning it outputs a fitted model that can be used in an ML Pipeline to generate predictions on test data. The Python Estimator node is useful if you need access to the breadth of models offered by scikit-learn or finer control over model parameters.

To learn more about scikit-learn estimators, see the scikit-learn documentation.

Configuration

Field | Description
Name (default=none) | Field to name the node. A user-specified node name, displayed in the canvas and in the dataframe as a tab.
Columns (Required) | Feature columns. Select the feature columns to be used in model training. Column names are stored in a list and can be accessed through the feature_columns variable in the notebook.
Training Label (Required) | Label column. Select the column of labels to be used in model training. The column name can be accessed through the label_column variable in the notebook.
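As a minimal sketch of how these two variables are typically used, the snippet below selects features and labels from a pandas dataframe. The dataframe contents and the column names are made up for illustration; in the node, feature_columns and label_column are set from the Columns and Training Label fields.

```python
import pandas as pd

# Hypothetical stand-in for the node's input dataframe.
df = pd.DataFrame({
    "temperature": [20.1, 21.4, 19.8],
    "humidity": [0.41, 0.38, 0.45],
    "usage": [3.2, 3.9, 2.8],
})

# Assumed values; in the node these come from the configuration fields.
feature_columns = ["temperature", "humidity"]  # a list of column names
label_column = "usage"                         # a single column name

X = df[feature_columns]  # DataFrame holding only the feature columns
y = df[label_column]     # Series holding the labels
```

Indexing with a list returns a dataframe, while indexing with a single name returns a series, which is the shape scikit-learn's fit() expects for features and labels respectively.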

Function Definitions

Function | Description
result_schema(input_schema) | Schema of columns added to input dataframe. Specify the schema, including name and data type, of any column(s) added to the input dataframe. Typically, this is done for a column of predictions. Columns should be appended to input_schema, which is a list containing the schema of the input dataframe.
train(df, feature_columns, label_column) | Fits a model to training data. Configure and fit an estimator, or machine learning model, to the training data. The fit() method must be called to generate the fitted model, which is returned.
process(trained_model, df, feature_columns, prediction) | Adds predictions to input dataframe. Generate predictions by applying the predict() method to the trained model. Predictions should be appended to the input dataframe, which is returned.
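To see how the three functions fit together, the sketch below defines them with a different scikit-learn estimator (Ridge) and calls them in sequence from a small local harness. The harness is hypothetical: how Visual Notebooks actually invokes these functions, the exact format of input_schema entries (assumed here to be [name, dtype] pairs), and the value passed as prediction (None here) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def result_schema(input_schema):
    # Append the schema entry for the new prediction column.
    input_schema.extend([["Predictions", "float64"]])
    return input_schema

def train(df, feature_columns, label_column):
    # fit() returns the fitted estimator itself.
    return Ridge(alpha=1.0).fit(df[feature_columns], df[label_column])

def process(trained_model, df, feature_columns, prediction):
    # Add one prediction per input row and return the dataframe.
    df["Predictions"] = trained_model.predict(df[feature_columns])
    return df

# Hypothetical local harness with synthetic data (not part of the node).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 2.0 * df["x1"] - df["x2"]

feature_columns, label_column = ["x1", "x2"], "y"
schema = result_schema([["x1", "float64"], ["x2", "float64"], ["y", "float64"]])
model = train(df, feature_columns, label_column)
out = process(model, df.copy(), feature_columns, None)
```

Any scikit-learn estimator that implements fit() and predict() could be swapped in for Ridge without changing the other two functions, as long as the declared data type still matches the predictions.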

Node Inputs/Outputs

Input | A Visual Notebooks dataframe
Output | A dataframe, typically with a column of predictions included

Figure 1: Example dataframe output

Examples

This example uses the data shown in Figure 2. It contains electricity consumption data from DAEWOO Steel Co., Ltd, a steel producer in Gwangyang, South Korea [1]. The goal is to train a model that predicts energy usage one timestep in advance.

Figure 2: Example input data
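The label column in this example, lead_1_Usage_kWh_scaled, holds the usage value one timestep ahead of each row. The source does not say how that column was built; the sketch below shows one common way to derive such a label with pandas, assuming a source column named Usage_kWh_scaled (a hypothetical name inferred from the label) and made-up values.

```python
import pandas as pd

# Toy stand-in for the scaled usage column; the values are invented.
df = pd.DataFrame({"Usage_kWh_scaled": [0.10, 0.25, 0.40, 0.30]})

# shift(-1) moves each value up one row, so row t holds the value from t+1.
df["lead_1_Usage_kWh_scaled"] = df["Usage_kWh_scaled"].shift(-1)

# The final row has no next timestep, so its label is NaN and is dropped.
df = df.dropna(subset=["lead_1_Usage_kWh_scaled"]).reset_index(drop=True)
```

With a label built this way, each row pairs the features observed at one timestep with the usage at the next, which is what lets a regressor learn a one-step-ahead forecast.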

  1. Connect a Python Estimator node to an existing node. In this case, it is connected to a CSV node with the example data provided.
  2. Select all columns except for date (Timestamp) and lead_1_Usage_kWh_scaled (Double) in Columns.
  3. Select lead_1_Usage_kWh_scaled (Double) in Training Label.
  4. Copy the code below and paste it into the Notebook tab, as shown in Figure 3.
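from sklearn.ensemble import GradientBoostingRegressor

def result_schema(input_schema):
    # Declare the name and data type of the column that process() adds.
    input_schema.extend([['Predictions', 'float64']])
    return input_schema

def train(df, feature_columns, label_column):
    # Fix random_state so repeated runs produce the same model.
    estimator = GradientBoostingRegressor(random_state=0)
    # fit() returns the fitted estimator.
    model = estimator.fit(df[feature_columns], df[label_column])
    return model

def process(trained_model, df, feature_columns, prediction):
    # Predict from the same feature columns used in training.
    prediction = trained_model.predict(df[feature_columns])
    # The column name matches the one declared in result_schema().
    df['Predictions'] = prediction
    return df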
  5. Click Run.

Figure 3: Notebook with filled in functions

Note that the output dataframe is identical to the one in Figure 1. The gradient boosting regressor is fitted to the training data and then used to generate a column of predictions for subsequent analysis.
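One possible form of subsequent analysis is to compare the Predictions column against the label column. The sketch below computes the mean absolute error with scikit-learn on a small, made-up dataframe standing in for the node's output; the values are invented and are not results from the example dataset.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up rows standing in for the node's output dataframe.
out = pd.DataFrame({
    "lead_1_Usage_kWh_scaled": [0.20, 0.50, 0.40],
    "Predictions": [0.25, 0.45, 0.40],
})

# Average absolute difference between the labels and the predictions.
mae = mean_absolute_error(out["lead_1_Usage_kWh_scaled"], out["Predictions"])
```

If the rows being scored are the same rows the model was trained on, an error computed this way will generally be optimistic, so a held-out set gives a more reliable estimate.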

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
