One Hot Encoding

Use One Hot Encoding to preprocess categorical data variables (data that falls into groups) so that machine learning models can use them more effectively.

One hot encoding converts categorical data variables into a numeric form that can be provided to machine learning algorithms. It is a data preprocessing technique that represents each category without implying an order among the categories. This better representation of data groups can improve the accuracy and learning of the models being created.

Configuration

Field | Description
Name (default=none) | Field to name the node. An optional user-specified node name displayed in the workspace, both on the node and in the dataframe as a tab.
Columns (Required) | Select columns for encoding. Columns with fewer than 100 unique values can be selected for encoding. The columns can be strings, integers, or doubles.
Output column suffix (Required) | Enter or select a suffix for encoded column(s). Enter a suffix to append to the end of the encoded column name, or select a suffix you've used before from the dropdown menu.
Invalid labels handling (default=skip) | How to handle invalid labels. Select skip, keep, or error to set how cells with invalid labels are handled.
Drop Original Column(s) (default=On) | Toggle to handle original column. Keep this toggle switch on to drop the original column, or toggle it off to keep the original column. The renamed encoded column remains in the dataset.
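
The options above can be sketched in plain Python. This is a rough illustration, not the node's implementation. The function name `encode_column`, the `_` separator between the column name and the suffix, and the behavior of each invalid-label option (skip drops the row, keep adds an extra slot for unknown labels, error raises an exception) are all assumptions made for this example. An invalid label is taken here to mean a value that is not among the known categories.

```python
MAX_UNIQUE = 100  # the node accepts columns with fewer than 100 unique values


def encode_column(rows, column, categories, suffix,
                  invalid="skip", drop_original=True):
    """One hot encode `column` in a list of dict rows (illustrative only)."""
    if len(categories) >= MAX_UNIQUE:
        raise ValueError("too many unique values to encode")
    if invalid not in ("skip", "keep", "error"):
        raise ValueError(f"unknown invalid-label option: {invalid}")

    index = {category: i for i, category in enumerate(categories)}
    # Assumed: "keep" reserves one extra slot for unknown labels.
    width = len(categories) + (1 if invalid == "keep" else 0)
    output = []
    for row in rows:
        value = row[column]
        if value in index:
            position = index[value]
        elif invalid == "skip":
            continue  # assumed: rows with invalid labels are dropped
        elif invalid == "keep":
            position = len(categories)
        else:
            raise ValueError(f"invalid label: {value!r}")
        vector = [0] * width
        vector[position] = 1
        new_row = dict(row)
        if drop_original:
            del new_row[column]
        new_row[f"{column}_{suffix}"] = vector  # assumed naming scheme
        output.append(new_row)
    return output


# Hypothetical sample rows; "Unknown" is not a known category.
rows = [{"id": 1, "island": "Dream"}, {"id": 2, "island": "Unknown"}]
known = ["Biscoe", "Dream", "Torgersen"]
skipped = encode_column(rows, "island", known, "cluster", invalid="skip")
kept = encode_column(rows, "island", known, "cluster", invalid="keep",
                     drop_original=False)
```

With these assumptions, `skipped` contains only the first row, while `kept` retains both rows and the original island column, with the unknown label encoded in the extra final slot.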

Node Inputs/Outputs

Input | A Visual Notebooks dataframe
Output | A dataframe with one hot encoded columns

Figure 1: Example dataframe output

Examples

In this example, we have a dataset used for penguin analysis. The data describes different species of penguins on different islands, and the species and island information is provided as strings. If these strings are treated as an ordered list (for example, numbered, or sorted alphabetically or reverse-alphabetically), an algorithm can assign significance or weight to the items based on the order in which they appear, even though no such significance is intended.

The One Hot Encoding node converts the strings in the columns to arrays of 0s and 1s. By encoding these columns, the criteria in our analysis can be used without placing more importance on one species or island over another.
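
To make the idea concrete, here is a minimal Python sketch of one hot encoding that uses only the standard library. It illustrates the general technique rather than the node's actual implementation; the function name `one_hot` and the sample island values are assumptions made for this example.

```python
def one_hot(values):
    """Map each value to a list of 0s and 1s containing a single 1.

    Categories are sorted only to make the output deterministic; the
    position of the 1 carries no ranking.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical sample values for the island column.
islands = ["Torgersen", "Biscoe", "Dream", "Biscoe"]
categories, encoded = one_hot(islands)
for island, row in zip(islands, encoded):
    print(island, row)
```

Every encoded row contains exactly one 1, so no island is numerically larger than another, unlike integer codes such as 0, 1, and 2.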

Figure 2: Example input data

  1. Connect a One Hot Encoding node to an existing node. In this case, it is connected to a CSV node with the Penguin Analysis file.
  2. Optionally, name the One Hot Encoding node. In the example, the node is named "Encoding."
  3. Select the column you'd like to encode. In Figure 3, the island (String) column is selected.
  4. Enter a suffix you'd like to assign to the encoded column, or select a suffix from the dropdown menu with suffixes you've previously used in Visual Notebooks. In Figure 3, "cluster" is added as a suffix.
  5. Select your choice for handling invalid labels. Figure 3 shows the skip handling option.
  6. Leave the Drop Original Column(s) toggle on, which is the default. It remains on in Figure 3.

Figure 3 shows the dataframe with the island column encoded.

Figure 3: Example dataframe with island column encoded

We can also encode a second column. For Figure 4, the following selections have been made:

  • Columns: island (String) and species (String)
  • Output column suffix: "cluster"
  • Invalid labels handling: error
  • Drop Original Column(s): toggle is on

The resulting dataframe has both columns encoded, with a "cluster" suffix added to the new column names and the original columns removed.

Figure 4: Example dataframe with island and species encoded
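
The two-column selection above can be approximated in Python as follows. This is a sketch under stated assumptions, not the node's implementation: the function name `encode_columns`, the `_` separator before the suffix, the sample rows, and the reading of the error option as "stop when a label is missing" are all hypothetical.

```python
def encode_columns(rows, columns, suffix):
    """Encode several columns, dropping the originals (illustrative only)."""
    # Derive the categories of each column from the data itself.
    categories = {}
    for column in columns:
        values = [row[column] for row in rows]
        if any(value is None for value in values):
            # Assumed meaning of the "error" option: stop on a missing label.
            raise ValueError(f"invalid label in column {column!r}")
        categories[column] = sorted(set(values))

    output = []
    for row in rows:
        # Drop the original columns, keep everything else.
        new_row = {k: v for k, v in row.items() if k not in columns}
        for column in columns:
            cats = categories[column]
            vector = [0] * len(cats)
            vector[cats.index(row[column])] = 1
            new_row[column + "_" + suffix] = vector  # assumed naming scheme
        output.append(new_row)
    return output


# Hypothetical sample rows; the real file has more columns and rows.
penguins = [
    {"species": "Adelie", "island": "Torgersen", "body_mass_g": 3750},
    {"species": "Gentoo", "island": "Biscoe", "body_mass_g": 5000},
    {"species": "Chinstrap", "island": "Dream", "body_mass_g": 3800},
]
encoded = encode_columns(penguins, ["island", "species"], "cluster")
```

Each row in `encoded` keeps its other fields and gains `island_cluster` and `species_cluster` columns in place of the original island and species columns.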
