One Hot Encoding
Use One Hot Encoding to preprocess categorical data variables (a collection of grouped data) so that machine learning models can use them more effectively.
One hot encoding converts categorical data variables into a numeric form that can be provided to machine learning algorithms. It is a data transformation and preprocessing technique that helps models interpret categorical data. Representing data groups this way can improve the accuracy and learning of the models being created.
Configuration
| Field | Description |
|---|---|
| Name default=none | Field to name the node. An optional user-specified node name displayed in the workspace, both on the node and in the dataframe as a tab. |
| Columns Required | Select columns for encoding. Columns with fewer than 100 unique values can be selected for encoding. The columns can be strings, integers, or doubles. |
| Output column suffix Required | Enter or select a suffix for encoded column(s). Enter a suffix to append to the end of the encoded column name, or select a suffix you've used before in the dropdown menu. |
| Invalid labels handling default=skip | How to handle invalid labels. Select skip, keep, or error to set how cells with invalid labels are handled. |
| Drop Original Column(s) default=On | Toggle to handle the original column. Keep this toggle on to drop the original column, or turn it off to keep the original column. The renamed encoded column remains in the dataset. |
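The Columns field accepts only columns with fewer than 100 unique values. The following is a minimal sketch of that check, not the node's actual implementation: the `eligible_columns` helper, the dictionary-of-lists table layout, and the sample column names are all hypothetical and chosen only for illustration.

```python
MAX_UNIQUE = 100  # limit stated in the Columns field description


def eligible_columns(table):
    """Return the names of columns with fewer than MAX_UNIQUE distinct values.

    `table` is a hypothetical stand-in for a dataframe:
    a dict that maps each column name to its list of values.
    """
    return [
        name
        for name, values in table.items()
        if len(set(values)) < MAX_UNIQUE
    ]


# Illustrative data only: 150 rows, one low-cardinality column
# and one column in which every value is unique.
table = {
    "island": ["Biscoe", "Dream", "Torgersen"] * 50,
    "row_id": list(range(150)),
}

print(eligible_columns(table))
```

In this sketch, `island` has three distinct values and qualifies, while `row_id` has 150 distinct values and does not.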
Node Inputs/Outputs
| Input | A Visual Notebooks dataframe |
|---|---|
| Output | A dataframe with one hot encoded columns |

Figure 1: Example dataframe output
Examples
In this example, we have a dataset used for penguin analysis. It contains data collected about different species of penguins on different islands. The species and island information is provided as strings. However, if these strings are represented as an ordered list (whether numbered, alphabetical, or reverse-alphabetical), an algorithm can assign significance or weight to the items based on the order in which they appear, even though no such significance is intended.
The One Hot Encoding node converts the strings in the selected columns to arrays of 0s and 1s. Encoding these columns lets the analysis use them without placing more importance on one species or island than another.
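To make the idea of "arrays of 0s and 1s" concrete, here is a minimal sketch of one hot encoding in plain Python. It illustrates the general technique only; the `one_hot` helper and the island names are assumptions for illustration, and the exact array layout produced by the node may differ.

```python
def one_hot(values):
    """Map each value to a list of 0s and 1s that contains exactly one 1."""
    # Fix one position per category so every row uses the same layout.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the position of this value's category
        encoded.append(row)
    return categories, encoded


# Illustrative island names, not taken from the figures.
islands = ["Torgersen", "Biscoe", "Dream", "Biscoe"]
categories, encoded = one_hot(islands)

for island, row in zip(islands, encoded):
    print(island, row)
```

Every category gets its own position, so no island is numerically "larger" than another. This is the property that avoids the unintended ordering described above.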

Figure 2: Example input data
- Connect a One Hot Encoding node to an existing node. In this case, it is connected to a CSV node with the Penguin Analysis file.
- Optionally, name the One Hot Encoding node. In the example, the node is named "Encoding."
- Select the column you'd like to encode. In Figure 3, the island (String) column is selected.
- Enter a suffix you'd like to assign to the encoded column, or select a suffix from the dropdown menu with suffixes you've previously used in Visual Notebooks. In Figure 3, "cluster" is added as a suffix.
- Select your choice for handling invalid labels. Figure 3 shows the skip handling option.
- Leave the Drop Original Column(s) toggle on, which is the default. It remains on in Figure 3.
Figure 3 shows the dataframe with the island column encoded.

Figure 3: Example dataframe with island column encoded
We can also encode a second column. For Figure 4, the following selections have been made:
- Columns: island (String) and species (String)
- Output column suffix: "cluster"
- Invalid labels handling: error
- Drop Original Column(s): toggle is on
The resulting dataframe shows both columns encoded, with the "cluster" suffix added to the new column names and the original columns removed.

Figure 4: Example dataframe with island and species encoded
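The selections used for Figure 4 can be approximated outside Visual Notebooks with a short sketch. Everything here is an assumption for illustration: the `encode_columns` helper is hypothetical, the sample rows are made up, and the way the suffix is joined to the column name (direct concatenation, with the caller passing `"_cluster"`) is a guess, because the exact naming used by the node is not specified here. Invalid label handling is not modeled.

```python
def encode_columns(rows, columns, suffix, drop_original=True):
    """Return new rows with each listed column one hot encoded.

    rows: list of dicts, one dict per dataframe row.
    The encoded column is named column + suffix (naming is an assumption).
    """
    # Fix a category order per column so every row uses the same positions.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}

    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for c in columns:
            new_row[c + suffix] = [
                1 if row[c] == category else 0 for category in categories[c]
            ]
            if drop_original:
                del new_row[c]
        result.append(new_row)
    return result


# Made-up rows for illustration, not the data shown in the figures.
penguins = [
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.1},
    {"species": "Gentoo", "island": "Biscoe", "bill_length_mm": 46.1},
    {"species": "Chinstrap", "island": "Dream", "bill_length_mm": 46.5},
]

encoded = encode_columns(penguins, ["island", "species"], "_cluster")
for row in encoded:
    print(row)
```

Passing `drop_original=False` keeps the original `island` and `species` columns alongside the encoded ones, which mirrors turning the Drop Original Column(s) toggle off.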