Reverse Categorical Encoding

The Reverse Categorical Encoding node in Visual Notebooks removes encoding from your dataset. If you encoded columns with the Label Encoding or One Hot Encoding node to preprocess data for machine learning algorithms, you might later need to unencode those columns.

Configuration

Field	Description
Name default=none	Field to name the node. An optional user-specified node name displayed in the workspace, both on the node and in the dataframe as a tab.
Columns Required	Columns for reverse categorical encoding. Select columns where encoding has been applied.
Output column suffix Required	Suffix for columns with reversed encoding. Create a suffix to add to all columns with reversed categorical encoding.
Drop Original Column(s) default=`on`	Original column handling. Select whether to drop original column(s) with the toggle `on` (default) or to keep original column(s) with the toggle `off`.

Node Inputs/Outputs

Input	A Visual Notebooks dataframe with encoded columns, plus the labels from the Label Encoding or One Hot Encoding node
Output	A dataframe with categorical encoding reversed
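
To make the relationship between the encoded dataframe and the labels concrete, here is a minimal conceptual sketch in plain Python. It is not the Visual Notebooks implementation: the function names are hypothetical, and assigning codes in order of first appearance is an assumption (the node may order its codes differently). It only illustrates that reversing an encoding requires the labels produced at encoding time.

```python
def label_encode(values):
    """Map each distinct value to an integer code (assumed: order of first appearance)."""
    labels = []   # labels[code] is the original category
    index = {}    # category -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(labels)
            labels.append(value)
        codes.append(index[value])
    return codes, labels


def reverse_label_encode(codes, labels):
    """Look each code up in the labels to recover the original categories."""
    return [labels[code] for code in codes]


def reverse_one_hot(rows, labels):
    """Recover the category from the position of the 1 in each one-hot row."""
    return [labels[row.index(1)] for row in rows]


currencies = ["USD", "EUR", "USD", "GBP"]
codes, labels = label_encode(currencies)
restored = reverse_label_encode(codes, labels)
print(codes)     # integer codes that replace the strings
print(restored)  # the original strings, recovered through the labels
```

Without the labels, the integer codes alone cannot be mapped back to categories, which is why the node takes both a dataframe and the labels as input.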

Figure 1: Example output dataframe

Examples

In this example, we have a dataset of data scientist salaries with 607 rows. The dataset includes employee residence country codes, salary currency, company location, and salary in USD. Three of those columns were encoded to preprocess the data for other machine learning nodes: salary_currency (String), employee_residence (String), and company location (String). We now need to unencode those columns.

Figure 2: Example input data

Pre-Setup

Add the following nodes to the workspace. See each node for more information.

CSV node. Add the sample data to the CSV node and select Run.
Label Encoding node (the example gives information for Label Encoding, but the One Hot Encoding node can be substituted)
- Connect the CSV node to the Label Encoding node
- Encode salary_currency (String), employee_residence (String), and company location (String)
- Add an _encode suffix
- Select skip for invalid labels
- Toggle the Drop Original Column(s) button on
- Select Run
Discretizer node
- Connect the Label Encoding node to the Discretizer node
- Select salary_in_usd (Integer) for the column to discretize
- Enter 5 bins
- Select K-means Discretizer for the Discretization Method
- Select Ordinal - Categorical Encoding for the Encoding
- Toggle the Keep Original Columns button on
- Enter _Discretized for the Output column suffix
- Select Run
Group and Aggregate node
- Connect the Discretizer node to the Group and Aggregate node
- Select all columns in the GroupBy Columns field
- Add an aggregation method
  - Column: salary_in_usd
  - Method: Average
- Select Run
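
As a rough illustration of what the Group and Aggregate step produces, the sketch below groups rows by their encoded columns and averages salary_in_usd. It is plain Python, not the Visual Notebooks implementation; the function name, the `_avg` output column name, and the sample rows are all hypothetical, and the Discretizer step is left out for brevity.

```python
from collections import defaultdict


def group_and_average(rows, group_by, value_column):
    """Group rows by the given columns and average one numeric column."""
    totals = defaultdict(lambda: [0.0, 0])  # group key -> [running sum, row count]
    for row in rows:
        key = tuple(row[column] for column in group_by)
        totals[key][0] += row[value_column]
        totals[key][1] += 1
    return [
        {**dict(zip(group_by, key)), f"{value_column}_avg": total / count}
        for key, (total, count) in totals.items()
    ]


# Hypothetical rows as they might look after label encoding.
rows = [
    {"salary_currency_encode": 0, "employee_residence_encode": 8, "salary_in_usd": 260000},
    {"salary_currency_encode": 0, "employee_residence_encode": 8, "salary_in_usd": 240000},
    {"salary_currency_encode": 1, "employee_residence_encode": 2, "salary_in_usd": 100000},
]
averaged = group_and_average(
    rows, ["salary_currency_encode", "employee_residence_encode"], "salary_in_usd"
)
for row in averaged:
    print(row)
```

The aggregated rows still carry only integer codes, so a later reverse-encoding step needs the labels from the encoding step to turn them back into readable categories.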

Figure 3: Workspace Setup Example. Input -> Label Encoding (or One Hot Encoding) -> Feature Engineering nodes -> Reverse Encoding -> Feature Engineering

Example steps

Connect a Reverse Categorical Encoding node to an existing node. In this case, the dataset port is connected to the Group and Aggregate node and the labels port is connected to the Label Encoding node.
Optionally, name the Reverse Categorical Encoding node. In the example, the node is named Reverse Encoding.
Select the column label(s) you'd like to reverse the encoding for. In Figure 4 the following selections are made:
- salary_currency_encode
- employee_residence_encode
- company location_encode
Create a suffix to assign to the new output column(s). In Figure 4, _reverse_encoding is added as the suffix for each new column.
Toggle Drop Original Column(s) on to drop the original encoded columns, or off to keep them. In the example, the toggle is on.
Select Run.
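
The effect of these settings can be sketched in plain Python as follows. This is a conceptual illustration, not the node's actual implementation: the function name is hypothetical, the label tables contain only the code-to-label pairs described for this example (0 for USD, 8 for JP), and naming the output column as the selected column name plus the suffix is an assumption.

```python
def reverse_columns(rows, labels, columns, suffix, drop_original=True):
    """Decode the selected columns into new '<column><suffix>' columns."""
    result = []
    for row in rows:
        new_row = dict(row)
        for column in columns:
            code = int(row[column])  # codes may be displayed as floats such as 8.00
            new_row[column + suffix] = labels[column][code]
            if drop_original:
                del new_row[column]
        result.append(new_row)
    return result


# Partial label tables (code -> label), limited to the values used in this example.
labels = {
    "salary_currency_encode": {0: "USD"},
    "employee_residence_encode": {8: "JP"},
    "company location_encode": {8: "JP"},
}
encoded_rows = [
    {
        "job_title": "Machine Learning Scientist",
        "salary_in_usd": 260000,
        "salary_currency_encode": 0.0,
        "employee_residence_encode": 8.0,
        "company location_encode": 8.0,
    }
]
decoded = reverse_columns(
    encoded_rows, labels, list(labels), "_reverse_encoding", drop_original=True
)
print(decoded[0])
```

With `drop_original=False`, each row would keep the encoded columns alongside the new decoded ones.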

Figure 4 shows the dataframe with three new columns of unencoded information. The original encoded columns have been dropped.

Figure 4: Example dataframe with unencoded columns

Figure 5a, Figure 5b, and Figure 5c compare the dataset before encoding, after encoding, and after reversing the encoding.

Note: The dataframes have columns removed so that relevant columns can be viewed more easily.

Following the second row (where the number column is 1), we see:

Figure 5a: The original dataset shows Machine Learning Scientist for the job title and 260000 for the salary in USD. The currency is USD, the employee residence is JP, and the company location is JP.
Figure 5b: The encoded dataset shows the same row, with Machine Learning Scientist for the job title and 260000 for the salary in USD, but with 0.00 for the salary currency and 8.00 for both the employee residence and the company location.
Figure 5c: The reverse-encoded dataset shows Machine Learning Scientist for the job title and 260000 for the salary in USD, with USD for the salary currency and JP for both the employee residence and the company location once again.