C3 AI Documentation Home

Group By and Aggregate

Use the Group By and Aggregate node in Visual Notebooks to group your data and see aggregate functions on data, and optionally associate the aggregations with one or more groups.

Configuration

FieldDescription
Name default=noneField to name the node An optional user-specified node name displayed in the workspace, both on the node and in the dataframe as a tab.
Select Column(s) to Group By (Optional) default=noneSelection to group the data by a column Optionally choose a column and group the data by your selection.
Select Column(s) to Aggregate *RequiredColumns to aggregate Specify which columns to aggregate.
Select Aggregation Method(s) *RequiredApply aggregation method Select an aggregation method to apply to the columns specified in Select Column(s) to Aggregate.
Add additional aggregation default=noneApply additional aggregation Select additional columns to aggregate and aggregation methods to apply to those columns.
Sort By default=noneOption to sort the data Optionally choose a column to sort by. The sort can be reversed and additional sort steps can be added.

Node Inputs/Outputs

InputA Visual Notebooks dataframe
OutputA dataframe with specified columns to group by and/or apply aggregation methods

Example dataframe output

Figure 1: Example output

Example

Analysis of cat breeds is often used in designing products and foods. In our example, we use a dataset with 700 rows of data with 4 columns about several different breeds of cats.

Example input data

Figure 2: Example input data

First, we explore how Aggregation selections work in the node.

  1. Connect the Group By and Aggregate node to an existing node. In this case, the node is connected to a CSV node with the sample data.
  2. In the Select Column(s) to Aggregate field, select the dataset columns you'd like to aggregate. In this case, height_in (Double), weight_lbs (Double), and life_span_yrs (Double) are selected.
  3. In the Select Aggregation Method(s) field, choose which aggregation method(s) you'd like to apply the selected columns. Figure 3 shows Maximum, Average, Standard Deviation, and Variance are selected.
  4. Select Run.

The dataset in Figure 3 shows 12 columns with each of the 4 aggregation methods applied on the 3 columns selected for aggregation. The resulting dataset shows a single line with the maximum, average, standard deviation, and variance of all 700 rows of data.

The full list of the available aggregation method selections is here:

  • Sum
  • Minimum
  • Maximum
  • Average
  • Count
  • Count Distinct
  • Standard Deviation
  • Variance
  • 25% Percentile
  • Median
  • 75% Percentile

Note: The Sum Distinct and Count Distinct methods apply to unique values.

Example of aggregation on selected columns

Figure 3: Example of aggregation on selected columns

Optionally, the aggregation columns can be grouped by another column.

  1. In the Select Column(s) to Group by (Optional) field, select a column to group by. Figure 4 shows the breed (String) selected to group by.
  2. Select Run.

The group by selection shows all the same aggregations by breed.

To create new aggregation methods on different columns in the same dataset, select the Add additional aggregation button.

Example of aggregations grouped by breed

Figure 4: Example of aggregations grouped by breed

Once you've completed your aggregations and grouped the data by a column, another option available in this node is to sort.

  1. In the Advanced Configuration section, select a column to sort by. In Figure 5, the dataframes is sorted by AVG_life_span_yrs.
  2. Select Run.

Notice that the dataframe is sorted by the average life span in descending order from 14.87 (norwegian forest) to 11.84 (maine coon).

Note: Some columns in Figure 5 are removed to show relevant information. If you are following along with the example, please scroll to the right to see the AVG_life_span_yrs column,

Additional sort steps can be added, and the sort order can also be reversed.

Example of dataframe sorted by average life span

Figure 5: Example of dataframe sorted by average life span

Was this page helpful?