Data Summarization

The data summarization feature of the C3 Agentic AI Platform gives you insights into your data. These insights can help you understand the patterns, distributions, and key characteristics of a dataset, regardless of its size.

When to use

To understand how your data "looks". How your data "looks" is determined by key characteristics such as the percentage of null values in each column, the distribution of values, the most frequent elements, the number of unique elements, and more.
To analyze datasets that cannot be loaded into memory. For example, a data scientist typically cannot load a terabyte of data into a Jupyter notebook because of memory constraints. One solution is to sample the data and act on that sample. But what if you want the entire data context? Because C3 AI's data summarization feature processes data in chunks and in parallel rather than loading it all at once, it can summarize the full dataset instead of a sample.
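
To make the characteristics listed above concrete, here is a minimal sketch in plain Python that computes the null percentage, the number of unique elements, and the most frequent elements for a single column. It is an illustration only and does not use or represent the C3 AI API; the function name summarize_column and its parameters are hypothetical.

```python
from collections import Counter


def summarize_column(values, top_n=3):
    """Summarize one column given as a list; None marks a missing value.

    Hypothetical helper for illustration, not part of any C3 AI API.
    """
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": total,
        # Share of missing values, as a percentage of all rows.
        "null_pct": 100.0 * (total - len(present)) / total if total else 0.0,
        # Number of distinct non-null values.
        "unique": len(counts),
        # The top_n most common non-null values with their counts.
        "most_frequent": counts.most_common(top_n),
    }


summary = summarize_column(["a", "b", "a", None, "c", "a", None, "b"])
print(summary)
```

This version holds the whole column in memory, so it only works for small data. Summarizing larger datasets requires processing the data in pieces.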

Why

Data summarization allows users to acquire data insights using C3 AI's ability to process data in parallel.

For data integration

Data summarization allows data integration teams to perform data quality checks. For example:

Have you transferred all the data?
Did you transfer the right data?
Are you missing too many values?
Does the distribution make sense?
Did data integration work?

Data quality checks help ensure that data scientists work with reliable data before they make key decisions that could affect the final data model.
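
As an illustration of how checks like these might be expressed, the sketch below compares a source and a target table held as lists of dictionaries. It is plain Python under stated assumptions, not the C3 AI API: the function check_transfer, the key and max_null_pct parameters, and the 10% threshold are all hypothetical.

```python
def check_transfer(source_rows, target_rows, key, max_null_pct=10.0):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration. Rows are dictionaries, and `key`
    names the field that uniquely identifies a row.
    """
    problems = []

    # Have you transferred all the data?
    if len(source_rows) != len(target_rows):
        problems.append(
            f"row count mismatch: {len(source_rows)} source vs {len(target_rows)} target"
        )

    # Did you transfer the right data?
    missing = {r[key] for r in source_rows} - {r[key] for r in target_rows}
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Are you missing too many values?
    columns = sorted({c for r in target_rows for c in r})
    for c in columns:
        nulls = sum(1 for r in target_rows if r.get(c) is None)
        pct = 100.0 * nulls / len(target_rows) if target_rows else 0.0
        if pct > max_null_pct:
            problems.append(f"column {c!r} is {pct:.1f}% null")

    return problems


source = [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}, {"id": 3, "temp": 19.8}]
target = [{"id": 1, "temp": 20.5}, {"id": 2, "temp": None}]
problems = check_transfer(source, target, key="id")
for p in problems:
    print(p)
```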

For data science

Data exploration is the process of understanding the key characteristics of a dataset to judge which machine learning models are suitable. Data exploration is a pivotal part of a data scientist's workflow and can impact key decisions. Data summarization allows data scientists to perform initial data exploration by providing various statistics about the dataset in an efficient, distributed way.

How does it work

Data summarization takes advantage of the C3 Agentic AI Platform's ability to process data in parallel. Because the dataset is divided into chunks, objects can be streamed and analyzed in a time- and memory-efficient manner.
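
The general idea of chunked summarization can be sketched in plain Python. The example below is not the platform's implementation, and every name in it (PartialSummary, chunks, summarize_chunk, summarize_stream) is hypothetical. It shows one common approach: compute a small partial summary per chunk, then merge the partial summaries, so that only one chunk is held in memory at a time.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass
class PartialSummary:
    """Running statistics for one numeric column; None marks a missing value."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value):
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        # Merging needs only the two summaries, never the underlying rows.
        return PartialSummary(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def summarize_chunk(chunk):
    summary = PartialSummary()
    for value in chunk:
        summary.update(value)
    return summary


def summarize_stream(values, chunk_size):
    """Summarize a stream chunk by chunk, keeping one chunk in memory at a time."""
    result = PartialSummary()
    for chunk in chunks(values, chunk_size):
        result = result.merge(summarize_chunk(chunk))
    return result


data = [4.0, None, 1.0, 7.0, None, 3.0, 5.0]
s = summarize_stream(iter(data), chunk_size=3)
mean = s.total / (s.count - s.nulls)
print(s, mean)
```

Here the chunks are processed one after another for simplicity. Because each chunk is summarized independently and merge combines any two summaries, the per-chunk step could in principle be handed to parallel workers, which is the property that distributed summarization relies on.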