Data Summarization

The data summarization feature of the C3 Agentic AI Platform lets you obtain insights about your data. These insights help you understand patterns, distributions, and key characteristics of your datasets, regardless of their size.

When to use

  • To understand how your data "looks", that is, its key characteristics: the percentage of null values in each column, the distribution of values, the most frequent elements, the number of unique elements, and more.
  • To analyze datasets that cannot be loaded into memory. For example, a data scientist typically cannot load a terabyte of data into a Jupyter notebook because of memory constraints. One workaround is to sample the data and work on the sample, but a sample does not give you the context of the entire dataset. Because data summarization processes data in chunks and in parallel, it can analyze the full dataset even when that dataset is far larger than memory.

Why

Data summarization allows users to acquire data insights using C3 AI's ability to process data in parallel.

For data integration

Data summarization allows data integration teams to perform data quality checks. For example:

  • Have you transferred all the data?
  • Did you transfer the right data?
  • Are you missing too many values?
  • Does the distribution make sense?
  • Did data integration work?

Data quality checks help ensure that data scientists act on reliable data before they make key decisions that could affect the final data model.
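The questions above can be turned into automated checks once summary statistics are available. The following is a minimal sketch in plain JavaScript, not a C3 AI API: the function checkQuality, the input shape (rowCount, nullPercentByField), and the thresholds are all hypothetical and chosen for illustration only.

```javascript
// Hypothetical helper, not part of the C3 Agentic AI Platform.
// Assumed input shape: { rowCount: number, nullPercentByField: { fieldName: number } }
function checkQuality(summary, expectedRowCount, maxNullPercent) {
  const problems = [];

  // "Have you transferred all the data?"
  if (summary.rowCount !== expectedRowCount) {
    problems.push(
      `row count ${summary.rowCount} does not match expected ${expectedRowCount}`
    );
  }

  // "Are you missing too many values?"
  for (const [field, pct] of Object.entries(summary.nullPercentByField)) {
    if (pct > maxNullPercent) {
      problems.push(`${field} is ${pct}% null (limit ${maxNullPercent}%)`);
    }
  }
  return problems;
}

// Example with made-up numbers: 10 rows are missing and one field exceeds the null limit.
const qualityIssues = checkQuality(
  { rowCount: 990, nullPercentByField: { wellId: 0, pressure: 12.5 } },
  1000,
  5
);
console.log(qualityIssues.length); // 2
```

In practice, the inputs would come from the statistics that data summarization returns, mapped into whatever shape your checks expect.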

For data science

Data exploration is the process of understanding key characteristics of a dataset to judge what machine learning models are suitable. Data exploration is a pivotal part of a data scientist's workflow and can impact key decisions. Data summarization allows data scientists to perform initial data exploration by providing various statistics about the dataset in an efficient and distributed way.

How does it work

Data summarization takes advantage of the C3 Agentic AI Platform's ability to process data in parallel. By dividing a dataset into chunks, the platform can stream objects and compute statistics in a time- and memory-efficient manner.
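To illustrate why chunking makes this possible, the sketch below computes a partial summary per chunk and then merges the partial summaries. It is plain JavaScript written for illustration and is not the platform's implementation; the function names (summarizeChunk, mergeSummaries, nullPercentOf) and the sample data are assumptions.

```javascript
// Illustrative only: not the C3 AI implementation.

// Summarize one chunk of rows for a single field.
function summarizeChunk(rows, field) {
  const summary = { count: 0, nullCount: 0, frequencies: new Map() };
  for (const row of rows) {
    summary.count += 1;
    const value = row[field];
    if (value === null || value === undefined) {
      summary.nullCount += 1;
    } else {
      summary.frequencies.set(value, (summary.frequencies.get(value) || 0) + 1);
    }
  }
  return summary;
}

// Merge two partial summaries. Merging is associative, so chunks can be
// summarized independently (for example, in parallel) and combined in any order.
function mergeSummaries(a, b) {
  const frequencies = new Map(a.frequencies);
  for (const [value, n] of b.frequencies) {
    frequencies.set(value, (frequencies.get(value) || 0) + n);
  }
  return {
    count: a.count + b.count,
    nullCount: a.nullCount + b.nullCount,
    frequencies,
  };
}

function nullPercentOf(summary) {
  return summary.count === 0 ? 0 : (100 * summary.nullCount) / summary.count;
}

// Two small chunks standing in for pieces of a much larger dataset.
const chunks = [
  [{ grade: 'light' }, { grade: null }],
  [{ grade: 'heavy' }, { grade: 'light' }],
];
const total = chunks
  .map((chunk) => summarizeChunk(chunk, 'grade'))
  .reduce(mergeSummaries);

console.log(nullPercentOf(total));   // 25
console.log(total.frequencies.size); // 2
```

Only one chunk needs to be in memory at a time. Note that this sketch keeps exact frequency counts, which grow with the number of unique values; that growth is one reason a summarization system might report approximate statistics instead.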

Data summarization on a file

To summarize a single file, create an ObjDigest from a File instance with ObjDigest.fromFile().

JavaScript
// f is assumed to be an existing File instance
fileDigest = ObjDigest.fromFile(f)
c3Viz(fileDigest) // select fieldStats to see all fields, then select a field to see its stats
fileDigest.nullPercent() // returns the percentage of null entries passed into fileDigest
fileDigest.fieldStats().get("numericField").approxHistogram() // returns an approximate histogram object for "numericField"
fileDigest.fieldStats().get("stringField").approxUniqueCount() // returns the approximate unique count of "stringField"
fileDigest.fieldStats().get("stringField").approxMostFrequentElements() // returns the approximate most frequent elements of "stringField"
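Several of the statistics above are approximate. This page does not describe how the platform computes them, so the sketch below should not be read as the platform's algorithm. It shows one well-known streaming technique, the Misra-Gries summary, to give a sense of how frequent elements can be estimated in a single pass with bounded memory. The function name misraGries and the sample stream are assumptions made for illustration.

```javascript
// Misra-Gries frequent-items summary: a general technique, not a C3 AI API.
// Keeps at most k - 1 counters. Every item that occurs more than n / k times
// in a stream of length n is guaranteed to be among the surviving keys, and
// each stored count is a lower bound on the item's true count.
function misraGries(stream, k) {
  const counters = new Map();
  for (const item of stream) {
    if (counters.has(item)) {
      counters.set(item, counters.get(item) + 1);
    } else if (counters.size < k - 1) {
      counters.set(item, 1);
    } else {
      // No free counter: decrement every counter and drop those that reach zero.
      for (const [key, n] of counters) {
        if (n === 1) {
          counters.delete(key);
        } else {
          counters.set(key, n - 1);
        }
      }
    }
  }
  return counters;
}

// 'a' occurs 4 times out of 7, which is more than 7 / 3, so it must survive with k = 3.
const top = misraGries(['a', 'b', 'a', 'c', 'a', 'd', 'a'], 3);
console.log(top.has('a')); // true
console.log(top.get('a')); // 3 (an underestimate of the true count, 4)
```

The trade-off is typical of approximate summaries: memory stays fixed no matter how many distinct values appear, at the cost of counts that are estimates rather than exact values.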

Data summarization on a FileSourceCollection

  1. Put your data in a FileSourceCollection. For example, suppose you want to profile oil data. You have a CSV file called OilTest.csv, and you have created a Canonical type, CanonicalOilTest, for that CSV file. After you place the Canonical type file in the correct directory and provision the package, run the following commands:
JavaScript
FileSourceSystem.create({
    id: 'fss',
    rootUrlOverride: '<file location>'
})
// File location could be file:///data_load/ or s3:///bucket_name/
// Make sure the OilTest.csv file is located at the specified <file location>
fsc = FileSourceCollection.make({
    name: 'CanonicalOilTest',
    id: 'CanonicalOilTest',
    processMode: 'MANUAL',
    sourceSystem: 'fss',
    source: {type:"TypeRef", typeName:"CanonicalOilTest"}
})
fsc.upsert()
  2. Start a data summarization batch job. Run the following commands:
JavaScript
// Again, substitute the name of your own Canonical type
fsc = FileSourceCollection.get("CanonicalOilTest")
job = fsc.summarize()
job.start()

Refer to c3ShowType(FileSourceCollection) for more information about FileSourceCollection.

  3. Monitor your batch job. Call job.status() or c3Grid(InvalidationQueue.countAll()) in the console to check progress. You can also use c3QErrs(BatchQueue) to debug errors.

  4. Review the results. Call job.fetchResults(), which can return a PersistableObjDigest with statistics. See c3ShowType(PersistableObjDigest) for details on the result type. You can also fetch previous data summarization results using PersistableObjDigest.fetch().
