Work with C3 AI Datasets

The C3 Agentic AI Platform offers a single, unified data interface for all data operations. This interface is called C3 AI Datasets and is available through the Data C3 Type.

Data provides the following major functionalities:

- Pandas-like APIs: Data provides APIs that are similar to Pandas, so users can perform data loading, data exploration, and feature engineering with familiar calls. This minimizes the C3 AI specific knowledge needed to get started on the C3 Agentic AI Platform.
- Execution Engine Flexibility: The C3 AI data interface is decoupled from the back-end execution engine that materializes data. End users can switch between execution engines without modifying any user-written code, and can choose a specific execution engine as the scale and functional requirements of their use case change. Note that the current release supports only the Pandas execution mode.

You can start using C3 AI Datasets with Raw Data or with an existing Application Model.

Examples

Below are a few examples to show how closely the APIs on Data resemble Pandas.

Python

# 1. Choose Execution Engine 
c3.setDefaultExecutionMode('Pandas')

# 2. Use Pandas-like APIs to manipulate data

# Read data from a storage bucket in GCS
df = c3.Data.read_csv('gcs://wind_turbine/measurements.csv', parse_date=['timestamp'])

# Alternatively, read data from a C3 data model
events_df = c3.WindTurbineEvent.eval()

# Resample your DataFrame to an hourly frequency 
df = df.set_index('timestamp').resample('1h').mean()
df_merged = c3.Data.merge_asof(left=df, right=events_df, direction='forward')
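
For comparison, the same resample-and-merge steps can be sketched in plain pandas. This is only an illustration of the pandas calls that the Data APIs resemble, not the C3 AI implementation. The sample values and the column names `power_kw` and `event` are hypothetical. In plain pandas, `merge_asof` needs an explicit key (`on="timestamp"` here) and both inputs sorted by that key.

```python
import pandas as pd

# Hypothetical stand-in for the rows of measurements.csv
measurements = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:10", "2024-01-01 00:40",
        "2024-01-01 01:15", "2024-01-01 01:45",
    ]),
    "power_kw": [100.0, 200.0, 300.0, 500.0],
})

# Hypothetical stand-in for the event records
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:30", "2024-01-01 01:30"]),
    "event": ["startup", "fault"],
})

# Resample to an hourly frequency, averaging the readings in each hour
hourly = measurements.set_index("timestamp").resample("1h").mean()

# Attach to each hourly row the first event at or after its timestamp
merged = pd.merge_asof(
    hourly.reset_index(),
    events.sort_values("timestamp"),
    on="timestamp",
    direction="forward",
)
print(merged)
```

With `direction="forward"`, each hourly row picks up the next event in time rather than the most recent one.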

The platform includes C3 AI Datasets and Studio Data Catalog features, which improve data science experimentation on large data and strengthen data collaboration and data governance.

C3 AI Datasets are persisted tabular data stored in the C3 AI Files System "Data Lake" and are based on the industry-standard Apache Iceberg open table format. C3 AI Datasets can be searched and previewed based on metadata associated with a dataset using the new Studio Data Catalog, shown as Data Lakehouse in C3 AI Studio.

C3 AI Datasets can be loaded, created, or updated:
- From Jupyter, by saving a pandas DataFrame to a dataset
- From Visual Notebooks, by using the save dataset feature
- From file upload, using the UI of the new Studio Data Catalog
- Programmatically, using the C3 AI Data Interface
C3 AI Datasets created in Jupyter can be updated in Visual Notebooks and vice versa, supporting an integrated experience between data exploration and data science tools.
C3 AI Datasets can be created or updated in production without modifying package metadata, offering a powerful data science alternative to free-form CSV or unmanaged files in the file system.

Overview of Data Tables

Creating a New Table

Viewing C3 AI Datasets Metadata

C3 AI Datasets include customizable metadata descriptors useful for search and record keeping. Metadata captures standard fields, such as the dataset's lineage, update history, and creator, along with custom descriptors that can be updated and made specific to a use case. The new Studio Data Catalog simplifies discovery of C3 AI Datasets by allowing search across metadata fields, including custom fields.
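
As a rough illustration of searching across both standard and custom metadata fields, the sketch below uses plain Python dictionaries. It is not the Studio Data Catalog API. The function `search_catalog`, the `custom` key, and the sample entries are all hypothetical.

```python
def search_catalog(catalog, query):
    """Return names of datasets whose metadata mentions `query` in any field."""
    needle = query.lower()
    matches = []
    for entry in catalog:
        # Flatten standard fields and custom descriptors into one mapping
        fields = {k: v for k, v in entry.items() if k != "custom"}
        fields.update(entry.get("custom", {}))
        # Case-insensitive substring match against every field value
        if any(needle in str(value).lower() for value in fields.values()):
            matches.append(entry["name"])
    return matches


# Hypothetical catalog entries: standard fields plus custom descriptors
catalog = [
    {"name": "turbine_measurements", "creator": "alice",
     "custom": {"site": "North Ridge"}},
    {"name": "turbine_events", "creator": "bob",
     "custom": {"site": "South Bay"}},
]

by_site = search_catalog(catalog, "north ridge")
by_creator = search_catalog(catalog, "bob")
print(by_site, by_creator)
```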
Querying or operating on C3 AI Datasets using the C3 Agentic AI Platform's Spark execution engine offers a high degree of performance and scalability. It can be used to persist data science experiments on terabyte-scale datasets, including experimentation on large volumes of time-series data.
- C3 AI Datasets include features to optimize query performance, such as columnar storage, compression techniques, and fast scan planning.
C3 AI Datasets are updatable and versioned. New data may be appended to a dataset, with the ability to roll back to prior versions of a dataset.
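
The append-and-roll-back behavior can be pictured with a toy snapshot model. This is a conceptual sketch only. It is neither the C3 AI Data Interface nor Apache Iceberg's implementation, and the class `VersionedTable` and its methods are hypothetical names.

```python
class VersionedTable:
    """Toy table that keeps one immutable snapshot per version."""

    def __init__(self):
        self._snapshots = [()]  # version 0 is the empty table
        self._current = 0

    def append(self, rows):
        """Create a new version holding the current rows plus `rows`."""
        new_snapshot = self._snapshots[self._current] + tuple(rows)
        self._snapshots.append(new_snapshot)
        self._current = len(self._snapshots) - 1
        return self._current

    def rollback(self, version):
        """Make an earlier version current again; later snapshots are kept."""
        if not 0 <= version < len(self._snapshots):
            raise ValueError(f"unknown version: {version}")
        self._current = version

    def rows(self):
        """Return the rows visible in the current version."""
        return list(self._snapshots[self._current])


table = VersionedTable()
v1 = table.append([{"turbine": "T1", "power_kw": 100.0}])
v2 = table.append([{"turbine": "T2", "power_kw": 250.0}])

table.rollback(v1)  # the second append is no longer visible
print(table.rows())
```

Because every append produces a new snapshot instead of mutating the old one, rolling back only changes which snapshot is treated as current.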
C3 AI Datasets include built-in data governance features important to customers managing sensitive data, potentially including personally identifiable information (PII).
C3 AI Datasets can be kept private or shared with colleagues through a C3 AI user group.
C3 AI Datasets offer an optional approval workflow for:
- Sharing datasets, whereby a "disclosure review administrator" may approve or decline a request to share.
- Downloading a dataset to desktop, protecting sensitive data from exfiltration.
All sharing actions are logged, ensuring full data governance auditability and traceability.
