Data Ingestion Best Practices
When working with the C3 Agentic AI Platform, data ingestion is a continuous process that accumulates metadata (for example, SourceStatus, TargetStatus) over time. In shared environments such as DEV/QA/PROD, this metadata can build up significantly, affecting the overall performance of the database and application. This topic focuses on managing metadata accumulation and maintaining optimal database performance by clearing or exporting outdated data on a regular cadence.
Managing and clearing metadata related to data ingestion is essential for maintaining the health and performance of your C3 AI Application. By proactively monitoring database size, establishing a consistent cadence for metadata cleanup, and adhering to platform-specific best practices, you can prevent performance degradation and ensure a smooth-running application environment. Make it a part of your routine operational tasks to avoid any long-term issues related to data accumulation.
Platform-specific best practices
Clearing data integration metadata helps maintain database performance in both production and development environments. Below are specific steps and guidance for investigating the state of PostgreSQL and Cassandra in the C3 Agentic AI Platform; best practices for storing data in each; and example commands for clearing old metadata.
PostgreSQL
In C3 AI, the three key data integration metadata Types are SourceStatus, TransformStatus, and TargetStatus. All extend the DataIntegStatus Type, which means they are stored in a single PostgreSQL table.
In PostgreSQL, tables with over 10 million rows begin to show slowdowns, with significant performance drops above 30 million records. Performance can become near-zero beyond 100 million rows. Use fetchCount() to monitor table sizes.
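As an illustration only (the threshold values come from this document; the helper itself is not a C3 AI API), the row-count guidance above could be encoded in a plain JavaScript check:

```javascript
// Hypothetical helper: classify a PostgreSQL table's row count against the
// guidance in this document (10M rows = slowdowns begin, 30M = significant
// drops, 100M = near-zero performance). Not part of the C3 AI Platform API.
function classifyRowCount(rows) {
  if (rows >= 100_000_000) return 'critical'; // near-zero performance
  if (rows >= 30_000_000) return 'degraded';  // significant performance drops
  if (rows >= 10_000_000) return 'warning';   // slowdowns begin
  return 'ok';
}

// Example: feed in a count obtained from DataIntegStatus.fetchCount()
console.log(classifyRowCount(12_000_000)); // 'warning'
```

A check like this could run alongside the fetchCount() monitoring described below to decide when a cleanup is due.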
Monitor table sizes using fetchCount() – The fetchCount() method allows you to check how many records are in each of the metadata tables. An example command to count all metadata records is shown below:
var count = DataIntegStatus.fetchCount();
console.log(count);
If the count exceeds the recommended threshold (10M+ rows), consider exporting and then deleting old records.
Exporting data – In production environments, you should not delete metadata permanently. Instead, export the data to external storage (for example, AWS S3) and then delete it from PostgreSQL. The C3 Agentic AI Platform provides pre-built Export and Import jobs to save Type records as files. See the Import and Export Data topic for details on configuring these jobs.
In Development/QA environments, it might not be necessary to save all metadata, especially if data loads have been performed multiple times, data records have been deleted, or if maintaining a log of all data loads is not required in these environments.
Clear metadata using delete operations – You can use the removeAll() method to remove specific entries from the DataIntegStatus table, or more specifically, you can choose to only clear records from SourceStatus, TransformStatus, or TargetStatus using the same APIs.
Example to delete entries older than a certain date:
// Removes all DataIntegStatus created before '2020-10-01'
var startDate = DateTime.make('2020-10-01');
DataIntegStatus.removeAll({
  filter: Filter.lt('meta.created', startDate)
}, true);

// Removes all SourceStatus created before '2020-10-01'
var startDate = DateTime.make('2020-10-01');
SourceStatus.removeAll({
  filter: Filter.lt('meta.created', startDate)
}, true);
Before running removeAll() on any table, it is best practice to test the filter in a fetch() or fetchCount() call first to confirm it returns only the records you intend to delete.
// View DataIntegStatus created before '2020-10-01'
var startDate = DateTime.make('2020-10-01');
c3Grid(DataIntegStatus.fetch({
  filter: Filter.lt('meta.created', startDate)
}));

// Count all DataIntegStatus created before '2020-10-01'
var startDate = DateTime.make('2020-10-01');
DataIntegStatus.fetchCount({
  filter: Filter.lt('meta.created', startDate)
});
Cassandra
If you are using Cassandra as the key-value store for your C3 AI application, it is essential to adhere to best practices to ensure optimal performance. Although Cassandra is a wide-column datastore, for the purposes of this document, a 'row' refers to one instance or record of a Type, and a 'column' refers to one field within that record.
Data modeling considerations – Avoid exceeding 1 billion keys per table, and keep the number of records under a single key to around 100,000 to help prevent query failures. Without further optimization, 1 million records per key should be considered the highest limit.
While Cassandra can handle tables with many columns/fields, it is best practice to use narrow tables (10s of columns or less) to help prevent out-of-memory errors and other performance issues, especially during large queries. Lastly, refrain from using stored calculations on Cassandra types.
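Purely as a sketch (the function and its inputs are illustrative, not C3 AI or Cassandra APIs; the limits are the ones stated above), the modeling guidance can be expressed as a validation check:

```javascript
// Hypothetical validator for the Cassandra modeling guidance above:
// at most 1 billion keys per table, ~100,000 records per key (1 million as
// a hard ceiling without further optimization), and narrow tables
// (tens of columns or fewer). Not a real platform API.
function checkCassandraModel({ keysPerTable, recordsPerKey, columns }) {
  const issues = [];
  if (keysPerTable > 1_000_000_000) issues.push('too many keys per table');
  if (recordsPerKey > 1_000_000) issues.push('records per key above hard limit');
  else if (recordsPerKey > 100_000) issues.push('records per key above recommended ~100k');
  if (columns > 50) issues.push('consider a narrower table');
  return issues;
}

console.log(checkCassandraModel({ keysPerTable: 2e9, recordsPerKey: 50_000, columns: 12 }));
// ['too many keys per table']
```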
Sizing considerations – Cassandra's performance scales linearly with the number of nodes. Adding more nodes is necessary for improved performance in cases of high user demand or substantial data volume.
Regarding ring and node sizing, ensure that the disk on each node is no more than half full, which can be viewed using a default Grafana dashboard deployed with C3 AI.
Steps to view Cassandra ring usage in Grafana
To determine the percentage of disk usage in your Cassandra ring using the Grafana dashboard, follow these steps:
- Log in to Grafana.
- Locate the relevant dashboard for your Cassandra metrics.
- Check for panels displaying disk usage data. Ensure that the dashboard is configured correctly to show metrics from your Cassandra cluster.
- Visualize the disk usage to gain insights into the fullness of your ring. If needed, create custom queries to display the required metrics.
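The "no more than half full" rule above can be sketched as a simple calculation (plain JavaScript; the function names and inputs are illustrative, not Grafana or C3 AI APIs):

```javascript
// Hypothetical check for the node-sizing rule above: each Cassandra node's
// disk should be no more than 50% full. Inputs are bytes used and total
// bytes, as a Grafana disk-usage panel would report them.
function diskUsagePercent(usedBytes, totalBytes) {
  return (usedBytes / totalBytes) * 100;
}

function nodeNeedsAttention(usedBytes, totalBytes) {
  return diskUsagePercent(usedBytes, totalBytes) > 50;
}

console.log(nodeNeedsAttention(600e9, 1000e9)); // true: node is 60% full
```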
Automating the cleanup process
Set up a scheduled job or automated script that clears out old records to ensure that the system stays clean over time. You can set up periodic cleanup jobs using C3 AI's job scheduler or through external scripts that interact with the platform's APIs.
You could use a C3 AI CronJob to automate metadata cleanup using a time filter (for example, delete records older than 180 days).
var startDate = DateTime.now().plusDays(-180);
DataIntegStatus.removeAll({
  filter: Filter.lt('meta.created', startDate)
}, true);
This snippet deletes all metadata records older than 180 days. You can then schedule it to run in a CronJob by wrapping it in a C3 AI Type method and creating a CronJob with a schedule, using the following code.
// In <pkg>/src/MetadataDeleter.c3typ
type MetadataDeleter {
  deleteOldObjs: function() js-server
}

// In <pkg>/src/MetadataDeleter.js
function deleteOldObjs() {
  var startDate = DateTime.now().plusDays(-180);
  DataIntegStatus.removeAll({
    filter: Filter.lt('meta.created', startDate)
  }, true);
}
// In <pkg>/seed/CronJob/scheduledMetadataDeleter.json
{
  "id": "scheduledMetadataDeleter",
  "name": "scheduledMetadataDeleter",
  "description": "Deletes all DataIntegStatus older than 180 days. Runs on the first of each month at midnight UTC.",
  "action": {
    "typeName": "MetadataDeleter",
    "actionName": "deleteOldObjs"
  },
  "inactive": false,
  "concurrent": false,
  "runOnLeader": true,
  "scheduleDef": {
    "cronExpression": "0 0 0 1 * ? *",
    "skipOverdue": true
  }
}
To run this cleanup after an Export job in production, you can use a Workflow to schedule this Type method to run after the Export job has completed.
Conclusion
Clearing metadata is critical for maintaining the performance of your C3 AI environment, especially in long-running applications with high data ingestion rates. Monitor and clear metadata in both PostgreSQL and Cassandra environments to avoid slowdowns and storage overflows. For production systems, archive data before deletion to ensure compliance with data governance and audit requirements.