C3 AI Documentation Home

Understanding Change Data Capture (CDC) in Data Fusion

Change Data Capture (CDC) in C3 AI Studio enables incremental ingestion of changes from SQL source tables. Instead of reprocessing the entire dataset, CDC retrieves only rows whose values have changed since the previous pipeline run.

In Data Fusion 8.10, CDC is supported only for SQL-based data sources and requires one or more monotonically increasing columns, typically a timestamp column, to detect new or updated records.

CDC operates during pipeline execution. When the pipeline runs—either manually or on a configured schedule—the system retrieves rows where the tracking column value is greater than the value processed in the previous run.

When CDC is enabled, the pipeline processes data incrementally and stages the extracted data for downstream processing.

Enabling CDC

To enable CDC, configure the following:

  • A change detection column (a monotonically increasing field, typically a timestamp)

  • An optional integration schedule to automate pipeline execution

Change Behavior

CDC retrieves rows based on tracking column comparisons and does not distinguish between operation types.

  • Inserts and updates are captured when they result in a new tracking column value

  • Deletes are not captured, as no new tracking value is generated

Notes on Execution and Data Handling

  • CDC processing occurs only when the pipeline is executed (manually or on schedule)

  • Data is processed incrementally based on the last recorded checkpoint

  • Extracted data is staged for downstream ingestion and transformation

Supported Capabilities

Source Types

CDC is supported only for SQL-based connectors, including:

  • Snowflake
  • PostgreSQL
  • SQL Server
  • MySQL

Change Detection

CDC identifies changes using one or more monotonically increasing ordering fields configured for the Source Collection.

Typically, this includes:

  • A timestamp column (for example, updated_at or last_modified)
  • An optional ID column to ensure deterministic ordering when timestamps are identical

During pipeline execution, rows are retrieved when their ordering field values are greater than the last processed checkpoint.

Change Behavior

CDC does not track database operation types (INSERT, UPDATE, DELETE). Instead, it detects rows with newer ordering column values.

  • Newly inserted rows are captured when they have a new ordering value
  • Updated rows are captured when the ordering column is updated
  • Delete operations are not captured, as no new ordering value is generated

Data Handling

When CDC is enabled, the pipeline processes data incrementally and stages the extracted data for downstream processing.

Pipeline Execution

CDC processing occurs when the Data Integration pipeline runs:

  • Manual execution — triggered by the user
  • Scheduled execution — runs automatically based on a configured integration schedule

Each run processes only the data that has changed since the previous checkpoint.

UI Support

  • CDC is configured through the Data Fusion interface
  • CDC requires at least one monotonically increasing column to be specified for change detection

Prerequisite: Source Table Must Include a Monotonically Increasing Column

Before you can enable Change Data Capture (CDC) for a source table, the table should include a column whose values can be used to track the order of changes over time.

Typically, this column contains timestamp or datetime values, but any column with monotonically increasing or otherwise orderable values (such as an auto-incrementing integer) can also be used. This column is configured as the change tracking column, which the system uses to detect and synchronize new or updated rows from the source into the platform.

CDC compares the values in this column against the last processed value from previous runs to determine which records have changed since the last ingestion.

If a suitable ordered column is not available, CDC may still run, but the system cannot guarantee the uniqueness or correctness of the incremental data captured in the resulting target entity.

Example of a Valid Schema for CDC

Column NameData TypeDescription
IDINTEGERPrimary key for the row
CUSTOMER_NAMESTRINGName of the customer
LAST_UPDATED_TSTIMESTAMPMonotonically increasing column indicating when the row was last modified

In this example, LAST_UPDATED_TS is used as the change detection (ordering) column for CDC.

Delete Operations Not Propagated Through CDC

In Data Fusion, Change Data Capture (CDC) identifies changes by comparing the values of a configured tracking column (typically a timestamp such as last_updated_ts) against the last processed checkpoint.

During each pipeline run, CDC retrieves all rows where the tracking column value is greater than the last recorded checkpoint. This typically includes newly inserted rows and updated rows where the tracking column has changed.

CDC does not distinguish between insert and update operations, nor does it detect the type of operation performed in the source system. It only evaluates whether a row has a newer tracking column value.

As a result, delete operations are not captured. If a row is deleted in the source table, the corresponding record in downstream systems remains unchanged, since no new tracking value is generated for deleted records.

This behavior is expected, as CDC in Data Fusion is based on incremental comparison of tracking column values rather than full state synchronization.

If your use case requires handling deletions, consider one of the following approaches:

  • Use full-table reload integration to periodically reconcile the dataset

  • Implement downstream reconciliation or cleanup logic to remove stale records

Checkpoint Reset and Historical Data Recovery

CDC relies on a checkpoint that tracks the most recent value of the configured ordering column processed during the previous pipeline run.

During normal operation, only rows with tracking column values greater than this checkpoint are retrieved. As a result, if updates were missed in earlier runs (for example, due to pipeline failures or late-arriving data), those records will not be captured in subsequent runs because their tracking values are less than or equal to the stored checkpoint.

To recover such data, users can reset the checkpoint and remove previously ingested data before re-running the pipeline. This forces the system to reprocess data from an earlier point in time, allowing it to capture historical records that were not ingested in prior runs.

Transformations and CDC

In Data Fusion, transformations are part of the data integration pipeline and are not exclusive to Change Data Capture (CDC).

When CDC is enabled for a Source Collection, the pipeline operates in a staged ingestion mode where data is read incrementally and written to a target entity. In this mode, at least one transformation must be configured to map data from the source (external entity) to the target schema.

These transformations define how source fields are mapped, renamed, or reshaped before being written to the staging layer.

When CDC is not enabled, the Source Collection provides direct access to the source data through a system-generated external entity. In this case:

  • Data is accessed virtually from the source system

  • No staging or materialized target is created

  • Transformations are not required

The output reflects the raw schema and column structure of the source table.

How Incremental Ingestion Works in CDC

Once CDC is enabled on a SQL Source Collection, the pipeline retrieves data incrementally by comparing the values of the configured ordering (tracking) column against the last processed checkpoint.

During each pipeline run, only rows with ordering column values greater than the previous checkpoint are retrieved and processed.

Example

A source SQL table contains 10,000 rows.

CDC is configured using a monotonically increasing column (for example, LAST_UPDATED_TS).

At the time of the next pipeline run, 15 rows have newer values in the tracking column.

Result with CDC

Instead of reprocessing the full dataset, the pipeline retrieves only those 15 rows and stages them for downstream processing.

This approach reduces unnecessary data movement and improves pipeline efficiency by avoiding full-table reprocessing during each run.

Was this page helpful?