Understanding Change Data Capture (CDC) in Data Fusion
Change Data Capture (CDC) in C3 AI Studio enables incremental ingestion of changes from SQL source tables. Instead of reprocessing the entire dataset, CDC retrieves only rows whose values have changed since the previous pipeline run.
In Data Fusion 8.10, CDC is supported only for SQL-based data sources and requires one or more monotonically increasing columns, typically a timestamp column, to detect new or updated records.
CDC operates during pipeline execution. When the pipeline runs—either manually or on a configured schedule—the system retrieves rows where the tracking column value is greater than the value processed in the previous run.
When CDC is enabled, the pipeline processes data incrementally and stages the extracted data for downstream processing.
Enabling CDC
To enable CDC, configure the following:
A change detection column (a monotonically increasing field, typically a timestamp)
An optional integration schedule to automate pipeline execution
Change Behavior
CDC retrieves rows based on tracking column comparisons and does not distinguish between operation types.
Inserts and updates are captured when they result in a new tracking column value
Deletes are not captured, as no new tracking value is generated
Notes on Execution and Data Handling
CDC processing occurs only when the pipeline is executed (manually or on schedule)
Data is processed incrementally based on the last recorded checkpoint
Extracted data is staged for downstream ingestion and transformation
Supported Capabilities
Source Types
CDC is supported only for SQL-based connectors, including:
- Snowflake
- PostgreSQL
- SQL Server
- MySQL
Change Detection
CDC identifies changes using one or more monotonically increasing ordering fields configured for the Source Collection.
Typically, this includes:
- A timestamp column (for example, updated_at or last_modified)
- An optional ID column to ensure deterministic ordering when timestamps are identical
During pipeline execution, rows are retrieved when their ordering field values are greater than the last processed checkpoint.
Change Behavior
CDC does not track database operation types (INSERT, UPDATE, DELETE). Instead, it detects rows with newer ordering column values.
- Newly inserted rows are captured when they have a new ordering value
- Updated rows are captured when the ordering column is updated
- Delete operations are not captured, as no new ordering value is generated
Data Handling
When CDC is enabled, the pipeline processes data incrementally and stages the extracted data for downstream processing.
Pipeline Execution
CDC processing occurs when the Data Integration pipeline runs:
- Manual execution — triggered by the user
- Scheduled execution — runs automatically based on a configured integration schedule
Each run processes only the data that has changed since the previous checkpoint.
UI Support
- CDC is configured through the Data Fusion interface
- CDC requires at least one monotonically increasing column to be specified for change detection
Prerequisite: Source Table Must Include a Monotonically Increasing Column
Before you can enable Change Data Capture (CDC) for a source table, the table should include a column whose values can be used to track the order of changes over time.
Typically, this column contains timestamp or datetime values, but any column with monotonically increasing or otherwise orderable values (such as an auto-incrementing integer) can also be used. This column is configured as the change tracking column, which the system uses to detect and synchronize new or updated rows from the source into the platform.
CDC compares the values in this column against the last processed value from previous runs to determine which records have changed since the last ingestion.
If a suitable ordered column is not available, CDC may still run, but the system cannot guarantee the uniqueness or correctness of the incremental data captured in the resulting target entity.
Example of a Valid Schema for CDC
| Column Name | Data Type | Description |
|---|---|---|
| ID | INTEGER | Primary key for the row |
| CUSTOMER_NAME | STRING | Name of the customer |
| LAST_UPDATED_TS | TIMESTAMP | Monotonically increasing column indicating when the row was last modified |
In this example, LAST_UPDATED_TS is used as the change detection (ordering) column for CDC.
The source table must include at least one column whose values increase when new rows are inserted or existing rows are updated.This is typically implemented using a timestamp column, but any monotonically increasing column can be used for change detection. If no such column exists, CDC cannot reliably identify new or updated records, and incremental ingestion cannot be configured for that table.
Delete Operations Not Propagated Through CDC
In Data Fusion, Change Data Capture (CDC) identifies changes by comparing the values of a configured tracking column (typically a timestamp such as last_updated_ts) against the last processed checkpoint.
During each pipeline run, CDC retrieves all rows where the tracking column value is greater than the last recorded checkpoint. This typically includes newly inserted rows and updated rows where the tracking column has changed.
CDC does not distinguish between insert and update operations, nor does it detect the type of operation performed in the source system. It only evaluates whether a row has a newer tracking column value.
As a result, delete operations are not captured. If a row is deleted in the source table, the corresponding record in downstream systems remains unchanged, since no new tracking value is generated for deleted records.
This behavior is expected, as CDC in Data Fusion is based on incremental comparison of tracking column values rather than full state synchronization.
If your use case requires handling deletions, consider one of the following approaches:
Use full-table reload integration to periodically reconcile the dataset
Implement downstream reconciliation or cleanup logic to remove stale records
Checkpoint Reset and Historical Data Recovery
CDC relies on a checkpoint that tracks the most recent value of the configured ordering column processed during the previous pipeline run.
During normal operation, only rows with tracking column values greater than this checkpoint are retrieved. As a result, if updates were missed in earlier runs (for example, due to pipeline failures or late-arriving data), those records will not be captured in subsequent runs because their tracking values are less than or equal to the stored checkpoint.
To recover such data, users can reset the checkpoint and remove previously ingested data before re-running the pipeline. This forces the system to reprocess data from an earlier point in time, allowing it to capture historical records that were not ingested in prior runs.
Checkpoint Reset and Data Reprocessing CDC uses a checkpoint to track the most recent value of the configured ordering column processed during the last pipeline run.
Transformations and CDC
In Data Fusion, transformations are part of the data integration pipeline and are not exclusive to Change Data Capture (CDC).
When CDC is enabled for a Source Collection, the pipeline operates in a staged ingestion mode where data is read incrementally and written to a target entity. In this mode, at least one transformation must be configured to map data from the source (external entity) to the target schema.
These transformations define how source fields are mapped, renamed, or reshaped before being written to the staging layer.
When CDC is not enabled, the Source Collection provides direct access to the source data through a system-generated external entity. In this case:
Data is accessed virtually from the source system
No staging or materialized target is created
Transformations are not required
The output reflects the raw schema and column structure of the source table.
How Incremental Ingestion Works in CDC
Once CDC is enabled on a SQL Source Collection, the pipeline retrieves data incrementally by comparing the values of the configured ordering (tracking) column against the last processed checkpoint.
During each pipeline run, only rows with ordering column values greater than the previous checkpoint are retrieved and processed.
Example
A source SQL table contains 10,000 rows.
CDC is configured using a monotonically increasing column (for example, LAST_UPDATED_TS).
At the time of the next pipeline run, 15 rows have newer values in the tracking column.
Result with CDC
Instead of reprocessing the full dataset, the pipeline retrieves only those 15 rows and stages them for downstream processing.
This approach reduces unnecessary data movement and improves pipeline efficiency by avoiding full-table reprocessing during each run.