Understanding Data Fusion Pipelines

In Data Fusion, a pipeline represents a composable flow of nodes that ingest, transform, and deliver data. Pipelines are not constrained to predefined shapes or fixed configurations. Instead, they provide a flexible canvas where you can combine source systems, transformations, and targets to match your data integration needs.

While pipelines can be assembled in many ways, you can typically begin from a small set of common setup patterns. These patterns represent frequently used starting points, not exclusive pipeline types. You can freely add, remove, or reorder nodes as your use case evolves.

The Data Fusion Pipeline Model

At a high level, every pipeline is built from the same building blocks:

  • Source nodes
    • Define where data originates (files, streams, databases, document repositories).
  • Processing nodes
    • Apply filtering, mapping, transformation, parsing, enrichment, or embedding logic.
  • Target nodes
    • Persist data into entities or downstream systems.

The platform does not limit the number or combination of nodes in a pipeline. A single pipeline can contain multiple transforms and parallel branches. Both structured and unstructured flows are supported, but each must be implemented in its own pipeline.
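The building blocks above can be sketched in plain Python. This is a conceptual illustration only, not the Data Fusion API: the `Pipeline` class, its fields, and the sample functions are all hypothetical names chosen for the example. It shows one source feeding a chain of processing steps, with the result fanned out to two targets.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Record = dict  # a record is modeled as a plain dict in this sketch


@dataclass
class Pipeline:
    """Hypothetical model: one source, any number of processors and targets."""
    source: Callable[[], Iterable[Record]]
    processors: List[Callable[[Iterable[Record]], Iterable[Record]]] = field(default_factory=list)
    targets: List[Callable[[List[Record]], None]] = field(default_factory=list)

    def run(self) -> None:
        records = self.source()
        for step in self.processors:      # processing nodes run in order
            records = step(records)
        materialized = list(records)
        for target in self.targets:       # every target receives the same output
            target(materialized)


def sensor_source():
    return [
        {"id": 1, "temp_c": 21.5},
        {"id": 2, "temp_c": None},
        {"id": 3, "temp_c": 30.0},
    ]


def drop_missing(records):
    # Filtering step: discard records without a reading.
    return (r for r in records if r["temp_c"] is not None)


def add_fahrenheit(records):
    # Transformation step: derive a new field from an existing one.
    return ({**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in records)


entity_store = []  # stands in for a target entity
audit_log = []     # stands in for a second, parallel target

pipeline = Pipeline(
    source=sensor_source,
    processors=[drop_missing, add_fahrenheit],
    targets=[entity_store.extend, lambda rs: audit_log.append(len(rs))],
)
pipeline.run()
```

Adding, removing, or reordering entries in `processors` changes the flow without touching the source or the targets, which is the kind of composition the model describes.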

Common Pipeline Setup Patterns

The following patterns illustrate typical ways users start configuring pipelines. They are provided as guidance and examples, not as exhaustive or restrictive options.

Structured Data Pipelines

Structured pipelines are commonly used when ingesting tabular or schema-defined data.

File-based Structured Ingestion

This pattern typically begins with a file source, followed by transformation and mapping into structured entities.

Common use cases:

  • Ingesting structured files such as CSV, JSON, Parquet, or Avro
  • Performing batch ingestion of structured data from file systems
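As a rough illustration of the file-to-entity flow, the sketch below reads CSV rows and maps them into typed records using only the Python standard library. The column names, the `to_entity` helper, and the entity field names are assumptions made for the example; they do not come from Data Fusion.

```python
import csv
import io

# In-memory stand-in for a CSV file from a file source.
CSV_TEXT = "asset_id,reading,unit\nA-1,10.5,kW\nA-2,7.25,kW\n"


def read_rows(file_obj):
    """Source step: yield one dict per CSV row, keyed by the header."""
    return csv.DictReader(file_obj)


def to_entity(row):
    """Mapping step: rename fields and convert the reading to a float."""
    return {
        "assetId": row["asset_id"],
        "reading": float(row["reading"]),
        "unit": row["unit"],
    }


entities = [to_entity(row) for row in read_rows(io.StringIO(CSV_TEXT))]
```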

Cloud or Stream-based Structured Ingestion

This pattern starts from a streaming or cloud source and continuously processes incoming records into structured entities in near real time.

Common use cases:

  • Ingesting event streams (for example, message brokers or streaming platforms)

  • Processing cloud-native data feeds that emit structured records over time
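The record-at-a-time nature of this pattern can be sketched with a Python generator. The `event_stream` function below is a hypothetical stand-in for a message broker, and skipping malformed messages is a design choice made for this example, not documented platform behavior.

```python
import json
from typing import Iterator


def event_stream() -> Iterator[str]:
    """Hypothetical stand-in for a broker that emits raw messages over time."""
    yield '{"device": "d1", "value": 3}'
    yield "not-json"
    yield '{"device": "d2", "value": 8}'


def process(stream):
    """Handle each message as it arrives instead of waiting for a full batch."""
    for raw in stream:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # assumption: malformed messages are skipped
        yield {"device": event["device"], "value": event["value"]}


structured = list(process(event_stream()))
```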

External Database Access - Virtual (No Ingestion)

This pattern connects to a SQL source system and queries data in place, without copying it into the platform.

Common use cases:

  • Accessing data through virtual types for real-time querying

  • Using reference or lookup data directly from the source system
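The difference between virtual access and ingestion can be illustrated with an in-memory SQLite database standing in for the external source. The table, the `lookup_country` helper, and the sample data are hypothetical. Because nothing is copied, each lookup reflects the current state of the source.

```python
import sqlite3

# In-memory database used as a stand-in for an external SQL source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("DE", "Germany"), ("JP", "Japan")],
)


def lookup_country(code):
    """Query the source directly on every call; no local copy is kept."""
    row = conn.execute(
        "SELECT name FROM country WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None


before = lookup_country("JP")
conn.execute("UPDATE country SET name = 'Nippon' WHERE code = 'JP'")
after = lookup_country("JP")  # the change is visible without any re-ingestion
```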

External Database Ingestion (With CDC)

This pattern enables change data capture (CDC) on a SQL source system, allowing incremental inserts and updates to flow through the pipeline.

Common use cases:

  • Near-real-time propagation of inserts and updates from source systems

  • Incremental data processing based on detected changes
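Conceptually, CDC delivers a sequence of change events that are applied to the target in order, so only the changed rows are processed. The event shape (`op`, `key`, `row`) and the `apply_changes` function below are assumptions for illustration; the actual change format depends on the source system and the platform.

```python
def apply_changes(target, events):
    """Apply insert and update events to a keyed target, in arrival order."""
    for event in events:
        if event["op"] == "insert":
            target[event["key"]] = dict(event["row"])
        elif event["op"] == "update":
            # Merge only the changed columns into the existing row.
            target.setdefault(event["key"], {}).update(event["row"])
    return target


# Hypothetical change events captured from a source table.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "pump", "status": "ok"}},
    {"op": "insert", "key": 2, "row": {"name": "valve", "status": "ok"}},
    {"op": "update", "key": 1, "row": {"status": "fault"}},
]

target_table = apply_changes({}, changes)
```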

Unstructured Data Pipelines

Unstructured pipelines are designed for document-centric workflows where data does not conform to a fixed schema.

These pipelines commonly include:

  • File or document sources

  • Document parsers

  • Metadata extractors

  • Entity extractors

  • Optional embedding models

  • Target entities for downstream analytics or search

You may include metadata extraction alone, or combine metadata extraction with embedding generation, depending on your use case.
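The choice between metadata extraction alone and metadata plus embeddings can be sketched as an optional step in a small function chain. Every name here is hypothetical, and `toy_embedding` is a hash-based placeholder that only mimics the shape of a model's output; it carries no semantic meaning.

```python
import hashlib


def parse_document(raw):
    """Parser step: collapse whitespace and drop empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_metadata(text):
    """Metadata step: use the first line as the title and count words."""
    return {"title": text.splitlines()[0], "word_count": len(text.split())}


def toy_embedding(text, dims=4):
    """Placeholder for an embedding model: a fixed-length vector from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def run_unstructured(raw, with_embedding=False):
    text = parse_document(raw)
    record = extract_metadata(text)
    if with_embedding:  # the embedding step is optional
        record["embedding"] = toy_embedding(text)
    return record


RAW = "Maintenance Report\n\nPump  P-7 was   inspected.\nNo faults found."
metadata_only = run_unstructured(RAW)
with_vectors = run_unstructured(RAW, with_embedding=True)
```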

Flexibility Beyond the Examples

These patterns are meant to illustrate common entry points, not to define limits. In practice, you can:

  • Add any number of transformation nodes

  • Extend or reshape pipelines as requirements change

Data Fusion supports free-form composition, allowing pipelines to evolve with your data and business needs.
