Understanding Data Fusion Pipelines
In Data Fusion, a pipeline represents a composable flow of nodes that ingest, transform, and deliver data. Pipelines are not constrained to predefined shapes or fixed configurations. Instead, they provide a flexible canvas where you can combine source systems, transformations, and targets to match your data integration needs.
While pipelines can be assembled in many ways, you can typically begin from a small set of common setup patterns. These patterns represent frequently used starting points, not exclusive pipeline types. You can freely add, remove, or reorder nodes as your use case evolves.
The Data Fusion Pipeline Model
At a high level, every pipeline is built from the same building blocks:
Source nodes
- Define where data originates (files, streams, databases, document repositories).
Processing nodes
- Apply filtering, mapping, transformation, parsing, enrichment, or embedding logic.
Target nodes
- Persist data into entities or downstream systems.
The platform does not limit the number of nodes in a pipeline, and a single pipeline can contain multiple transforms and parallel branches. Both structured and unstructured flows are supported, but each must be implemented in its own pipeline.
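The source, processing, and target roles described above can be sketched in plain Python. This is only an illustrative model, not the Data Fusion API: the `Pipeline` class, its `add_step` and `run` methods, and the sample records are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


@dataclass
class Pipeline:
    """Hypothetical model: one source, any number of processing steps, one target."""
    source: Callable[[], Iterable[Record]]
    target: Callable[[Record], None]
    steps: List[Callable[[Iterable[Record]], Iterable[Record]]] = field(default_factory=list)

    def add_step(self, step):
        # Processing nodes can be appended freely; returning self allows chaining.
        self.steps.append(step)
        return self

    def run(self) -> int:
        stream = self.source()
        for step in self.steps:
            stream = step(stream)      # each step wraps the previous stream
        delivered = 0
        for record in stream:
            self.target(record)        # the target persists each record
            delivered += 1
        return delivered


def only_active(records):
    # Filtering step: keep records flagged as active.
    return (r for r in records if r.get("active"))


def to_customer(records):
    # Mapping step: rename fields to match the (hypothetical) target entity.
    for r in records:
        yield {"customer_id": r["id"], "name": r["name"]}


entities = []  # stands in for a target entity store
pipeline = Pipeline(
    source=lambda: [
        {"id": 1, "name": "Ada", "active": True},
        {"id": 2, "name": "Bob", "active": False},
    ],
    target=entities.append,
)
pipeline.add_step(only_active).add_step(to_customer)
delivered = pipeline.run()
```

Because each step only consumes and produces a stream of records, steps can be added, removed, or reordered without changing the source or the target.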
Common Pipeline Setup Patterns
The following patterns illustrate typical ways users start configuring pipelines. They are provided as guidance and examples, not as exhaustive or restrictive options.
Structured Data Pipelines
Structured pipelines are commonly used when ingesting tabular or schema-defined data.
File-based Structured Ingestion
Typically begins with a file source, followed by transformation and mapping into structured entities.
Common use cases:
- Ingesting structured files such as CSV, JSON, Parquet, or Avro
- Performing batch ingestion of structured data from file systems
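As a rough illustration of this pattern, the sketch below reads CSV content, converts each row's types, and maps it into a structured record. It uses only the Python standard library. The field names, the `map_to_order` function, and the inline sample data are assumptions made for the example, not part of Data Fusion.

```python
import csv
import io

# Inline sample data stands in for a file read from a file system.
CSV_TEXT = "id,name,amount\n1,Ada,10.50\n2,Bob,7.25\n"


def read_csv_source(text):
    # File source: yields one dict per row, keyed by the header line.
    return csv.DictReader(io.StringIO(text))


def map_to_order(row):
    # Transformation and mapping: cast string values and rename fields
    # to match a hypothetical "order" entity.
    return {
        "order_id": int(row["id"]),
        "customer": row["name"],
        "amount": float(row["amount"]),
    }


orders = [map_to_order(row) for row in read_csv_source(CSV_TEXT)]
```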
Cloud or Stream-based Structured Ingestion
Starts from a streaming or cloud source and continuously processes incoming records into structured entities in near real time.
Common use cases:
- Ingesting event streams (for example, message brokers or streaming platforms)
- Processing cloud-native data feeds that emit structured records over time
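The difference from batch ingestion is that records are handled one at a time as they arrive. The following minimal sketch simulates that with a Python generator. The `event_stream` generator stands in for a real broker or cloud feed, and the message format and field names are hypothetical.

```python
import json


def event_stream():
    # Stand-in for a message broker subscription: yields messages one by one.
    yield '{"sensor": "a", "temp_c": 20.0}'
    yield '{"sensor": "b", "temp_c": 25.0}'


def to_reading(message):
    # Parse one JSON message and map it into a structured record.
    payload = json.loads(message)
    return {
        "sensor": payload["sensor"],
        "temp_f": payload["temp_c"] * 9 / 5 + 32,
    }


readings = []  # stands in for a target entity store
for message in event_stream():
    # Each record is processed and delivered as soon as it is received.
    readings.append(to_reading(message))
```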
External Database Access (Virtual, No Ingestion)
Connects to a SQL source system to query data directly without ingestion.
Common use cases:
- Accessing data through virtual types for real-time querying
- Using reference or lookup data directly from the source system
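The key property of virtual access is that nothing is copied: every read goes to the source, so results always reflect its current state. The sketch below illustrates this with an in-memory SQLite database standing in for the external SQL source system. The `country` table and the `lookup_country` function are hypothetical, and this is not how Data Fusion implements virtual types.

```python
import sqlite3

# An in-memory SQLite database stands in for the external source system.
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
source_db.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("DE", "Germany"), ("FR", "France")],
)
source_db.commit()


def lookup_country(conn, code):
    # Query the source directly on every call; no local copy is kept.
    row = conn.execute(
        "SELECT name FROM country WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None


before = lookup_country(source_db, "FR")

# A change made in the source system is visible on the next lookup,
# because no ingested copy exists that could become stale.
source_db.execute("UPDATE country SET name = 'French Republic' WHERE code = 'FR'")
after = lookup_country(source_db, "FR")
```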
External Database Ingestion (With CDC)
Enables change data capture (CDC) on a SQL source system, allowing incremental inserts and updates to flow through the pipeline.
Common use cases:
- Near-real-time propagation of inserts and updates from source systems
- Incremental data processing based on detected changes
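Conceptually, CDC replaces full reloads with a stream of change events that are applied to the target by key. The sketch below shows one way to apply insert and update events incrementally. The event shape (`op`, `key`, `row`) and the `apply_changes` function are assumptions made for illustration. The actual change format depends on the source system and on Data Fusion itself.

```python
def apply_changes(target, changes):
    """Apply hypothetical CDC events to a target keyed by primary key."""
    applied = 0
    for change in changes:
        key = change["key"]
        if change["op"] == "insert":
            # New row: store a copy under its primary key.
            target[key] = dict(change["row"])
        elif change["op"] == "update":
            # Changed row: merge only the columns carried by the event.
            target.setdefault(key, {}).update(change["row"])
        else:
            raise ValueError(f"unsupported operation: {change['op']}")
        applied += 1
    return applied


# Current state of the target, then two detected changes from the source.
target = {1: {"name": "Ada", "tier": "basic"}}
changes = [
    {"op": "update", "key": 1, "row": {"tier": "gold"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "tier": "basic"}},
]
applied = apply_changes(target, changes)
```

Only the two changed rows are touched; the rest of the target is left as it was, which is what makes the processing incremental.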
Unstructured Data Pipelines
Unstructured pipelines are designed for document-centric workflows where data does not conform to a fixed schema.
These pipelines commonly include:
- File or document sources
- Document parsers
- Metadata extractors
- Entity extraction
- Optional embedding models
- Target entities for downstream analytics or search
You may include metadata extraction alone, or combine metadata extraction with embedding generation, depending on your use case.
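The optional nature of the embedding stage can be sketched as follows. All function names here are hypothetical, and `toy_embedding` is a deliberately simple word-count vector used as a stand-in for a real embedding model. It is not what an actual model would produce.

```python
import re
from collections import Counter


def parse_document(raw):
    # Document parser: treat the first line as the title, the rest as the body.
    title, _, body = raw.partition("\n")
    return {"title": title.strip(), "body": body.strip()}


def extract_metadata(doc):
    # Metadata extractor: derive simple descriptive fields from the document.
    words = re.findall(r"[a-z]+", doc["body"].lower())
    return {"title": doc["title"], "word_count": len(words)}


def toy_embedding(text, vocabulary):
    # Stand-in for an embedding model: counts of each vocabulary word.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


def process(raw, vocabulary=None):
    doc = parse_document(raw)
    entity = extract_metadata(doc)
    if vocabulary is not None:
        # The embedding stage runs only when it is configured.
        entity["embedding"] = toy_embedding(doc["body"], vocabulary)
    return entity


raw = "Pump Manual\nThe pump must be inspected. Replace the pump seal yearly."
metadata_only = process(raw)
with_embedding = process(raw, vocabulary=["pump", "seal", "valve"])
```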
Flexibility Beyond the Examples
These patterns are meant to illustrate common entry points, not define limits. In practice, you can:
- Add any number of transformation nodes
- Extend or reshape pipelines as requirements change
Data Fusion supports free-form composition, allowing pipelines to evolve with your data and business needs.