Understanding Unstructured Data Integration (UDI) Pipelines

Data Fusion is in Beta. Please contact your C3 AI representative to enable this feature.

In C3 AI Studio, Unstructured Data Integration (UDI) pipelines are managed under Data Fusion. They let you ingest, process, and transform unstructured data, such as documents and PDFs, into vector embeddings that can be stored, searched, and retrieved efficiently.

While a Structured DI pipeline starts with a SourceSystem and processes tabular data, a UDI pipeline starts from a Vector Store and is designed for file-based connectors that handle unstructured content.

Key Differences between Structured DI and Unstructured DI

The following table summarizes the key functional differences between structured and unstructured data integration in Data Fusion.

Aspect	Structured DI	Unstructured DI (UDI)
Entry Point	`SourceSystem`	Vector Store
Data Type	Structured/tabular data (tables, CSV, DB)	Unstructured data (PDF)
Pipeline Components	`SourceSystem` → `SourceCollection` → `Source` → `Canonical` → `Entity`	`SourceSystem` → `SourceCollection` → `Document Processor` → `Metadata Extractor (optional)` → `Embedder` → `Vector Store` → `Target Entity`
Goal	Load structured data into entities	Extract text, generate embeddings, and store them for semantic retrieval
Graph Visibility	Always visible when `SourceSystem` is defined	Requires `Vector Store` instance to appear

Supported Connectors

The Unstructured Data Integration (UDI) capability currently supports only external file-based connectors. These connectors let you configure and integrate data stored in file systems or object storage environments directly from the Data Integration perspective.

Supported connector types include:

Amazon S3
Azure Blob Storage
Google Cloud Storage
Other external file system connectors available in your environment

Database, data warehouse, and streaming connectors (such as Snowflake, BigQuery, or Kafka) are not supported for UDI pipelines in this release. Support for additional connector types may be added in future versions.
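
The connector rules above can be summarized as a small lookup. The following is a minimal Python sketch for illustration only: the function name `connector_status` and the two sets are hypothetical and are not part of any C3 AI API. Connectors that this page does not name are reported as unknown rather than guessed.

```python
# Illustrative only: these names are hypothetical, not C3 AI APIs.
# File-based connectors listed on this page as supported for UDI.
FILE_BASED_CONNECTORS = {"Amazon S3", "Azure Blob Storage", "Google Cloud Storage"}

# Examples this page names as not supported in this release.
UNSUPPORTED_EXAMPLES = {"Snowflake", "BigQuery", "Kafka"}


def connector_status(name: str) -> str:
    """Classify a connector name using only the lists given on this page."""
    if name in FILE_BASED_CONNECTORS:
        return "supported"
    if name in UNSUPPORTED_EXAMPLES:
        return "not supported"
    # Other external file system connectors depend on your environment.
    return "unknown"


for candidate in ["Amazon S3", "Kafka", "SFTP"]:
    print(candidate, "->", connector_status(candidate))
```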

UDI Pipeline Components

This section describes each component in the Unstructured Data Integration (UDI) pipeline and its role in the ingestion flow.

Document Processor (node) ↔ `GenaiCore.Unstructured.Processor`: An ordered list of steps, each a `ProcessorComponent` such as a parser, chunker, or formatter. The recommended approach is to use the predefined processors exposed by the GenaiCore platform rather than creating new ones manually. These predefined processors are optimized for compatibility across UDI pipeline components.
Metadata Extractor (node, optional) ↔ a `ProcessorComponent` that adds or enriches fields. Modifying or replacing a document processor can affect or remove the metadata extractor if the new processor is not configured to support metadata extraction.
Embedder (node) ↔ `GenaiCore.Embedder.Engine`: Generates embeddings, for example by using a Hugging Face model runtime.
Vector Store (node) ↔ `GenaiCore.VectorStore.*`: Persists embeddings and references the Target Entity.

Each node in the Unstructured Data Integration (UDI) canvas corresponds to an underlying C3 Type that manages the data processing logic.
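
To make the flow between these components concrete, here is a minimal conceptual sketch in plain Python. It does not use the C3 AI platform or any GenaiCore type: every name in it (`Chunk`, `parse`, `chunk`, `extract_metadata`, `embed`, `run_pipeline`) is hypothetical, and the "embedding" is a toy two-number vector rather than a real model output. It only illustrates the order of the steps: parse, chunk, optionally enrich metadata, embed, then collect records for storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-ins for illustration; these are NOT C3 AI types.
@dataclass
class Chunk:
    """A piece of document text plus any metadata attached to it."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse(raw: bytes) -> str:
    """Parser step: decode raw file bytes into text."""
    return raw.decode("utf-8", errors="replace")


def chunk(text: str, size: int = 40) -> List[Chunk]:
    """Chunker step: split text into pieces of at most `size` characters."""
    return [Chunk(text[i:i + size]) for i in range(0, len(text), size)]


def extract_metadata(chunks: List[Chunk], source_name: str) -> List[Chunk]:
    """Optional metadata step: record the source name and chunk position."""
    for position, item in enumerate(chunks):
        item.metadata.update({"source": source_name, "position": str(position)})
    return chunks


def embed(text: str) -> List[float]:
    """Toy embedder: vowel and space ratios, standing in for a real model."""
    length = max(len(text), 1)
    vowels = sum(ch in "aeiou" for ch in text.lower())
    spaces = text.count(" ")
    return [vowels / length, spaces / length]


def run_pipeline(raw: bytes, source_name: str, use_metadata: bool = True) -> List[dict]:
    """Run the steps in order and return records ready to be persisted."""
    chunks = chunk(parse(raw))
    if use_metadata:
        chunks = extract_metadata(chunks, source_name)
    return [
        {"vector": embed(item.text), "text": item.text, "metadata": item.metadata}
        for item in chunks
    ]


records = run_pipeline(b"a" * 100, "example.txt")
print(len(records), records[0]["metadata"])
```

In this sketch, skipping the metadata step (`use_metadata=False`) leaves each record's metadata empty, which mirrors how the Metadata Extractor is optional in the pipeline.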

Prerequisites and Setup Order

Before configuring an Unstructured Data Integration (UDI) pipeline, ensure that your application includes the core components for unstructured data processing. The Processor and Embedder types must be configured or extended to define how documents such as text, PDF, or CSV files are parsed and embedded before being ingested into the Vector Store.

Ensure that the following components are defined and available in your environment:

Node / Component	C3 Type or Concept	Description	Required
Source System	`SourceSystem`	Defines the external system or file repository (for example, S3, Azure Blob, GCS) from which unstructured data will be ingested.	✅ Yes
Source Collection	`SourceCollection`	Specifies which subset or folder within the source system will be used for ingestion.	✅ Yes
Document Processor	`GenaiCore.Unstructured.Processor`	Defines the ordered pipeline of ingestion steps (for example, parsing, chunking, formatting).	✅ Yes
Metadata Extractor (optional)	`ProcessorComponent`	Adds or enriches metadata fields before embedding. Used only if additional document-level metadata is needed.	⚙ Optional
Embedder	`GenaiCore.Embedder.Engine`	Encodes document chunks into embeddings using a model (for example, a Hugging Face or OpenAI model).	✅ Yes
Vector Store	`GenaiCore.VectorStore.*`	Manages the link between the generated embeddings and the target entity where they are stored. The Vector Store defines the storage path and configuration for persisting embeddings within the corresponding entity.	✅ Yes
Entity (Target)	Application-defined entity	The entity that stores the processed embeddings and document metadata.	✅ Yes
Job (Execution)	`GenaiCore.Unstructured.Job`	Runs the pipeline at scale. It does not need to be preconfigured; the app creates it automatically once the pipeline is fully set up.	⚙ Automatically created

You do not manually upload data into the Vector Store. Instead, the Job ingests raw documents from the SourceSystem, processes them through your pipeline, and writes embeddings to the Vector Store.
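
The prerequisites table can be read as a checklist. The sketch below shows one way to express it in plain Python. It is an illustration only: the component names are copied from the table, while `missing_components` and `ready_to_run` are hypothetical helpers, not platform functions. The Job is left out because the table states it is created automatically.

```python
from typing import Iterable, List

# Components marked "Yes" in the Required column of the table above.
REQUIRED_COMPONENTS = [
    "Source System",
    "Source Collection",
    "Document Processor",
    "Embedder",
    "Vector Store",
    "Entity (Target)",
]

# Components marked "Optional" in the table above.
OPTIONAL_COMPONENTS = ["Metadata Extractor"]


def missing_components(defined: Iterable[str]) -> List[str]:
    """Return required components that are not yet defined, in table order."""
    defined_set = set(defined)
    return [name for name in REQUIRED_COMPONENTS if name not in defined_set]


def ready_to_run(defined: Iterable[str]) -> bool:
    """True when every required component is defined."""
    return not missing_components(defined)


partial_setup = ["Source System", "Source Collection", "Embedder"]
print(missing_components(partial_setup))
print(ready_to_run(REQUIRED_COMPONENTS))
```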
