Understanding Unstructured Data Integration (UDI) Pipelines
Data Fusion is in Beta. Please contact your C3 AI representative to enable this feature.
In C3 AI Studio, Unstructured Data Integration (UDI) pipelines are managed under Data Fusion. They enable you to ingest, process, and transform unstructured data, such as documents and PDFs, into vector embeddings that can be stored, searched, and retrieved efficiently.
While a Structured DI pipeline starts with a SourceSystem and processes tabular data, a UDI pipeline starts from a Vector Store and is designed for file-based connectors that handle unstructured content.
Key Differences between Structured DI and Unstructured DI
The following table summarizes the key functional differences between structured and unstructured data integration in Data Fusion.
| Aspect | Structured DI | Unstructured DI (UDI) |
|---|---|---|
| Entry Point | SourceSystem | Vector Store |
| Data Type | Structured/tabular data (tables, CSV, DB) | Unstructured data (PDF) |
| Pipeline Components | SourceSystem → SourceCollection → Source → Canonical → Entity | SourceSystem → SourceCollection → Document Processor → Metadata Extractor (optional) → Embedder → Vector Store → Target Entity |
| Goal | Load structured data into entities | Extract text, generate embeddings, and store them for semantic retrieval |
| Graph Visibility | Always visible when SourceSystem is defined | Appears only after a Vector Store instance is defined |
Supported Connectors
The Unstructured Data Integration (UnstructuredDI) capability currently supports only external file-based connectors. These connectors allow you to configure and integrate data stored in file systems or object storage environments directly from the Data Integration perspective.
Supported connector types include:
Amazon S3
Azure Blob Storage
Google Cloud Storage
Other external file system connectors available in your environment
Database, data warehouse, and streaming connectors (such as Snowflake, BigQuery, or Kafka) are not supported for UnstructuredDI pipelines in this release. Support for additional connector types may be added in future versions.
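The connector rules above can be expressed as a small pre-flight check. This is a minimal sketch in plain Python, not a C3 AI API: the function name `udi_connector_status` and both sets are hypothetical and only restate the connector names listed in this section.

```python
# Hypothetical helper, not part of the C3 AI platform. It classifies a
# connector name using only the categories named in this section.
SUPPORTED_FILE_CONNECTORS = {
    "amazon s3",
    "azure blob storage",
    "google cloud storage",
}
UNSUPPORTED_CONNECTORS = {"snowflake", "bigquery", "kafka"}


def udi_connector_status(name: str) -> str:
    """Return 'supported', 'unsupported', or 'unknown' for a connector name."""
    key = name.strip().lower()
    if key in SUPPORTED_FILE_CONNECTORS:
        return "supported"
    if key in UNSUPPORTED_CONNECTORS:
        return "unsupported"
    # Other external file system connectors depend on your environment,
    # so this sketch does not guess.
    return "unknown"
```

A result of `unknown` means the name is not covered by this section; confirm availability in your environment before building a pipeline on it.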
UDI Pipeline Components
This section describes each component in the Unstructured Data Integration (UDI) pipeline and its role in the ingestion flow.
Document Processor (node) ↔ GenaiCore.Unstructured.Processor
An ordered list of steps, each a ProcessorComponent such as parser, chunker, or formatter.
It is recommended to use predefined processors exposed by the GenaiCore platform rather than creating new ones manually. These predefined processors are optimized for compatibility across UDI pipeline components.

Metadata Extractor (node, optional) ↔ a ProcessorComponent that adds or enriches fields.
Modifying or replacing a document processor can affect or remove the metadata extractor if the new processor is not configured to support metadata extraction.

Embedder (node) ↔ GenaiCore.Embedder.Engine
For example, uses a Hugging Face model runtime for generating embeddings.

Vector Store (node) ↔ GenaiCore.VectorStore.*
Persists embeddings and references the Target Entity.

Each node in the Unstructured Data Integration (UDI) canvas corresponds to an underlying C3 Type that manages the data processing logic.
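To make the data flow between these components concrete, the following is a conceptual sketch in plain Python. It does not use or mirror the GenaiCore types: every function and class name here is a hypothetical stand-in, the embedder is a toy character-count hash rather than a model, and the vector store is an in-memory list.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# All names below are hypothetical stand-ins, not GenaiCore types.


def parse(raw: bytes) -> str:
    """Parser step: decode raw file bytes into text."""
    return raw.decode("utf-8", errors="replace")


def chunk(text: str, size: int = 40) -> List[str]:
    """Chunker step: split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def format_chunk(piece: str) -> str:
    """Formatter step: collapse runs of whitespace."""
    return " ".join(piece.split())


def process(raw: bytes) -> List[str]:
    """Document processor: run the ordered steps and drop empty chunks."""
    pieces = [format_chunk(p) for p in chunk(parse(raw))]
    return [p for p in pieces if p]


def extract_metadata(file_name: str) -> Dict[str, str]:
    """Optional metadata extractor: attach document-level fields."""
    return {"source_file": file_name}


def embed(text: str, dims: int = 8) -> List[float]:
    """Toy embedder: hashed character counts, L2-normalized.

    A real embedder would call a model runtime instead.
    """
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class InMemoryVectorStore:
    """Stand-in for a vector store: keeps (vector, text, metadata) rows."""
    rows: List[Tuple[List[float], str, Dict[str, str]]] = field(default_factory=list)

    def add(self, vector: List[float], text: str, metadata: Dict[str, str]) -> None:
        self.rows.append((vector, text, metadata))

    def search(self, query: str, k: int = 1) -> List[Tuple[str, Dict[str, str]]]:
        """Rank rows by dot product with the query embedding (unit vectors)."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: -sum(a * b for a, b in zip(q, r[0])))
        return [(text, meta) for _, text, meta in ranked[:k]]


def ingest(store: InMemoryVectorStore, file_name: str, raw: bytes) -> int:
    """Process one file, embed each chunk, and persist it. Returns chunk count."""
    meta = extract_metadata(file_name)
    pieces = process(raw)
    for piece in pieces:
        store.add(embed(piece), piece, meta)
    return len(pieces)
```

The point of the sketch is the ordering: the processor produces chunks, the optional extractor adds fields, the embedder turns each chunk into a vector, and the store persists the vector together with its text and metadata so it can be retrieved later.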
Prerequisites and Setup Order
Before configuring an Unstructured Data Integration (UDI) pipeline, ensure that your application includes the core components for unstructured data processing. The Processor and Embedder types must be configured or extended to define how documents such as text, PDF, or CSV files are parsed and embedded before being ingested into the Vector Store.
Ensure that the following components are defined and available in your environment:
| Node / Component | C3 Type or Concept | Description | Required |
|---|---|---|---|
| Source System | SourceSystem | Defines the external system or file repository (for example, S3, Azure Blob, GCS) from which unstructured data will be ingested. | ✅ Yes |
| Source Collection | SourceCollection | Specifies which subset or folder within the source system will be used for ingestion. | ✅ Yes |
| Document Processor | GenaiCore.Unstructured.Processor | Defines the ordered pipeline of ingestion steps (for example, parsing, chunking, formatting). | ✅ Yes |
| Metadata Extractor (optional) | ProcessorComponent | Adds or enriches metadata fields before embedding. Used only if additional document-level metadata is needed. | ⚙ Optional |
| Embedder | GenaiCore.Embedder.Engine | Encodes document chunks into embeddings using a model (for example, Hugging Face, OpenAI). | ✅ Yes |
| Vector Store | GenaiCore.VectorStore.* | Manages the link between the generated embeddings and the target entity where they are stored. The Vector Store defines the storage path and configuration for persisting embeddings within the corresponding entity. | ✅ Yes |
| Entity (Target) | Application-defined entity | The entity that stores the processed embeddings and document metadata. | ✅ Yes |
| Job (Execution) | GenaiCore.Unstructured.Job | Runs the pipeline at scale. It does not need to be preconfigured; the app creates it automatically once the pipeline is fully set up. | ⚙ Automatically created |
You do not manually upload data into the Vector Store. Instead, the Job ingests raw documents from the SourceSystem, processes them through your pipeline, and writes embeddings to the Vector Store.
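The prerequisite table can also be read as a checklist. The sketch below is plain Python and not a C3 AI API: `missing_components` and `ready_for_job` are hypothetical helpers, and the component names are copied from the table above as plain strings.

```python
from typing import Iterable, List

# Component names taken from the prerequisites table. The helpers are
# hypothetical and do not call any C3 AI API.
REQUIRED_COMPONENTS = [
    "Source System",
    "Source Collection",
    "Document Processor",
    "Embedder",
    "Vector Store",
    "Entity (Target)",
]
OPTIONAL_COMPONENTS = ["Metadata Extractor"]


def missing_components(configured: Iterable[str]) -> List[str]:
    """Return required components that are not yet configured, in table order."""
    have = set(configured)
    return [name for name in REQUIRED_COMPONENTS if name not in have]


def ready_for_job(configured: Iterable[str]) -> bool:
    """True when every required component is present.

    The Job itself is not listed because the app creates it automatically.
    """
    return not missing_components(configured)
```

For example, `missing_components(["Source System", "Embedder"])` lists the four required components still to be defined, while the optional Metadata Extractor never blocks readiness.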