
Understanding Unstructured Data Integration (UDI) Pipelines

In C3 AI Studio, Unstructured Data Integration (UDI) pipelines are managed under Data Fusion. They let you ingest, process, and transform unstructured data, such as documents and PDFs, into vector embeddings that can be stored, searched, and retrieved efficiently.

While a Structured DI pipeline starts with a SourceSystem and processes tabular data, a UDI pipeline starts from a Vector Store and is designed for file-based connectors that handle unstructured content.

Key Differences between Structured DI and Unstructured DI

The following table summarizes the key functional differences between structured and unstructured data integration in Data Fusion.

| Aspect | Structured DI | Unstructured DI (UDI) |
| --- | --- | --- |
| Entry Point | SourceSystem | Vector Store |
| Data Type | Structured/tabular data (tables, CSV, DB) | Unstructured data (PDF) |
| Pipeline Components | SourceSystem, SourceCollection, Source, Canonical, Entity | SourceSystem, SourceCollection, Document Processor, Metadata Extractor (optional), Embedder, Vector Store, Target Entity |
| Goal | Load structured data into entities | Extract text, generate embeddings, and store them for semantic retrieval |
| Graph Visibility | Always visible when SourceSystem is defined | Requires a Vector Store instance to appear |
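
To make the "semantic retrieval" goal concrete, the following is a minimal sketch of how stored embeddings can be ranked against a query embedding by cosine similarity. It is plain Python and does not use any C3 AI API; the function names, the chunk identifiers, and the three-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored chunks most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hypothetical embeddings keyed by "<file>#<chunk index>".
stored_embeddings = {
    "manual.pdf#0": [0.9, 0.1, 0.0],
    "manual.pdf#1": [0.1, 0.9, 0.0],
    "invoice.pdf#0": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], stored_embeddings, k=2))
```

In a real pipeline the vectors would come from the configured Embedder and the ranking would be handled by the Vector Store; the sketch only shows why storing embeddings makes similarity search possible.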

Supported Connectors

The Unstructured Data Integration (UDI) capability currently supports only external file-based connectors. These connectors let you configure and integrate data stored in file systems or object storage environments directly from the Data Integration perspective.

Supported connector types include:

  • Amazon S3

  • Azure Blob Storage

  • Google Cloud Storage

  • Other external file system connectors available in your environment

UDI Pipeline Components

This section describes each component in the Unstructured Data Integration (UDI) pipeline and its role in the ingestion flow.

  • Document Processor (node) ↔ GenaiCore.Unstructured.Processor
    An ordered list of steps, each a ProcessorComponent such as a parser, chunker, or formatter.

  • Metadata Extractor (node, optional) ↔ a ProcessorComponent
    Adds or enriches fields.

  • Embedder (node) ↔ GenaiCore.Embedder.Engine
    Generates embeddings, for example by using a Hugging Face model runtime.

  • Vector Store (node) ↔ GenaiCore.VectorStore.*
    Persists embeddings and references the Target Entity.

Each node in the Unstructured Data Integration (UDI) canvas corresponds to an underlying C3 Type that manages the data processing logic.
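
The flow through these components can be sketched conceptually in plain Python. This is an illustration of the ordering only (parse, chunk, format, optionally add metadata, embed, persist) and is not the C3 Type System API: every class and function name below is hypothetical, and the character-count "embedder" is a toy stand-in for a real model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# --- Document Processor: an ordered list of steps -------------------------
def parse(raw: bytes) -> str:
    """Parser step: turn raw file bytes into text."""
    return raw.decode("utf-8", errors="replace")


def split_into_chunks(text: str, size: int = 40) -> List[str]:
    """Chunker step: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def format_chunk(text: str) -> str:
    """Formatter step: collapse runs of whitespace."""
    return " ".join(text.split())


# --- Metadata Extractor (optional) ----------------------------------------
def extract_metadata(source_name: str) -> Dict[str, str]:
    return {"source": source_name}


# --- Embedder (toy stand-in for a real embedding model) -------------------
def embed(text: str, dims: int = 8) -> List[float]:
    vec = [0.0] * dims
    for ch in text:
        vec[ord(ch) % dims] += 1.0
    return vec


# --- Vector Store: persists embeddings into a target entity ---------------
class InMemoryVectorStore:
    def __init__(self) -> None:
        self.target_entity: List[dict] = []  # stands in for the Target Entity

    def upsert(self, chunk: Chunk, vector: List[float]) -> None:
        self.target_entity.append(
            {"text": chunk.text, "metadata": chunk.metadata, "embedding": vector}
        )


def run_pipeline(source_name: str, raw: bytes, store: InMemoryVectorStore,
                 with_metadata: bool = True) -> int:
    """Run every step in order and return the number of chunks written."""
    written = 0
    for piece in split_into_chunks(parse(raw)):
        cleaned = format_chunk(piece)
        if not cleaned:
            continue  # skip chunks that were only whitespace
        meta = extract_metadata(source_name) if with_metadata else {}
        store.upsert(Chunk(cleaned, meta), embed(cleaned))
        written += 1
    return written


store = InMemoryVectorStore()
count = run_pipeline("notes.txt", b"x" * 100, store)
print(count, store.target_entity[0]["metadata"])
```

Keeping each step as a separate function mirrors the idea that a processor is an ordered list of components, so a step such as the chunker could be swapped without touching the embedder or the store.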

Prerequisites and Setup Order

Before configuring an Unstructured Data Integration (UDI) pipeline, ensure that your application includes the core components for unstructured data processing. The Processor and Embedder types must be configured or extended to define how documents such as text, PDF, or CSV files are parsed and embedded before being ingested into the Vector Store.

Ensure that the following components are defined and available in your environment:

| Node / Component | C3 Type or Concept | Description | Required |
| --- | --- | --- | --- |
| Source System | SourceSystem | Defines the external system or file repository (for example, S3, Azure Blob, GCS) from which unstructured data will be ingested. | Yes |
| Source Collection | SourceCollection | Specifies which subset or folder within the source system will be used for ingestion. | Yes |
| Document Processor | GenaiCore.Unstructured.Processor | Defines the ordered pipeline of ingestion steps (for example, parsing, chunking, formatting). | Yes |
| Metadata Extractor (optional) | ProcessorComponent | Adds or enriches metadata fields before embedding. Used only if additional document-level metadata is needed. | Optional |
| Embedder | GenaiCore.Embedder.Engine | Encodes document chunks into embeddings using a model (for example, Hugging Face, OpenAI). | Yes |
| Vector Store | GenaiCore.VectorStore.* | Manages the link between the generated embeddings and the target entity where they are stored. The Vector Store defines the storage path and configuration for persisting embeddings within the corresponding entity. | Yes |
| Entity (Target) | Application-defined entity | The entity that stores the processed embeddings and document metadata. | Yes |
| Job (Execution) | GenaiCore.Unstructured.Job | Runs the pipeline at scale. It does not need to be preconfigured; the app creates it automatically once the pipeline is fully set up. | Automatically created |
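
As a way to reason about the requirements above, the following sketch checks a set of defined components against the required ones before a pipeline would be considered fully set up. It is plain Python and not part of the product: the constant names and the `missing_components` helper are hypothetical, and the component labels are copied from the list above.

```python
# Components that must be defined before the pipeline is complete.
REQUIRED_COMPONENTS = [
    "Source System",
    "Source Collection",
    "Document Processor",
    "Embedder",
    "Vector Store",
    "Entity (Target)",
]

# Components that may be omitted. The Job is not listed at all because
# it is created automatically once the pipeline is fully set up.
OPTIONAL_COMPONENTS = ["Metadata Extractor"]


def missing_components(defined):
    """Return the required components that are not yet defined, in setup order."""
    return [name for name in REQUIRED_COMPONENTS if name not in defined]


defined_so_far = {"Source System", "Source Collection", "Metadata Extractor"}
print(missing_components(defined_so_far))
```

An empty result from `missing_components` corresponds to the point at which the execution job would be created for you.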