Configure the Document Processing Pipeline
C3 Generative AI includes an out-of-the-box data integration pipeline for syncing, chunking, embedding, and persisting the contents of supported text documents.
Connect to remote source systems
The system supports integrations with the storage services of each major cloud provider: Amazon S3, Google Cloud Storage, and Azure Blob Storage. For more information on connecting to remote file systems, see the C3 AI Data Integration guide.
The Genai.SourceSystem types (Genai.SourceSystem.S3, Genai.SourceSystem.Azure, Genai.SourceSystem.Gcs) were deprecated in 8.9. Use the C3 AI platform Data Integration (DI) pipeline to connect remote storage instead.
Add external sources
The Genai.SourceCollection Type is used to model external data sources within the application. A Genai.SourceCollection stores a reference to an external file system path, and can be synced to store references to each file in the remote file system. These references are stored in the Genai.SourceFile Type.
A default source collection (default-app-collection) is seeded automatically and linked to the default unstructured pipeline. Additional collections can be created to:
- Map ACL permissions to specific users or groups based on the directory structure of a remote file system
- Independently manage sync and process schedules for different collections of source data
Each source collection should reference a Genai.UnstructuredPipeline that defines how documents are processed (chunking, parsing, metadata extraction, indexing). See Configuring Unstructured Pipelines for details on creating and configuring pipelines for your collections.
Define access control lists (optional)
Apply access control lists (ACLs) to control which user groups have access to a Genai.SourceFile. All users have access to a file by default. To restrict access to a specific source file, run the following code:
var sfId = "<source_file_id>";
var groupsWithAccess = ["<group_name1>", "<group_name2>"]; // add further group names as needed
var sf = Genai.SourceFile.forId(sfId);
sf.withGroups(groupsWithAccess).merge();
After the ACL is updated, the contents of the Genai.SourceFile are available at retrieval time only to users who belong to at least one of the allowed groups.
For full ACL setup instructions, see Other Capabilities.
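The retrieval-time rule described above (unrestricted by default, otherwise visible only to users in at least one allowed group) can be modeled in a few lines of plain JavaScript. This is an illustrative sketch, not the platform's implementation: `canAccess` and the plain-object file shape with a `groups` array are hypothetical names introduced here.

```javascript
// Hypothetical model of the ACL rule; canAccess is not a C3 API.
// file.groups: group names allowed to read the file (missing or empty = unrestricted).
function canAccess(file, userGroups) {
  const allowed = file.groups || [];
  if (allowed.length === 0) {
    return true; // no ACL set: every user has access by default
  }
  // Restricted file: the user must belong to at least one allowed group.
  return userGroups.some((group) => allowed.includes(group));
}

const restrictedFile = { id: 'sf-1', groups: ['finance', 'legal'] };
const openFile = { id: 'sf-2' };

console.log(canAccess(restrictedFile, ['legal']));       // true
console.log(canAccess(restrictedFile, ['engineering'])); // false
console.log(canAccess(openFile, []));                    // true
```

Note that membership in any single allowed group is sufficient; the user does not need to be in all of them.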
Specify chunking behavior
Before embeddings are computed, each document is split into Genai.SourcePassages by a Genai.SourceFile.Chunker. By default, the Genai.SourceFile.Chunker.UniversalChunker.Spec is used, which selects the most appropriate chunking implementation based on the document's file extension. You can override the default configuration of the universal chunker to use a specific chunking implementation for specific file extensions.
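To make the idea of extension-based selection with per-extension overrides concrete, here is a minimal sketch in plain JavaScript. It does not use or reproduce the UniversalChunker configuration; `pickChunker`, `defaultChunkers`, and the chunker names are hypothetical placeholders.

```javascript
// Hypothetical illustration of extension-based dispatch; none of these names are C3 APIs.
const defaultChunkers = { pdf: 'pdfChunker', docx: 'docxChunker', txt: 'textChunker' };

function pickChunker(fileName, overrides = {}) {
  const dot = fileName.lastIndexOf('.');
  const ext = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  // Overrides take precedence over the defaults for the same extension.
  const table = { ...defaultChunkers, ...overrides };
  const known = Object.prototype.hasOwnProperty.call(table, ext);
  return known ? table[ext] : 'textChunker'; // unknown extensions fall back to plain text
}

console.log(pickChunker('report.PDF'));                          // pdfChunker
console.log(pickChunker('notes.md'));                            // textChunker
console.log(pickChunker('report.pdf', { pdf: 'layoutChunker' })); // layoutChunker
```

Keeping the defaults and the overrides in separate tables means an override for one extension leaves the behavior for every other extension unchanged.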
Trigger unstructured data integration
Syncing a Genai.SourceCollection creates Genai.SourceFile instances, which store references to the external source documents:
Genai.SourceCollection.forId('default').sync();
Alternatively, if there are multiple Genai.SourceCollection instances, you can sync them all with:
Genai.SourceCollection.Utils.sync();
Some metadata, such as the file type and the source collection the file belongs to, is captured automatically when a Genai.SourceFile is created by the sync process. Optionally, additional Genai.SourceFile.Metadata can be populated on the source files that a Genai.SourceCollection creates during the sync operation. To do so, use the Genai.SourceCollection#syncMetadataLambda field to define a lambda function on the source collection, which gives you arbitrary control over the metadata fields that are set during ingest. The lambda takes a Genai.SourceFile instance as input and returns a Genai.SourceFile.Metadata instance.
For instance, let's say you want to fill the author of a document on ingest, and want to apply the static value "Joe Blogs" as the author for every source file that is ingested for a given source collection. You could define the following:
// gsc is assumed to be an existing Genai.SourceCollection instance
gsc
  .withField(
    'syncMetadataLambda',
    Lambda.fromJsFunc(function (file) {
      return Genai.SourceFile.Metadata.make({ author: 'Joe Blogs' });
    })
  )
  .upsert();
This lambda runs during document sync and sets the metadata.author field to "Joe Blogs" for every file that belongs to that source collection.
If additional metadata fields are required on the Genai.SourceFile.Metadata Type, the Type can be remixed to include additional fields. See Type Inheritance for more details on how to remix a Type.
Optimize chunking performance
For optimal chunking performance:
- Use GPU nodes for Mew3 processing when available
- Scale CPU nodes for high-volume text-only processing
- Monitor queue activity to track processing progress
- Adjust batch sizes based on document size and available memory
See the Multimodal parsing configuration guide for detailed performance tuning and troubleshooting.
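One way to act on the batch-size advice above is to split the list of files into fixed-size batches before submitting them for processing. The sketch below is plain JavaScript and independent of the platform; `toBatches` is a hypothetical helper, and the right batch size for a given deployment is something to determine empirically.

```javascript
// Hypothetical helper (not a C3 API): split a list into batches of at most batchSize items.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const fileIds = ['a', 'b', 'c', 'd', 'e'];
const batches = toBatches(fileIds, 2);
console.log(batches.length); // 3
console.log(batches[2]);     // [ 'e' ]
```

Smaller batches lower peak memory per call at the cost of more calls, so large documents generally call for a smaller batch size than short text files.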
Configure the unstructured pipeline
Each Genai.SourceCollection has an associated Genai.UnstructuredPipeline that controls how documents are parsed, tagged, and indexed. The pipeline bundles parser settings, metadata tagging settings, and retriever settings into a single reusable configuration.
Use Genai.UnstructuredPipeline#mergeSettings to update pipeline settings for a source collection:
Genai.SourceCollection.forId('default').unstructuredPipeline.mergeSettings({
  retrieverSettings: {
    embedMetadata: true,
  },
});
The Genai.SourceCollection.Metadata.Config#embedMetadata API is deprecated and will be removed in 8.13. Use Genai.UnstructuredPipeline as shown above.
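The difference between merging settings and replacing them can be illustrated with a generic recursive merge. This sketch assumes that a merge-style update changes only the supplied fields and leaves sibling settings intact; the source does not spell out the exact semantics of mergeSettings, so treat this as a conceptual model only. `deepMerge` and the `mode` and `topK` fields are hypothetical.

```javascript
// Conceptual model only: deepMerge is a hypothetical helper, not the platform's mergeSettings.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

function deepMerge(current, patch) {
  const out = { ...current };
  for (const [key, value] of Object.entries(patch)) {
    // Recurse into nested objects so that sibling fields are preserved.
    out[key] = isPlainObject(value) && isPlainObject(current[key])
      ? deepMerge(current[key], value)
      : value;
  }
  return out;
}

// 'mode' and 'topK' are made-up fields used only to show that they survive the merge.
const currentSettings = {
  parserSettings: { mode: 'text' },
  retrieverSettings: { embedMetadata: false, topK: 5 },
};
const updatedSettings = deepMerge(currentSettings, {
  retrieverSettings: { embedMetadata: true },
});

console.log(updatedSettings.retrieverSettings); // { embedMetadata: true, topK: 5 }
console.log(updatedSettings.parserSettings);    // { mode: 'text' }
```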
Create embeddings and index documents
After documents are chunked into passages, those passages are transformed into embeddings using the pipeline configured on the source collection. Use Genai.SourceFile#processWithPipeline to index a batch of Genai.SourceFiles asynchronously:
var files = Genai.SourceFile.fetch().objs;
Genai.SourceFile.processWithPipeline(files);
To process all unprocessed files for a source collection in one call:
Genai.SourceCollection.forId('default').processUnprocessSourceFiles();
Computing embeddings can take a long time. A GPU node reduces indexing time by 10x to 100x for large datasets. See the getting started guide for instructions to configure a GPU.
Validate data integration
After the embeddings are computed and the passages are indexed, you can validate that documents can be retrieved using a similarity search:
Genai.Retriever.PgVector.forId('default-pg').similaritySearch({ searchQuery: 'some words' });
This should return the top passages ordered by semantic similarity to the query.
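For intuition about what "ordered by semantic similarity" means, the following plain JavaScript sketch ranks passages by cosine similarity between a query embedding and each passage embedding. It is a toy model with two-dimensional vectors, not the retriever's implementation; the source does not state which similarity metric the retriever uses, and `cosine`, `topPassages`, and the sample data are hypothetical.

```javascript
// Toy model of similarity ranking; not the PgVector retriever's implementation.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every passage against the query and keep the k highest-scoring ones.
function topPassages(queryVec, passages, k) {
  return passages
    .map((p) => ({ id: p.id, score: cosine(queryVec, p.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const samplePassages = [
  { id: 'p1', embedding: [1, 0] },
  { id: 'p2', embedding: [0, 1] },
  { id: 'p3', embedding: [1, 1] },
];
const ranked = topPassages([1, 0.2], samplePassages, 2);
console.log(ranked.map((r) => r.id)); // [ 'p1', 'p3' ]
```

When validating a real deployment, a useful sanity check is to search for a phrase that appears verbatim in a known document and confirm that a passage from that document ranks near the top.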