Configure the Document Processing Pipeline

C3 Generative AI includes an out-of-the-box data integration pipeline for syncing, chunking, embedding, and persisting the contents of supported text documents.

Connect to remote source systems

The system integrates with the storage services of the major cloud providers: Amazon S3, Google Cloud Storage, and Azure Blob Storage. For more information on connecting to remote file systems, see the C3 AI Data Integration guide.

The Genai.SourceSystem types (Genai.SourceSystem.S3, Genai.SourceSystem.Azure, Genai.SourceSystem.Gcs) were deprecated in 8.9. Use the C3 AI platform Data Integration (DI) pipeline to connect remote storage instead.

Add external sources

The Genai.SourceCollection Type is used to model external data sources within the application. A Genai.SourceCollection stores a reference to an external file system path, and can be synced to store references to each file in the remote file system. These references are stored in the Genai.SourceFile Type.

A default source collection (default-app-collection) is seeded automatically and linked to the default unstructured pipeline. Additional collections can be created to:

Map ACL permissions to specific users or groups based on the directory structure of a remote file system
Independently manage sync and process schedules for different collections of source data

Each source collection should reference a Genai.UnstructuredPipeline that defines how documents are processed (chunking, parsing, metadata extraction, indexing). See Configuring Unstructured Pipelines for details on creating and configuring pipelines for your collections.

Define access control lists (optional)

Apply access control lists (ACLs) to control which user groups have access to a Genai.SourceFile. All users have access to a file by default. To restrict access to a specific source file, run the following code:

JavaScript

var sfId = "<source_file_id>";
var groupsWithAccess = ["<group_name1>", "<group_name2>"]; // add more group names as needed
var sf = Genai.SourceFile.forId(sfId);
sf.withGroups(groupsWithAccess).merge();

After the ACL is updated, the contents of the Genai.SourceFile are available at retrieval time only to users who belong to at least one of the allowed groups.
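The access rule described above can be pictured with a small, platform-independent sketch. It is not the C3 Generative AI implementation: `canAccess` and its arguments are hypothetical names, and the sketch assumes that a file with no groups is unrestricted, matching the default described earlier.

```javascript
// Hypothetical helper, not a platform API: decides whether a user may
// retrieve content from a file, given the file's allowed groups.
function canAccess(fileGroups, userGroups) {
  // Assumption: no groups on the file means no restriction (the default).
  if (!fileGroups || fileGroups.length === 0) {
    return true;
  }
  // Otherwise the user needs membership in at least one allowed group.
  return userGroups.some(function (group) {
    return fileGroups.indexOf(group) !== -1;
  });
}

console.log(canAccess([], ['sales']));                   // true: unrestricted file
console.log(canAccess(['finance', 'legal'], ['legal'])); // true: one shared group
console.log(canAccess(['finance'], ['sales']));          // false: no shared group
```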

For full ACL setup instructions, see Other Capabilities.

Specify chunking behavior

Before embeddings are computed, each document is split into Genai.SourcePassages by a Genai.SourceFile.Chunker. By default, the Genai.SourceFile.Chunker.UniversalChunker.Spec is used, which selects the most appropriate chunking implementation based on the document's file extension. You can override the default configuration of the universal chunker to use a specific chunking implementation for specific file extensions.
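The extension-based selection described above amounts to a lookup table with optional per-extension overrides. The following is a conceptual sketch in plain JavaScript only: `DEFAULT_CHUNKERS`, `selectChunker`, and the chunker labels are hypothetical and do not correspond to fields of the actual spec.

```javascript
// Hypothetical defaults: file extension -> chunker label. Illustrative only.
var DEFAULT_CHUNKERS = {
  pdf: 'pdf-chunker',
  docx: 'word-chunker',
  txt: 'plain-text-chunker',
};

// Returns the chunker label for a file name, preferring per-extension
// overrides and falling back to a generic chunker for unknown extensions.
function selectChunker(fileName, overrides) {
  var dot = fileName.lastIndexOf('.');
  var ext = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  var table = Object.assign({}, DEFAULT_CHUNKERS, overrides || {});
  return Object.prototype.hasOwnProperty.call(table, ext)
    ? table[ext]
    : 'plain-text-chunker';
}

console.log(selectChunker('report.PDF'));                                  // pdf-chunker
console.log(selectChunker('report.pdf', { pdf: 'layout-aware-chunker' })); // layout-aware-chunker
console.log(selectChunker('notes.md'));                                    // plain-text-chunker
```

Keeping the overrides separate from the defaults means a single extension can be redirected without restating the rest of the table.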

Trigger unstructured data integration

Syncing a Genai.SourceCollection creates Genai.SourceFile instances, which store references to the external source documents:

JavaScript

Genai.SourceCollection.forId('default').sync();

Alternatively, if there are multiple Genai.SourceCollection instances, you can sync them all with:

JavaScript

Genai.SourceCollection.Utils.sync();

Some metadata, such as the file type and the source collection the file belongs to, is captured automatically when a Genai.SourceFile is created during sync. Optionally, you can populate additional Genai.SourceFile.Metadata on the source files that a Genai.SourceCollection creates during the sync operation. To do so, set the Genai.SourceCollection#syncMetadataLambda field to a lambda function on the source collection, which gives you arbitrary control over the metadata fields that are set during ingest. The lambda takes a Genai.SourceFile instance as input and returns a Genai.SourceFile.Metadata instance.

For instance, suppose you want to set the author of every document on ingest, applying the static value "Joe Blogs" to each source file ingested for a given source collection. You could define the following:

JavaScript

// gsc is an existing Genai.SourceCollection instance.
gsc
  .withField(
    'syncMetadataLambda',
    Lambda.fromJsFunc(function (file) {
      return Genai.SourceFile.Metadata.make({ author: 'Joe Blogs' });
    })
  )
  .upsert();

The lambda above runs during document sync and populates the metadata.author field with "Joe Blogs" for every file that belongs to that source collection.
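To make the per-file behavior concrete, here is a plain-JavaScript stand-in for the sync step. It is a sketch under stated assumptions: `applyMetadata` and the file objects with an `id` field are hypothetical, and the platform's own sync logic is not shown here.

```javascript
// Hypothetical stand-in for the sync step: calls the metadata function
// once per file and stores the result on a copy of that file's record.
function applyMetadata(files, metadataFn) {
  return files.map(function (file) {
    return Object.assign({}, file, { metadata: metadataFn(file) });
  });
}

// Same idea as the lambda above: every file gets the same author.
function staticAuthor(file) {
  return { author: 'Joe Blogs' };
}

var synced = applyMetadata([{ id: 'a.pdf' }, { id: 'b.docx' }], staticAuthor);
synced.forEach(function (f) {
  console.log(f.id + ': ' + f.metadata.author);
});
```

Because the function receives the file, it could also derive values from the file itself instead of returning a constant.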

If additional metadata fields are required on the Genai.SourceFile.Metadata Type, the Type can be remixed to include additional fields. See Type Inheritance for more details on how to remix a Type.

Optimize chunking performance

For optimal chunking performance:

Use GPU nodes for Mew3 processing when available
Scale CPU nodes for high-volume text-only processing
Monitor queue activity to track processing progress
Adjust batch sizes based on document size and available memory

See the Multimodal parsing configuration guide for detailed performance tuning and troubleshooting.
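As one way to reason about the batch-size recommendation above, the sketch below derives a batch size from an average document size and a memory budget. This is a rough heuristic for illustration, not a platform setting: `estimateBatchSize`, the overhead factor, and the upper bound are all assumptions.

```javascript
// Hypothetical sizing rule: fit as many documents as the memory budget
// allows, clamped to the range [1, maxBatch]. overheadFactor approximates
// how much working memory a document needs relative to its size on disk.
function estimateBatchSize(avgDocBytes, memoryBudgetBytes, overheadFactor, maxBatch) {
  var perDocBytes = avgDocBytes * overheadFactor;
  var fit = Math.floor(memoryBudgetBytes / perDocBytes);
  return Math.max(1, Math.min(maxBatch, fit));
}

var MB = 1024 * 1024;
console.log(estimateBatchSize(2 * MB, 512 * MB, 4, 64));   // 64: small documents hit the cap
console.log(estimateBatchSize(50 * MB, 512 * MB, 4, 64));  // 2: large documents shrink the batch
console.log(estimateBatchSize(500 * MB, 512 * MB, 4, 64)); // 1: never drops below one document
```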

Configure the unstructured pipeline

Each Genai.SourceCollection has an associated Genai.UnstructuredPipeline that controls how documents are parsed, tagged, and indexed. The pipeline bundles parser settings, metadata tagging settings, and retriever settings into a single reusable configuration.

Use Genai.UnstructuredPipeline#mergeSettings to update pipeline settings for a source collection:

JavaScript

Genai.SourceCollection.forId('default').unstructuredPipeline.mergeSettings({
  retrieverSettings: {
    embedMetadata: true,
  },
});

The Genai.SourceCollection.Metadata.Config#embedMetadata API is deprecated and will be removed in 8.13. Use Genai.UnstructuredPipeline as shown above.
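The sketch below illustrates what a merge-style settings update means in general: only the supplied keys change, and everything else is preserved. It assumes recursive merge semantics, which this guide does not confirm for Genai.UnstructuredPipeline#mergeSettings; `mergeSettingsSketch`, `isPlainObject`, and the `mode` and `topK` keys are hypothetical.

```javascript
// True for non-null, non-array objects.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively merges `update` into a copy of `current`. Keys absent from
// `update` keep their current values; arrays and primitives are replaced.
function mergeSettingsSketch(current, update) {
  var result = Object.assign({}, current);
  Object.keys(update).forEach(function (key) {
    var next = update[key];
    var prev = result[key];
    result[key] =
      isPlainObject(next) && isPlainObject(prev)
        ? mergeSettingsSketch(prev, next)
        : next;
  });
  return result;
}

var currentSettings = {
  parserSettings: { mode: 'text' },                         // hypothetical key
  retrieverSettings: { embedMetadata: false, topK: 5 },     // topK is hypothetical
};
var merged = mergeSettingsSketch(currentSettings, {
  retrieverSettings: { embedMetadata: true },
});
console.log(JSON.stringify(merged));
```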

Create embeddings and index documents

After documents are chunked into passages, those passages are transformed into embeddings using the pipeline configured on the source collection. Use Genai.SourceFile#processWithPipeline to index a batch of Genai.SourceFiles asynchronously:

JavaScript

var files = Genai.SourceFile.fetch().objs;
Genai.SourceFile.processWithPipeline(files);

To process all unprocessed files for a source collection in one call:

JavaScript

// Processes every source file in the collection that has not yet been processed.
Genai.SourceCollection.forId('default').processUnprocessSourceFiles();

Computing embeddings can take a long time. A GPU node reduces indexing time by 10x to 100x for large datasets. See the getting started guide for instructions to configure a GPU.

Validate data integration

After the embeddings are computed and the passages are indexed, you can validate that documents can be retrieved using a similarity search:

JavaScript

Genai.Retriever.PgVector.forId('default-pg').similaritySearch({ searchQuery: 'some words' });

This should return the top passages ordered by semantic similarity to the query.
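To illustrate what ordering by semantic similarity means, the following self-contained sketch ranks passages against a query embedding. It assumes cosine similarity and uses toy three-dimensional vectors; the retriever's actual metric and embedding size are not stated in this guide, and `cosineSimilarity` and `topPassages` are hypothetical helpers.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  var dot = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k passages most similar to the query embedding, best first.
function topPassages(queryEmbedding, passages, k) {
  return passages
    .map(function (p) {
      return { id: p.id, score: cosineSimilarity(queryEmbedding, p.embedding) };
    })
    .sort(function (x, y) {
      return y.score - x.score;
    })
    .slice(0, k);
}

// Toy data: p1 points the same way as the query, p3 is orthogonal to it.
var passages = [
  { id: 'p3', embedding: [0, 1, 0] },
  { id: 'p1', embedding: [1, 0, 0] },
  { id: 'p2', embedding: [0.9, 0.1, 0] },
];
var results = topPassages([1, 0, 0], passages, 2);
console.log(results.map(function (r) { return r.id; })); // [ 'p1', 'p2' ]
```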
