Configure the Document Processing Pipeline

C3 Generative AI includes an out-of-the-box data integration pipeline for syncing, chunking, embedding, and persisting the contents of supported text documents.

Connecting to remote source systems

The system integrates with the storage services of the major cloud providers: Amazon S3, Google Cloud Storage, and Azure Blob Storage. For more information on connecting to remote file systems, see the relevant section of the C3 AI Data Integration guide.

Adding external sources

The Genai.SourceCollection Type is used to model external data sources within the application. A Genai.SourceCollection stores a reference to an external file system path, and can be synced to store references to each file in the remote file system. These references are stored in the Genai.SourceFile Type.

Any number of external sources can be created to:

Map ACL permissions to specific users or groups based on the directory structure of a remote file system
Independently manage sync and process schedules for different collections of source data

To create an external source, use the following code sample:

JavaScript

var gscId = 'default';
var name = 'Default Collection';
var description = 'Default source collection';
var rootUrl = FileSystem.mounts().get('data-load');
var targetUrl = FileSystem.mounts().get('vector-store');
var gsc = Genai.SourceCollection.make({
  id: gscId,
  name: name,
  description: description,
  rootUrl: rootUrl,
  targetUrl: targetUrl,
});
gsc.upsert();

(Optional) Defining access control lists (ACLs)

Apply access control lists (ACLs) to control which user groups have access to a Genai.SourceFile. All users have access to a file by default. To restrict access to a specific source file, run the following code:

JavaScript

var sfId = "<source_file_id>";
var groupsWithAccess = ["<group_name1>", "<group_name2>"]; // add further group names as needed
var sf = Genai.SourceFile.forId(sfId);
sf.withGroups(groupsWithAccess).merge();

After the ACL has been updated, the contents of the Genai.SourceFile are available at retrieval time only to users who belong to at least one of the allowed groups.
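
The access rule described above can be modelled in plain JavaScript. This is a minimal sketch for reasoning about the behavior, not the platform's implementation: the function name canAccessSourceFile and both of its parameters are hypothetical, and it assumes that a file with no groups set is the "accessible to all users" default.

```javascript
// Illustrative model only. `allowedGroups` and `userGroups` are
// hypothetical arrays of group names, not platform objects.
function canAccessSourceFile(allowedGroups, userGroups) {
  // Assumption: no groups on the file means every user has access.
  if (!allowedGroups || allowedGroups.length === 0) {
    return true;
  }
  // Otherwise the user must belong to at least one allowed group.
  return userGroups.some(function (group) {
    return allowedGroups.indexOf(group) !== -1;
  });
}

var unrestricted = canAccessSourceFile([], ['analysts']);
var allowed = canAccessSourceFile(['finance', 'legal'], ['legal', 'analysts']);
var denied = canAccessSourceFile(['finance', 'legal'], ['analysts']);
```

Here unrestricted and allowed evaluate to true, while denied evaluates to false because the user shares no group with the file.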

Specifying chunking behavior

Before embeddings are computed, each document is split into Genai.SourcePassages by a Genai.SourceFile.Chunker. By default, the Genai.SourceFile.Chunker.UniversalChunker.Spec is used, which selects the most appropriate chunking implementation based on the document's file extension. You can override the default configuration of the universal chunker to use a specific chunking implementation for specific file extensions.
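
To make the idea of extension-based selection with overrides concrete, the following plain-JavaScript sketch dispatches on the file extension and lets a caller replace the chunker for a given extension. Everything in it is hypothetical (the function names, the default mapping, and the fallback size of 200 characters); it does not reflect how the universal chunker is actually configured.

```javascript
// Hypothetical chunkers: each takes text and returns an array of passages.
function chunkByParagraph(text) {
  return text
    .split(/\n\s*\n/)
    .map(function (p) { return p.trim(); })
    .filter(function (p) { return p.length > 0; });
}

function chunkByFixedSize(text, size) {
  var chunks = [];
  for (var i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Assumed default mapping from file extension to chunker.
var defaultChunkers = { txt: chunkByParagraph, md: chunkByParagraph };

function selectChunker(fileName, overrides) {
  var ext = fileName.slice(fileName.lastIndexOf('.') + 1).toLowerCase();
  // Overrides take precedence over the defaults for the same extension.
  var table = Object.assign({}, defaultChunkers, overrides || {});
  if (Object.prototype.hasOwnProperty.call(table, ext)) {
    return table[ext];
  }
  // Unmapped extensions fall back to fixed-size chunks.
  return function (text) { return chunkByFixedSize(text, 200); };
}

var passages = selectChunker('notes.txt')('First paragraph.\n\nSecond paragraph.');
var overridden = selectChunker('notes.txt', {
  txt: function (text) { return chunkByFixedSize(text, 5); }
})('abcdefghij');
```

With the defaults, notes.txt is split on blank lines into two passages; with the override, the same extension is split into five-character chunks instead.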

Triggering unstructured data integration

Syncing a Genai.SourceCollection creates Genai.SourceFile instances, which store references to the external source documents:

JavaScript

Genai.SourceCollection.forId('default').sync();

Alternatively, if there are multiple Genai.SourceCollection instances, you can sync them all with:

JavaScript

Genai.SourceCollection.Utils.sync();

Some metadata, such as the file type and the source collection the file belongs to, is captured automatically when a Genai.SourceFile is created by the sync process. Optionally, additional Genai.SourceFile.Metadata can be populated on the source files that a Genai.SourceCollection creates during the sync operation. To do so, use the Genai.SourceCollection#syncMetadataLambda field to define a lambda function on the source collection, which gives you arbitrary control over the metadata fields that are set during ingest. This lambda function takes a Genai.SourceFile instance as input and returns a Genai.SourceFile.Metadata instance.

For example, suppose you want to set the author of every source file ingested for a given source collection to the static value "Joe Blogs". You could define the following:

JavaScript

gsc
  .withField(
    'syncMetadataLambda',
    Lambda.fromJsFunc(function (file) {
      return Genai.SourceFile.Metadata.make({ author: 'Joe Blogs' });
    })
  )
  .upsert();

The above lambda runs during document sync and populates the metadata.author field with "Joe Blogs" for every file belonging to that source collection.
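
A metadata lambda does not have to return a static value. The sketch below shows, in plain JavaScript, how the fields could be derived from the file itself. It rests on assumptions that should be checked against the Genai.SourceFile Type: the url field on the file object, the folder names, and the helper buildMetadataFields are all hypothetical.

```javascript
// Hypothetical lookup from parent folder name to author.
var authorsByFolder = {
  policies: 'Policy Team',
  manuals: 'Engineering'
};

// Assumption: the file object exposes a `url` string (hypothetical field).
function buildMetadataFields(file) {
  var parts = String(file.url || '').split('/');
  // The parent folder is the second-to-last path segment, if present.
  var folder = parts.length >= 2 ? parts[parts.length - 2] : '';
  var known = Object.prototype.hasOwnProperty.call(authorsByFolder, folder);
  return { author: known ? authorsByFolder[folder] : 'Unknown' };
}

var example = buildMetadataFields({ url: 'data-load/policies/leave.pdf' });
```

Inside a real lambda, the returned fields would be passed to Genai.SourceFile.Metadata.make, as in the sample above. Because the lambda is stored on the source collection, helpers and lookup tables may need to be defined inside the function body rather than referenced from outside it.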

If additional metadata fields are required on the Genai.SourceFile.Metadata Type, the Type can be remixed to include additional fields. See Type Inheritance for more details on how to remix a Type.

Performance optimization for chunking

For optimal chunking performance:

Use GPU nodes for Mew3 processing when available
Scale CPU nodes for high-volume text-only processing
Monitor queue activity to track processing progress
Adjust batch sizes based on document size and available memory

See the Multimodal parsing configuration guide for detailed performance tuning and troubleshooting.

Creating embeddings and indexing documents

After documents have been chunked into passages, those passages must be transformed into embeddings. Optionally, Genai.SourceFile.Metadata fields can also be included in the embeddings for each child passage. In most cases, this improves the relevance of the retrieved source passages, because document metadata values no longer need to appear in the passage text itself. To include Genai.SourceFile.Metadata in the embeddings, set the Genai.SourceCollection.Metadata.Config#embedMetadata field for the relevant Genai.SourceCollection instances:

JavaScript

Genai.SourceCollection.forId('default').config().setConfigValue('embedMetadata', true);
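
One way to picture what embedding metadata means is that the metadata values become part of the text that is embedded. The following plain-JavaScript sketch illustrates that idea only; the actual format the platform uses is not documented here, and the function buildEmbeddingInput and the title field are hypothetical.

```javascript
// Illustrative only: prepend non-empty metadata fields to the passage text
// so that the embedded string carries document-level context.
function buildEmbeddingInput(passageText, metadata) {
  var fields = metadata || {};
  var header = Object.keys(fields)
    .filter(function (key) {
      var value = fields[key];
      return value !== undefined && value !== null && value !== '';
    })
    .sort() // stable field order regardless of how the object was built
    .map(function (key) { return key + ': ' + fields[key]; })
    .join('\n');
  return header ? header + '\n\n' + passageText : passageText;
}

var withMetadata = buildEmbeddingInput('Employees accrue leave monthly.', {
  author: 'Joe Blogs',
  title: 'Leave Policy'
});
var withoutMetadata = buildEmbeddingInput('Employees accrue leave monthly.', {});
```

In this model, a query that mentions "Leave Policy" can match the passage even though the passage text itself never names the document.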

Genai.SourcePassage embeddings are persisted in the vector store configured in QueryEngineConfig#vectorStore. Use the following code sample to index a batch of Genai.SourceFile instances asynchronously:

JavaScript

var files = Genai.SourceFile.fetch().objs;
Genai.SourceFile.process(files);

Computing the embeddings can take a long time, but even a small GPU can significantly improve performance. See the getting started guide for instructions to configure a GPU to accelerate indexing.
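
When a collection contains many files, it may be preferable to submit them for processing in smaller batches rather than in a single call. The helper below is a plain-JavaScript sketch of that splitting step; toBatches is a hypothetical name, and the batch size of 2 is chosen only to keep the example small.

```javascript
// Split an array into consecutive batches of at most `batchSize` items.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  var batches = [];
  for (var i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Placeholder identifiers standing in for fetched Genai.SourceFile objects.
var batches = toBatches(['f1', 'f2', 'f3', 'f4', 'f5'], 2);
```

On the platform, each resulting batch could then be passed to Genai.SourceFile.process in turn, for example by iterating over toBatches(files, n) with a size n suited to the document sizes and available memory.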

Validating data integration

Once the embeddings have been computed and the passages have been indexed, you can verify that the documents are retrievable by running a similarity search:

JavaScript

var vs = Genai.UnstructuredQuery.Engine.Config.inst().vectorStore;
vs.similaritySearch({ searchQuery: 'some words' });

This should return the top passages ordered by semantic similarity to the query.
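
For intuition about what "ordered by semantic similarity" means, the sketch below ranks toy passages against a query vector in plain JavaScript. It assumes cosine similarity, which is a common choice; the metric actually used depends on the configured vector store. The function names and the two-dimensional vectors are hypothetical and stand in for real embeddings.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  var dot = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every passage and return the topK highest-scoring ones.
function rankPassages(queryVector, passages, topK) {
  return passages
    .map(function (p) {
      return { id: p.id, score: cosineSimilarity(queryVector, p.vector) };
    })
    .sort(function (x, y) { return y.score - x.score; })
    .slice(0, topK);
}

var ranked = rankPassages([1, 0], [
  { id: 'a', vector: [0, 1] },
  { id: 'b', vector: [1, 1] },
  { id: 'c', vector: [2, 0] }
], 2);
```

Passage c points in the same direction as the query and ranks first, b ranks second, and a is orthogonal to the query and falls outside the top two.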
