Configuring Unstructured Pipelines
Unstructured pipelines define how documents are processed in C3 Generative AI. Each Genai.SourceCollection references a Genai.UnstructuredPipeline that controls parsing, metadata extraction, and indexing behavior.
Default seeded objects
The application seeds a default pipeline and source collection during initialization:
Default unstructured pipeline
The default-unstructured-pipeline is created with default settings for all processing steps:
- parserSettings: Default parser (Mew3) configuration for chunking and multimodal parsing
- metadataTaggingSettings: Default metadata extraction settings
- retrieverSettings: Default embedding and indexing settings
You can view the default pipeline:
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
Default source collection
The default-app-collection is linked to the default pipeline and is used for files uploaded through the UI:
var collection = Genai.SourceCollection.forId('default-app-collection');
Create a source collection
To create a new source collection with a pipeline:
var collection = Genai.SourceCollection.make({
  id: 'my-collection',
  name: 'My Document Collection',
  description: 'Collection for my documents',
  unstructuredPipeline: { id: 'default-unstructured-pipeline' },
});
collection.upsert();
The collection will use the referenced pipeline for all document processing.
Create a custom pipeline
To create a new pipeline with custom settings:
var pipeline = Genai.UnstructuredPipeline.make({
  id: 'my-custom-pipeline',
  name: 'Custom Processing Pipeline',
  parserSettings: {}, // Empty object populates defaults via beforeCreate hook
  metadataTaggingSettings: {},
  retrieverSettings: {},
});
pipeline.upsert();
The Genai.UnstructuredPipeline#beforeCreate hook automatically populates default values when you pass empty objects ({}). You can then associate collections with this pipeline:
var collection = Genai.SourceCollection.forId('my-collection');
collection.withUnstructuredPipeline(pipeline).merge();
Set sync schedule
Configure automatic syncing for a source collection:
var collection = Genai.SourceCollection.forId('default-app-collection');
var syncSpec = Genai.SourceCollection.SyncSourceSpec.make({
  cronExpression: '0 0 2 * * ?', // Daily at 2 AM
  enable: true,
  shouldProcess: true, // Automatically process files after sync
});
var cronJob = Genai.SourceCollection.Utils.createOrUpdateSyncSchedule(collection, syncSpec);
The cronExpression uses Quartz cron syntax, in which one of the day-of-month or day-of-week fields must be ?. Common patterns:
- 0 0 2 * * ? - Daily at 2 AM
- 0 0 */6 * * ? - Every 6 hours
- 0 0 2 ? * MON - Every Monday at 2 AM
Setting shouldProcess: true automatically triggers document processing after files are synced.
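To make the Quartz field order concrete, the following sketch splits an expression into its named fields. describeQuartzCron is a hypothetical helper written for illustration and is not part of the C3 API. It performs only a shallow check: the field count, and the Quartz rule that exactly one of the two day fields is ?.

```javascript
// Hypothetical helper for illustration only; not part of the C3 API.
// Quartz field order: seconds, minutes, hours, day-of-month, month,
// day-of-week, and an optional year.
var QUARTZ_FIELDS = ['seconds', 'minutes', 'hours', 'dayOfMonth', 'month', 'dayOfWeek', 'year'];

function describeQuartzCron(expression) {
  var parts = expression.trim().split(/\s+/);
  if (parts.length < 6 || parts.length > 7) {
    throw new Error('Expected 6 or 7 fields, got ' + parts.length);
  }
  var fields = {};
  parts.forEach(function (value, i) {
    fields[QUARTZ_FIELDS[i]] = value;
  });
  // Quartz does not support setting both day fields: exactly one must be '?'.
  var domUnset = fields.dayOfMonth === '?';
  var dowUnset = fields.dayOfWeek === '?';
  fields.dayFieldsValid = domUnset !== dowUnset;
  return fields;
}

var daily = describeQuartzCron('0 0 2 * * ?');
var weekly = describeQuartzCron('0 0 2 ? * MON');
```

Here daily.hours is '2' and weekly.dayOfWeek is 'MON'. The sketch does not validate value ranges or special characters, so treat it as a reading aid rather than a validator.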
Update pipeline settings
Use Genai.UnstructuredPipeline#mergeSettings to update pipeline configuration. The function merges your changes with existing settings:
Update specific settings
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
pipeline.mergeSettings({
  parserSettings: {
    chunkOverlap: 100,
    skipImageParsing: false,
  },
});
This merges the specified fields into the existing parserSettings, leaving other fields unchanged.
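The merge behavior described here can be pictured with a plain-JavaScript sketch. mergeSettingsSketch and the stored values below are hypothetical stand-ins, not the platform's mergeSettings implementation or real defaults. The sketch assumes a shallow merge within each settings type and covers only partial updates.

```javascript
// Illustrative sketch only; not the platform implementation.
// Assumes a shallow, field-by-field merge within each settings type.
function mergeSettingsSketch(stored, changes) {
  var result = Object.assign({}, stored);
  Object.keys(changes).forEach(function (type) {
    // Fields named in `changes` win; fields not mentioned keep their stored value.
    result[type] = Object.assign({}, stored[type], changes[type]);
  });
  return result;
}

// Hypothetical stored configuration.
var stored = {
  parserSettings: { chunkOverlap: 50, skipImageParsing: true, parsingPreset: 'text_heavy' },
  retrieverSettings: { embedMetadata: true },
};

var updated = mergeSettingsSketch(stored, {
  parserSettings: { chunkOverlap: 100, skipImageParsing: false },
});
```

After the call, updated.parserSettings.chunkOverlap is 100 while updated.parserSettings.parsingPreset is still 'text_heavy', and retrieverSettings is carried over unchanged.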
Update multiple settings types
pipeline.mergeSettings({
  parserSettings: {
    parsingPreset: 'text_heavy',
  },
  metadataTaggingSettings: {
    completionClientName: 'default-completions',
    retagPreindexedFiles: false,
  },
  retrieverSettings: {
    embedMetadata: true,
    retrieverId: 'my-pg-vector',
  },
});
For detailed parser configuration options, see Multimodal Parsing.
Disable a pipeline step
To disable a processing step, pass null for that settings field:
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
// Disable metadata tagging
pipeline.mergeSettings({
  metadataTaggingSettings: null,
});
When a step is disabled (set to null), the pipeline skips that processing stage entirely. For example, disabling metadataTaggingSettings prevents automatic metadata extraction.
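The skip behavior can be illustrated with a small plain-JavaScript sketch. The enabledSteps helper and the stage names parse, tagMetadata, and index are hypothetical. They only illustrate the idea that a null settings field removes its stage from the run, and treating a missing field as disabled is an assumption of the sketch rather than documented behavior.

```javascript
// Illustrative sketch only; the real pipeline logic lives in the platform.
// Maps each settings field named in this guide to a hypothetical stage name.
var STEP_FOR_SETTINGS = [
  ['parserSettings', 'parse'],
  ['metadataTaggingSettings', 'tagMetadata'],
  ['retrieverSettings', 'index'],
];

function enabledSteps(pipelineConfig) {
  return STEP_FOR_SETTINGS
    .filter(function (pair) {
      // Assumption: a null (or missing) settings field means the stage is skipped.
      return pipelineConfig[pair[0]] != null;
    })
    .map(function (pair) {
      return pair[1];
    });
}

var steps = enabledSteps({
  parserSettings: { chunkOverlap: 100 },
  metadataTaggingSettings: null,
  retrieverSettings: {},
});
// steps is ['parse', 'index']
```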
Reset settings to defaults
To reset settings to their default values, pass an empty object. The example below first clears the existing parser settings with null, then passes {}:
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
// Reset parser settings to defaults
// 1. Clear the current parser settings
pipeline.mergeSettings({
  parserSettings: null,
});
// 2. Pass an empty object to repopulate the defaults
pipeline.mergeSettings({
  parserSettings: {},
});
The empty object {} triggers the same default population logic as Genai.UnstructuredPipeline#beforeCreate, restoring factory defaults for that settings type.
Run the pipeline
There are two ways to run the pipeline on documents:
Automatic processing via UI
When you upload files through the UI, they are automatically associated with the collection's pipeline and processed:
- Navigate to Data > Documents
- Select Upload
- Choose files and enable Automatically update search index
- Files are processed using the collection's unstructuredPipeline settings
Manual processing via backend
To manually trigger processing for specific files:
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({
  filter: "collection.id == 'my-collection' && status.value == 'NOT_INDEXED'",
  limit: 100,
}).objs;
pipeline.run(files, null);
Override settings at runtime
Use Genai.UnstructuredPipeline.ExecuteSpec to override pipeline settings for a specific run without modifying the stored pipeline:
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({ limit: 10 }).objs;
var executeSpec = Genai.UnstructuredPipeline.ExecuteSpec.make({
  parserSettings: {
    skipTableParsing: true, // Skip tables for this run only
  },
  retrieverSettings: {
    embedMetadata: false, // Don't embed metadata for this run
  },
});
pipeline.run(files, executeSpec);
The executeSpec overrides apply only to this execution and do not modify the pipeline's stored configuration.
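One way to picture run-scoped overrides is as a fresh settings object computed for each run, with the stored configuration left untouched. The sketch below is plain JavaScript. effectiveSettings and the stored values are hypothetical and do not reflect how the platform resolves an ExecuteSpec internally.

```javascript
// Illustrative sketch only; `effectiveSettings` is a hypothetical helper.
function effectiveSettings(stored, overrides) {
  var result = {};
  Object.keys(stored).forEach(function (type) {
    var override = overrides ? overrides[type] : undefined;
    // Copy into a fresh object so the stored configuration is never mutated.
    result[type] = Object.assign({}, stored[type], override || {});
  });
  return result;
}

// Hypothetical stored configuration.
var storedConfig = {
  parserSettings: { skipTableParsing: false, chunkOverlap: 100 },
  retrieverSettings: { embedMetadata: true },
};

var runConfig = effectiveSettings(storedConfig, {
  parserSettings: { skipTableParsing: true },
  retrieverSettings: { embedMetadata: false },
});

// Passing null, as in pipeline.run(files, null), yields a copy of the stored settings.
var plainRun = effectiveSettings(storedConfig, null);
```

Here runConfig.parserSettings.skipTableParsing is true for this run, while storedConfig.parserSettings.skipTableParsing remains false.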
See also
- Multimodal Parsing - Detailed parser configuration
- Metadata Extraction - Metadata tagging configuration
- Configure the Document Processing Pipeline - Overview of the processing workflow
- Unstructured Data Ingestion - Upload and sync documents