Configuring Unstructured Pipelines

Unstructured pipelines define how documents are processed in C3 Generative AI. Each Genai.SourceCollection references a Genai.UnstructuredPipeline that controls parsing, metadata extraction, and indexing behavior.

Default seeded objects

The application seeds a default pipeline and source collection during initialization:

Default unstructured pipeline

The default-unstructured-pipeline is created with default settings for all processing steps:

parserSettings: Default parser (Mew3) configuration for chunking and multimodal parsing
metadataTaggingSettings: Default metadata extraction settings
retrieverSettings: Default embedding and indexing settings

You can view the default pipeline:

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

Default source collection

The default-app-collection is linked to the default pipeline and is used for files uploaded through the UI:

JavaScript

var collection = Genai.SourceCollection.forId('default-app-collection');

Create a source collection

To create a new source collection with a pipeline:

JavaScript

var collection = Genai.SourceCollection.make({
  id: 'my-collection',
  name: 'My Document Collection',
  description: 'Collection for my documents',
  unstructuredPipeline: { id: 'default-unstructured-pipeline' },
});
collection.upsert();

The collection will use the referenced pipeline for all document processing.

Create a custom pipeline

To create a new pipeline with custom settings:

JavaScript

var pipeline = Genai.UnstructuredPipeline.make({
  id: 'my-custom-pipeline',
  name: 'Custom Processing Pipeline',
  parserSettings: {}, // Empty object populates defaults via beforeCreate hook
  metadataTaggingSettings: {},
  retrieverSettings: {},
});
pipeline.upsert();

The Genai.UnstructuredPipeline#beforeCreate hook automatically populates default values when you pass empty objects ({}). You can then associate collections with this pipeline:

JavaScript

var collection = Genai.SourceCollection.forId('my-collection');
collection.withUnstructuredPipeline(pipeline).merge();

Set sync schedule

Configure automatic syncing for a source collection:

JavaScript

var collection = Genai.SourceCollection.forId('default-app-collection');

var syncSpec = Genai.SourceCollection.SyncSourceSpec.make({
  cronExpression: '0 0 2 * * ?', // Daily at 2 AM
  enable: true,
  shouldProcess: true, // Automatically process files after sync
});

var cronJob = Genai.SourceCollection.Utils.createOrUpdateSyncSchedule(collection, syncSpec);

The cronExpression uses Quartz cron syntax. Common patterns:

0 0 2 * * ? - Daily at 2 AM
0 0 */6 * * ? - Every 6 hours
0 0 2 ? * MON - Every Monday at 2 AM (Quartz expects ? in the day-of-month field when a day of the week is specified)

Setting shouldProcess: true automatically triggers document processing after files are synced.
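
To make the Quartz field order used by cronExpression easier to check, the following is a minimal sketch in plain JavaScript. The helper describeQuartzCron is hypothetical and is not part of the Genai API. It only splits an expression into named fields and checks the Quartz convention that exactly one of the two day fields is ?. It does not validate ranges or special characters.

```javascript
// Quartz field names in order; the seventh field (year) is optional.
const QUARTZ_FIELDS = ['seconds', 'minutes', 'hours', 'dayOfMonth', 'month', 'dayOfWeek', 'year'];

// Hypothetical helper, not part of the Genai API.
function describeQuartzCron(expression) {
  const parts = expression.trim().split(/\s+/);
  if (parts.length < 6 || parts.length > 7) {
    throw new Error('Expected 6 or 7 fields but got ' + parts.length);
  }
  const fields = {};
  parts.forEach(function (value, index) {
    fields[QUARTZ_FIELDS[index]] = value;
  });
  // Quartz expects '?' in exactly one of the two day fields.
  const placeholders = [fields.dayOfMonth, fields.dayOfWeek].filter(function (value) {
    return value === '?';
  }).length;
  fields.dayFieldsValid = placeholders === 1;
  return fields;
}

// Inspect the daily 2 AM schedule used in the example above.
console.log(describeQuartzCron('0 0 2 * * ?'));
```

Running a check like this before calling createOrUpdateSyncSchedule can catch a five-field Unix-style expression, which Quartz does not accept.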

Update pipeline settings

Use Genai.UnstructuredPipeline#mergeSettings to update pipeline configuration. The function merges your changes with existing settings:

Update specific settings

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

pipeline.mergeSettings({
  parserSettings: {
    chunkOverlap: 100,
    skipImageParsing: false,
  },
});

This merges the specified fields into the existing parserSettings, leaving other fields unchanged.
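
The merge behavior described above can be pictured with plain objects. This is a minimal sketch of field-level merging, not the Genai implementation: mergeStepSettings is a hypothetical function, chunkSize is a made-up field added only to show an untouched value, and the sketch does not model null or empty-object arguments.

```javascript
// Hypothetical stand-in for per-step merging; not the Genai implementation.
function mergeStepSettings(existing, changes) {
  const result = Object.assign({}, existing);
  Object.keys(changes).forEach(function (step) {
    // Copy the stored fields for this step, then overlay only the fields that were passed.
    result[step] = Object.assign({}, existing[step], changes[step]);
  });
  return result;
}

// Illustrative stored settings; chunkSize is an assumed field name.
const storedSettings = {
  parserSettings: { chunkSize: 1000, chunkOverlap: 50, skipImageParsing: true },
  retrieverSettings: { embedMetadata: true },
};

const mergedSettings = mergeStepSettings(storedSettings, {
  parserSettings: { chunkOverlap: 100, skipImageParsing: false },
});

// chunkOverlap and skipImageParsing change; chunkSize and retrieverSettings are unchanged.
console.log(mergedSettings);
```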

Update multiple settings types

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

pipeline.mergeSettings({
  parserSettings: {
    parsingPreset: 'text_heavy',
  },
  metadataTaggingSettings: {
    completionClientName: 'default-completions',
    retagPreindexedFiles: false,
  },
  retrieverSettings: {
    embedMetadata: true,
    retrieverId: 'my-pg-vector',
  },
});

For detailed parser configuration options, see Multimodal Parsing.

Disable a pipeline step

To disable a processing step, pass null for that settings field:

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

// Disable metadata tagging
pipeline.mergeSettings({
  metadataTaggingSettings: null,
});

When a step is disabled (set to null), the pipeline skips that processing stage entirely. For example, disabling metadataTaggingSettings prevents automatic metadata extraction.
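
As an illustration of how a null settings field maps to a skipped stage, here is a minimal sketch with plain objects. The function enabledSteps is hypothetical, and treating a missing field the same as null is an assumption of this sketch rather than documented behavior.

```javascript
// The three settings fields described in this document, in processing order.
const STEP_FIELDS = ['parserSettings', 'metadataTaggingSettings', 'retrieverSettings'];

// Hypothetical helper: list the steps that would run for a given configuration.
function enabledSteps(pipelineSettings) {
  return STEP_FIELDS.filter(function (field) {
    const value = pipelineSettings[field];
    // Assumption: null (or a missing field) means the step is disabled.
    return value !== null && value !== undefined;
  });
}

// Metadata tagging is disabled, so only parsing and retrieval remain.
console.log(enabledSteps({
  parserSettings: {},
  metadataTaggingSettings: null,
  retrieverSettings: {},
}));
```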

Reset settings to defaults

To reset settings to their default values, pass an empty object:

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

// Clear the stored parser settings first
pipeline.mergeSettings({
  parserSettings: null,
});

// Then pass an empty object so the defaults are populated again
pipeline.mergeSettings({
  parserSettings: {},
});

The empty object {} triggers the same default population logic as Genai.UnstructuredPipeline#beforeCreate, restoring factory defaults for that settings type.

Run the pipeline

There are two ways to run the pipeline on documents:

Automatic processing via UI

When you upload files through the UI, they are automatically associated with the collection's pipeline and processed:

Navigate to Data > Documents
Select Upload
Choose files and enable Automatically update search index
Files are processed using the collection's unstructuredPipeline settings

Manual processing via backend

To manually trigger processing for specific files:

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({
  filter: "collection.id == 'my-collection' && status.value == 'NOT_INDEXED'",
  limit: 100,
}).objs;

pipeline.run(files, null);
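
Because the fetch filter above is a plain string, it can help to build it in one place. The following is a minimal sketch. The helper buildSourceFileFilter is hypothetical and not part of the Genai API, and the character whitelist is an assumption chosen to avoid quoting problems rather than a documented rule.

```javascript
// Hypothetical helper, not part of the Genai API: build the filter string used in the example above.
function buildSourceFileFilter(collectionId, status) {
  // Assumption: restrict values to simple identifier characters so no escaping is needed.
  const safe = /^[A-Za-z0-9_-]+$/;
  if (!safe.test(collectionId) || !safe.test(status)) {
    throw new Error('Unexpected characters in filter value');
  }
  return "collection.id == '" + collectionId + "' && status.value == '" + status + "'";
}

// Produces the same filter string as the manual processing example.
console.log(buildSourceFileFilter('my-collection', 'NOT_INDEXED'));
```

The resulting string can be passed as the filter field of the Genai.SourceFile.fetch call shown above.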

Override settings at runtime

Use Genai.UnstructuredPipeline.ExecuteSpec to override pipeline settings for a specific run without modifying the stored pipeline:

JavaScript

var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({ limit: 10 }).objs;

var executeSpec = Genai.UnstructuredPipeline.ExecuteSpec.make({
  parserSettings: {
    skipTableParsing: true, // Skip tables for this run only
  },
  retrieverSettings: {
    embedMetadata: false, // Don't embed metadata for this run
  },
});

pipeline.run(files, executeSpec);

The executeSpec overrides apply only to this execution and do not modify the pipeline's stored configuration.
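
The idea of a per-run override that leaves stored settings untouched can be sketched with plain objects. This is an illustration under stated assumptions, not the Genai implementation: effectiveSettings is a hypothetical function, chunkOverlap is included only as an example of an untouched field, and the sketch does not model disabled (null) steps.

```javascript
// Hypothetical helper: compute the settings for one run without modifying the stored object.
function effectiveSettings(stored, overrides) {
  const effective = {};
  Object.keys(stored).forEach(function (step) {
    const override = overrides ? overrides[step] : undefined;
    // Overlay override fields on a copy, so the stored settings are never mutated.
    effective[step] = Object.assign({}, stored[step], override);
  });
  return effective;
}

// Illustrative stored configuration.
const storedConfig = {
  parserSettings: { skipTableParsing: false, chunkOverlap: 50 },
  retrieverSettings: { embedMetadata: true },
};

const runConfig = effectiveSettings(storedConfig, {
  parserSettings: { skipTableParsing: true },
  retrieverSettings: { embedMetadata: false },
});

console.log(runConfig.parserSettings.skipTableParsing); // value used for this run
console.log(storedConfig.parserSettings.skipTableParsing); // stored value, unchanged
```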
