Configuring Unstructured Pipelines

Unstructured pipelines define how documents are processed in C3 Generative AI. Each Genai.SourceCollection references a Genai.UnstructuredPipeline that controls parsing, metadata extraction, and indexing behavior.

Default seeded objects

The application seeds a default pipeline and source collection during initialization:

Default unstructured pipeline

The default-unstructured-pipeline is created with default settings for all processing steps:

  • parserSettings: Default parser (Mew3) configuration for chunking and multimodal parsing
  • metadataTaggingSettings: Default metadata extraction settings
  • retrieverSettings: Default embedding and indexing settings

You can view the default pipeline:

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

Default source collection

The default-app-collection is linked to the default pipeline and is used for files uploaded through the UI:

JavaScript
var collection = Genai.SourceCollection.forId('default-app-collection');

Create a source collection

To create a new source collection with a pipeline:

JavaScript
var collection = Genai.SourceCollection.make({
  id: 'my-collection',
  name: 'My Document Collection',
  description: 'Collection for my documents',
  unstructuredPipeline: { id: 'default-unstructured-pipeline' },
});
collection.upsert();

The collection will use the referenced pipeline for all document processing.

Create a custom pipeline

To create a new pipeline with custom settings:

JavaScript
var pipeline = Genai.UnstructuredPipeline.make({
  id: 'my-custom-pipeline',
  name: 'Custom Processing Pipeline',
  parserSettings: {}, // Empty object populates defaults via beforeCreate hook
  metadataTaggingSettings: {},
  retrieverSettings: {},
});
pipeline.upsert();

The Genai.UnstructuredPipeline#beforeCreate hook automatically populates default values when you pass empty objects ({}). You can then associate collections with this pipeline:

JavaScript
var collection = Genai.SourceCollection.forId('my-collection');
collection.withUnstructuredPipeline(pipeline).merge();

Set sync schedule

Configure automatic syncing for a source collection:

JavaScript
var collection = Genai.SourceCollection.forId('default-app-collection');

var syncSpec = Genai.SourceCollection.SyncSourceSpec.make({
  cronExpression: '0 0 2 * * ?', // Daily at 2 AM
  enable: true,
  shouldProcess: true, // Automatically process files after sync
});

var cronJob = Genai.SourceCollection.Utils.createOrUpdateSyncSchedule(collection, syncSpec);

The cronExpression uses Quartz cron syntax. Common patterns:

  • 0 0 2 * * ? - Daily at 2 AM
  • 0 0 */6 * * ? - Every 6 hours
  • 0 0 2 ? * MON - Every Monday at 2 AM (Quartz requires ? in the day-of-month field when a day of the week is given)

Setting shouldProcess: true automatically triggers document processing after files are synced.
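
A Quartz expression has six or seven space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year), and one of the two day fields is expected to be ?. The following is a minimal sketch of a pre-check you could run before creating a schedule. checkQuartzCron is a hypothetical helper, not part of the C3 API, and it checks only the field count and the ? rule, not the values inside each field.

```javascript
// Hypothetical helper: a shallow pre-check for Quartz cron strings.
// It validates the field count and the '?' rule only, not field values.
function checkQuartzCron(expression) {
  var fields = expression.trim().split(/\s+/);
  if (fields.length < 6 || fields.length > 7) {
    return { ok: false, reason: 'expected 6 or 7 fields, got ' + fields.length };
  }
  var dayOfMonth = fields[3];
  var dayOfWeek = fields[5];
  // Count how many of the two day fields are left unspecified with '?'.
  var unspecified = (dayOfMonth === '?' ? 1 : 0) + (dayOfWeek === '?' ? 1 : 0);
  if (unspecified !== 1) {
    return { ok: false, reason: "exactly one of day-of-month and day-of-week must be '?'" };
  }
  return { ok: true, reason: null };
}

console.log(checkQuartzCron('0 0 2 * * ?').ok);    // true
console.log(checkQuartzCron('0 0 2 ? * MON').ok);  // true
console.log(checkQuartzCron('0 30 9 * * FRI').ok); // false
```

A check like this catches the most common mistake (a day of the week combined with * in the day-of-month field) before the expression reaches the scheduler.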

Update pipeline settings

Use Genai.UnstructuredPipeline#mergeSettings to update pipeline configuration. The function merges your changes with existing settings.

Update specific settings

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

pipeline.mergeSettings({
  parserSettings: {
    chunkOverlap: 100,
    skipImageParsing: false,
  },
});

This merges the specified fields into the existing parserSettings, leaving other fields unchanged.

Update multiple settings types

JavaScript
pipeline.mergeSettings({
  parserSettings: {
    parsingPreset: 'text_heavy',
  },
  metadataTaggingSettings: {
    completionClientName: 'default-completions',
    retagPreindexedFiles: false,
  },
  retrieverSettings: {
    embedMetadata: true,
    retrieverId: 'my-pg-vector',
  },
});

For detailed parser configuration options, see Multimodal Parsing.

Disable a pipeline step

To disable a processing step, pass null for that settings field:

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

// Disable metadata tagging
pipeline.mergeSettings({
  metadataTaggingSettings: null,
});

When a step is disabled (set to null), the pipeline skips that processing stage entirely. For example, disabling metadataTaggingSettings prevents automatic metadata extraction.

Reset settings to defaults

To reset settings to their default values, pass an empty object:

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');

// Clear the stored parser settings (null disables the step)
pipeline.mergeSettings({
  parserSettings: null,
});

// Pass an empty object to repopulate parser settings with defaults
pipeline.mergeSettings({
  parserSettings: {},
});

The empty object {} triggers the same default population logic as Genai.UnstructuredPipeline#beforeCreate, restoring factory defaults for that settings type.
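
The update rules described above (a partial object merges fields, null disables a step, {} restores defaults) can be modeled in plain JavaScript. This is a sketch of the documented behavior only, not the actual mergeSettings implementation: applySettingsUpdate and the values in DEFAULTS are hypothetical, and how the real function treats a partial update on a disabled step is an assumption here.

```javascript
// Hypothetical default values, used only for this illustration.
var DEFAULTS = {
  parserSettings: { chunkOverlap: 50, skipImageParsing: true },
  metadataTaggingSettings: { retagPreindexedFiles: true },
};

// Model of the documented update rules for a pipeline's settings:
//   key absent  -> the stored value is left untouched
//   null        -> the step is disabled
//   {}          -> defaults are restored for that step
//   { a: 1 }    -> the given fields are merged into the stored value
function applySettingsUpdate(stored, update) {
  var result = Object.assign({}, stored);
  Object.keys(update).forEach(function (key) {
    var change = update[key];
    if (change === null) {
      result[key] = null;
    } else if (Object.keys(change).length === 0) {
      result[key] = Object.assign({}, DEFAULTS[key]);
    } else {
      // Assumption: merging into a disabled (null) step starts from an empty object.
      result[key] = Object.assign({}, stored[key], change);
    }
  });
  return result;
}

var stored = {
  parserSettings: { chunkOverlap: 50, skipImageParsing: true },
  metadataTaggingSettings: { retagPreindexedFiles: true },
};

var merged = applySettingsUpdate(stored, { parserSettings: { chunkOverlap: 100 } });
var disabled = applySettingsUpdate(merged, { metadataTaggingSettings: null });
var reset = applySettingsUpdate(disabled, { parserSettings: {} });

console.log(merged.parserSettings.chunkOverlap);       // 100
console.log(disabled.metadataTaggingSettings);         // null
console.log(reset.parserSettings.chunkOverlap);        // 50
```

Each call returns a new object, so the three states can be compared side by side.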

Run the pipeline

There are two ways to run the pipeline on documents:

Automatic processing via UI

When you upload files through the UI, they are automatically associated with the collection's pipeline and processed:

  1. Navigate to Data > Documents
  2. Select Upload
  3. Choose files and enable Automatically update search index
  4. Files are processed using the collection's unstructuredPipeline settings

Manual processing via backend

To manually trigger processing for specific files:

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({
  filter: "collection.id == 'my-collection' && status.value == 'NOT_INDEXED'",
  limit: 100,
}).objs;

pipeline.run(files, null);
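
When a fetch returns many files, you may prefer to process them in smaller groups. The following is a minimal sketch under stated assumptions: runInBatches is a hypothetical helper, not part of the C3 API, and it relies only on the run(files, executeSpec) call shown above. Whether run is asynchronous or enforces its own size limits is not covered here, so a stand-in pipeline object is used to keep the sketch runnable outside a C3 environment.

```javascript
// Hypothetical helper: split files into fixed-size batches and run each batch.
// `pipeline` is any object with a run(files, executeSpec) method.
function runInBatches(pipeline, files, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  var batchCount = 0;
  for (var start = 0; start < files.length; start += batchSize) {
    // Pass null as the second argument so no runtime overrides are applied.
    pipeline.run(files.slice(start, start + batchSize), null);
    batchCount += 1;
  }
  return batchCount;
}

// Stand-in pipeline that records the size of each batch it receives.
var fakePipeline = {
  batchSizes: [],
  run: function (files, executeSpec) {
    this.batchSizes.push(files.length);
  },
};

var fileStubs = ['a', 'b', 'c', 'd', 'e'];
console.log(runInBatches(fakePipeline, fileStubs, 2)); // 3
console.log(fakePipeline.batchSizes);                  // [ 2, 2, 1 ]
```

In a C3 environment, the pipeline and files returned by the fetch in the example above would take the place of fakePipeline and fileStubs.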

Override settings at runtime

Use Genai.UnstructuredPipeline.ExecuteSpec to override pipeline settings for a specific run without modifying the stored pipeline:

JavaScript
var pipeline = Genai.UnstructuredPipeline.forId('default-unstructured-pipeline');
var files = Genai.SourceFile.fetch({ limit: 10 }).objs;

var executeSpec = Genai.UnstructuredPipeline.ExecuteSpec.make({
  parserSettings: {
    skipTableParsing: true, // Skip tables for this run only
  },
  retrieverSettings: {
    embedMetadata: false, // Don't embed metadata for this run
  },
});

pipeline.run(files, executeSpec);

The executeSpec overrides apply only to this execution and do not modify the pipeline's stored configuration.
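
One way to picture this is that each run computes its own effective settings from the stored configuration plus the overrides, and the stored configuration is never written to. The sketch below models that idea in plain JavaScript. resolveRunSettings and the sample values are hypothetical and do not reproduce how the pipeline resolves settings internally.

```javascript
// Hypothetical model: compute the settings a single run would use.
// `executeSpec` may be null, which means "no overrides".
function resolveRunSettings(stored, executeSpec) {
  var effective = {};
  Object.keys(stored).forEach(function (key) {
    // Copy the stored value so the result never shares state with it.
    var base = stored[key] === null ? null : Object.assign({}, stored[key]);
    var override = executeSpec ? executeSpec[key] : undefined;
    effective[key] = override ? Object.assign({}, base, override) : base;
  });
  return effective;
}

var storedSettings = {
  parserSettings: { skipTableParsing: false, chunkOverlap: 50 },
  retrieverSettings: { embedMetadata: true },
};

var forThisRun = resolveRunSettings(storedSettings, {
  parserSettings: { skipTableParsing: true },
  retrieverSettings: { embedMetadata: false },
});

console.log(forThisRun.parserSettings.skipTableParsing);     // true
console.log(storedSettings.parserSettings.skipTableParsing); // false
```

A later run with a null spec would see the stored values again, which matches the behavior described above.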
