C3 AI Documentation Home

Multimodal Parsing

When you process different files, you often need to extract more than just text. Key information may appear in tables, images, or page layouts. To handle this, multimodal parsing combines layout detection with content segmentation. A layout parser first analyzes the visual structure of each page. It then uses bounding boxes to identify and separate different types of content. This approach enables accurate extraction of diverse content types and supports advanced indexing, retrieval, and search workflows.

Introduction to Mew3

Mew3 Genai.SourceFile.Chunker.Mew3 is a chunking component that supports multimodal parsing. It segments documents into structured, meaningful units using the layout structure to extract text, images, and tables. Mew3 improves the quality and consistency of parsing by integrating layout-based segmentation directly into the parsing pipeline.

By default, Mew3 uses the Docling parser to perform layout analysis. Docling is an open-source toolkit that supports various document formats, including PDF, DOCX, HTML, and images. It uses models such as DocLayNet and TableFormer to detect visual structures and identify tables. This enables Mew3 to obtain accurate bounding boxes and use them to segment content for indexing and search.

Mew3 operates as a modular pipeline. It uses producers to extract layout elements and executors to apply additional processing. This design makes Mew3 easy to configure, debug, and extend for different use cases.

Setting up Mew3

Ensure the environment and application are running with valid large language model credentials. To set the credentials, see Application Initialization.

When you run the quick start with the required model, the platform automatically enables Genai.SourceFile.Chunker.Mew3 as the default chunking component for unstructured files.

To confirm that the application is using Mew3 as the chunking component for unstructured files, run the following command in the Application C3 AI Console:

JavaScript
var chunkerConfig = Genai.SourceFile.Chunker.UniversalChunker.Config.forConfigKey('default');
var fileExtToChunkerSpecMap = C3.Map.fromJson(chunkerConfig.fileExtToChunkerSpecMap);
fileExtToChunkerSpecMap.get('.pdf').get('chunker');

The output should display Genai.SourceFile.Chunker.Mew3, confirming that the configured chunking component is active.

If the Mew3 chunking component is not set up by default, you can follow the steps below.

Enable Mew3

Run the following in the Application C3 AI Console to enable Genai.SourceFile.Chunker.Mew3 for multimodal parsing:

JavaScript
Genai.QuickStart.enableMew3();

This command sets Mew3 as the default chunking component for the files. It uses all default settings required for operation.

Build Mew3 Documentation

Run the following in the Application C3 AI Console to build the documentation for the Mew3 library. This command will take a few minutes to complete.

JavaScript
Genai.SourceFile.Chunker.Mew3.showDocs();

Changing configs in mew3

Verify configuration

JavaScript
var chunkerConfig = Genai.SourceFile.Chunker.UniversalChunker.Config.forConfigKey('default');
var fileExtToChunkerSpecMap = C3.Map.fromJson(chunkerConfig.fileExtToChunkerSpecMap);
fileExtToChunkerSpecMap.get('.pdf').get('chunker');

The output should display Genai.SourceFile.Chunker.Mew3, confirming that the configured chunker is active.

Node Pool Configuration

Mew3 automatically uses a GPU if one is available in your cluster. Run the following command in the Application C3 AI Console to verify GPU access on a task node:

JavaScript
C3.app().nodePool('task').config();

Review the value under the hardwareProfile.gpu field. The system sets this field to 0 when no GPU is configured on the task node.

If a GPU is not available, Mew3 automatically uses the CPU. Chunking is ~25% slower when running on a CPU, but you can increase performance by scaling the number of task nodes. For guidance on scaling task nodes and distributing chunking across nodes, refer to the Environment Sizing Guide.

Configuring Mew3

Mew3 supports more than 90 configuration options. However the default should be powerful enough for most use cases and you should only need to modify a few key settings to control model selection, update behavior, or resource usage.

Perform all configuration steps in this section using JupyterLab. To open a notebook, hover over the application card in C3 AI Studio and select Jupyter. Ensure the notebook runs with the py-mew3 runtime environment.

Access the Configuration

The following command returns the current configuration object:

Python
settings = c3.Genai.ParserSettings.make({'parsingPreset': 'default'})
c3.Genai.ParserSettings.getParserSettings(settings)

The configuration is organized into three main areas:

  • pipelineSpec — Controls all parser and chunker behavior for text, images, tables, and layout.
  • documentSpec - Specifies document processing and cropping strategies
  • skipParsing — Flags to enable or skip text, table, or image parsing.

Note: As of version 8.11, Genai.SourceFile.Chunker.Mew3.Config is deprecated in favor of Genai.ParserSettings. The examples below use Genai.ParserSettings where possible to reflect the recommended long-term API.


The following sections describe how to update specific parameters to customize Mew3 behavior for your environment. Each parameter includes examples and expected values.

Updating pipeline specs

Before you update the pipeline spec, read the Mew3 API documentation with the Genai.SourceFile.Chunker.Mew3.showDocs() command in the C3 AI Console of your application and read the How to change your configs section on the left-hand navigation.

Changing configs in mew3

Understanding the pipeline spec structure is crucial for customizing how Mew3 processes different document types. The pipelineSpec field on Genai.ParserSettings defines the complete parsing workflow, organized hierarchically:

Pipeline SpecificationComponent Specification (layout parser, image parser, text parser, table parser) → File Type (pdf, docx, xlsx) → Processing Steps (modules like DOCLING, FILTER_DOC, etc.)

  • Pipeline Specification
    • Component Specification
      • Text Parser
      • Layout Parser
      • Image Parser
      • Table Parser
        • File Type
          • PDF
          • DOCX
          • XLSX
            • Processing Steps
              • DOCLING
              • FILTER_DOC
              • ...

Each component specification maps file formats to sequential processing steps. For example, the layout parser processes PDFs through steps like "DOCLING" for structure detection, "FILTER_DOC" for confidence filtering, "UNION_BOX" for merging overlapping elements, and "ORDER_BY_COLUMN" for reading order. Processing steps can be defined as simple strings like "UNION_BOX" or as dictionaries with "kind" (module name) and "kwargs" (parameters) for complex configurations.

Layout Parser spec

This pipeline determines how the system detects and processes layout elements on each page, including headers, text, tables, lists, and their positions and uses DOCLING under the hood.

Set bounding box thresholds

Mew3 uses a confidence threshold (through the FILTER_DOC step) to decide which layout elements to keep from the layout parser. By default, FILTER_DOC uses built-in thresholds. To customize these thresholds, replace the "FILTER_DOC" string with a dictionary that specifies bbox_class_threshold:

Python
{
    "kind": "FILTER_DOC",
    "kwargs": {
        "bbox_class_threshold": {
            "Caption": 0.2,
            "Footnote": 0.2,
            "Formula": 0.2,
            "List-item": 0.2,
            "Page-footer": 0.2,
            "Page-header": 0.3,
            "Picture": 0.5,
            "Section-header": 0.2,
            "Table": 0.2,
            "Text": 0.2,
            "Title": 0.2
        }
    }
}

These values set the minimum prediction confidence required to keep a bounding box for each layout class. You can adjust them if you observe unwanted content removal or retention:

  • Lower a value to keep more boxes of that type (for example, Picture: 0.6 → 0.4).
  • Raise a value to remove noisy or low-confidence elements (for example, Page-footer: 0.2 → 0.5).
Python

layout_parser_spec = {
    "pdf": [
    {
        "kind": "DOCLING",
        "kwargs": {}
    },
    "FILTER_DOC",
    "FILTER_EMPTY_TEXT_BBOXES",
    "UNION_BOX",  # Union boxes of same class
    "NO_OVERLAP",  # Remove overlapping boxes of same class
    "ORDER_BY_COLUMN(grid_step=150)",  # Specify column spacing
    "TO_MARKDOWN_CLASS",  # Convert raw layout classifications to standardized markdown types
    "MERGE_LIST",  # Combine consecutive list items into unified list blocks
    "MERGE_MANY_THINGS_BY_COLUMN",  # Group related content elements within the same column layout
    "RESOLVE_CLASS_CONFLICT(iou_threshold=0.85)", # merge two bbox of different class if IoU is large.
    "REMOVE_CONTAINED_BOXES(containment_threshold=0.9)", # when small bboxes are contained inside one large bbox, remove the small ones if overlapping area is large.
    "MERGE_TEXT_UNDER_HEADER", # For a given HEADER, merge all PLAIN below. Operates on per-pages level.
    "TREAT_IMAGE_FILLED_WITH_TEXT_AS_TEXT(threshold=0.7)",
    "GROW_BOX(markdown_class=['PLAIN', 'LIST', 'HEADER', 'CAPTION', 'TABLE', 'FIGURE'], rel_to='pixel', pixel_upper=2, pixel_lower=2, pixel_left=2, pixel_right=2)", # increase size of TABLE, FIGURE, or OTHER bbox
    "RANGE(column='global_idx')"
    ]
}

Use top-level threshold fields

As an alternative to modifying the FILTER_DOC step in the pipeline spec, Mew3 provides three top-level fields that set detection thresholds directly:

FieldDefaultApplies to
layoutDetectionThresholdPicture0.5Picture elements
layoutDetectionThresholdTable0.2Table elements
layoutDetectionThresholdText0.2Body text only (not headers, captions, or footnotes)
JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        layoutDetectionThresholdPicture: 0.6,
        layoutDetectionThresholdTable: 0.3
    }
});

Text parser spec

This pipeline parses text content such as plain text, headers, lists, and captions from the files. It filters relevant blocks and assigns structural context, including header hierarchy.

Python
text_parser_spec = {
    "pdf": [
        {
            "kind": "FILTER",  # only keep text
            "kwargs": {
                "column": "markdown_class",
                "include": ["HEADER", "PLAIN", "LIST", "CAPTION"],
            },
        },
        "TEXT_EXTRACTOR.DEFAULT",  # default: use cropping method
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # get parent header for text by searching for previous bboxes
            "kwargs": {
                "extract_header_for_type": ["PLAIN", "LIST"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": true
            },
        },
        "TAG(column='parser', tag='TEXT')",  # add 'parser' field to bbox
    ]
}

This setup extracts useful text from the document, including headings, paragraphs, lists, and captions. It determines the appropriate heading for each part, removes elements that are not text, and labels the result so later steps recognize it as output from a text parser.

You usually do not need to change this unless you are creating a custom setup.

Image parser spec

This pipeline detects and processes images in documents. It extracts each image, collects nearby text such as captions, and uses a language model to generate a description of the image. The image and its description are saved for further use.

Python
image_parser_spec = {
    "pdf": [
        {
            "kind": "GET_SURROUNDING_TEXT_CACHED",  # extract text around image using adjacent bboxes
            "kwargs": {
                "current_bbox_typ": "FIGURE",
                "extract_bbox_with_offsets": [-3, -2, -1, 1, 2, 3],
                "extract_bbox_with_type": ["CAPTION", "PLAIN", "HEADER", "LIST"],
                "enable_cross_page_search": false
            }
        },
        {
            "kind": "CAPTION_FINDER.LOCALITY_CACHED",  # search for caption in adjacent bboxes
            "kwargs": {
                "current_bbox_typ": ["FIGURE"],
                "search_bbox_with_offsets": [-1, 1, -2, 2, -3, 3],  # represents the relative index to search for caption in sequence. Elements at front have higher priorities.
                "target_bbox_type": "CAPTION",
                "bbox_types_to_search_for": ["CAPTION", "PLAIN", "HEADER"],
                "enable_cross_page_search": false
            }
        },
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # extract parent header information for context
            "kwargs": {
                "extract_header_for_type": ["FIGURE"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": true
            }
        },
        {
            "kind": "FILTER",  # only keep FIGURE
            "kwargs": {
                "column": "markdown_class",
                "include": ["FIGURE"]
            }
        },
        "IMAGE_EXTRACT",
        "IMAGE_VERBAL.LLM",  # disable verbalization: "IMAGE_VERBAL.NULL"
        "SAVE_IMAGE_C3",  # save image to FileSystem
        {
            "kind": "TAG",  # add 'parser' field to bbox
            "kwargs": {
                "column": "parser",
                "tag": "IMAGE"
            }
        }
    ]
}

Table parser spec

This pipeline finds tables in documents and uses a language model to summarize the contents of each table. It also captures surrounding text to provide context for each table and saves the output as a table image.

Python
table_parser_spec = {
    "pdf": [
        {
            "kind": "GET_SURROUNDING_TEXT_CACHED",  # extract text around table using adjacent bboxes
            "kwargs": {
                "current_bbox_typ": "TABLE",
                "extract_bbox_with_offsets": [-3, -2, -1, 1, 2, 3],
                "extract_bbox_with_type": ["CAPTION", "PLAIN", "HEADER", "LIST"],
                "enable_cross_page_search": false
            }
        },
        {
            "kind": "CAPTION_FINDER.LOCALITY_CACHED",  # search for caption in adjacent bboxes
            "kwargs": {
                "current_bbox_typ": ["TABLE"],
                "search_bbox_with_offsets": [-1, 1, -2, 2, -3, 3],  # represents the relative index to search for caption in sequence. Elements at front have higher priorities.
                "target_bbox_type": "HEADER",
                "bbox_types_to_search_for": ["HEADER", "CAPTION", "PLAIN"],
                "enable_cross_page_search": false
            }
        },
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # extract parent header information for context
            "kwargs": {
                "extract_header_for_type": ["TABLE"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": true
            }
        },
        "RENAME(source='caption', target='title')",  # rename field on bbox
        {
            "kind": "FILTER",  # only keep TABLE
            "kwargs": {
                "column": "markdown_class",
                "include": ["TABLE"]
            }
        },
        "TABLE_IMAGE_EXTRACT",  # crop image from page
        "TABLE_VERBAL.LLM",  # non-LLM options: "TABLE_VERBAL.TITLE_AND_CONTENT", "TABLE_VERBAL.NULL"
        "SAVE_TABLE_IMAGE_C3",  # save table image to FileSystem
        "TAG(column='parser', tag='TABLE')"  # add 'parser' field to bbox
    ]
}

Text chunker spec

Splits long text blocks into smaller chunks so they can be indexed, searched, or used by language models more effectively.

Python
text_chunker_spec = {
    "pdf": [
        {
            "kind": "LONG_CHAIN",  # splits long text into overlapping chunks
            "kwargs": {
                "column_to_chunk": "text",
                "split_overlap": 200  # overlap between consecutive chunks in characters
            }
        }
    ]
}
  • split_overlap — Controls character overlap between chunks. Higher values (like 200) provide more context continuity but increase duplication. Lower values reduce duplication but may lose context at chunk boundaries.

Table chunker spec

Splits the LLM-generated description of a table into smaller chunks. This helps improve retrieval performance and allows language models to process long table summaries more efficiently.

Python
table_chunker_spec = {
    "pdf":[
        {
            "kind": "LONG_CHAIN",  # splits table verbalization into overlapping chunks
            "kwargs": {
                "column_to_chunk": "verbalization",
                "split_overlap": 200,  # overlap between consecutive chunks in characters
                "split_length": 1000   # maximum chunk size in characters
            }
        }
    ]
}
  • split_length — Set to 1000 characters to create larger chunks for table descriptions. Increase this to produce fewer, longer chunks for better performance, or decrease for more granular search precision.
  • split_overlap — Set to 200 characters for good context continuity between chunks. Decrease this to reduce content duplication and save storage space.

Configure chunk overlap and merging

chunkOverlap

Controls the number of characters that overlap between consecutive text chunks. Higher overlap preserves context across chunk boundaries but increases the embedding index size.

  • Default: 200
JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { chunkOverlap: 100 } });

doMergeChunksWithSizeLimit

When set, Mew3 merges adjacent small chunks into larger ones up to the specified character limit. This reduces fragmentation when documents contain many short paragraphs or list items. Enabled by default with a value of 1000. To change this attribute, see the code below.

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { doMergeChunksWithSizeLimit: 750 } });

Update pipeline configurations

To configure Mew3, always follow this sequence:

  1. Copy the configuration block for the section you want to change (for example, text_parser_spec, image_parser_spec, table_parser_spec, or layout_parser_spec).
  2. Edit only the parameters you want to change.
  3. When applying changes, always pass all updated specs together using setConfigValue(). Each update must include every previously customized spec to avoid losing earlier changes. This ensures the configuration stays consistent and prevents accidental resets.
  4. After updating, restart the chunker engines to apply the new configuration.
  5. You may need to restart your kernel for changes to be returned as modified.

Step 1: Copy the parser specs to your notebook

Copy the parser spec blocks you want to change from the sections below into your Jupyter notebook.

Python
text_parser_spec = {
    # (Insert your custom text_parser pipeline here)
}

Step 2: Edit only the values you want to change

Modify parameters in the copied specs as needed for your environment or use case.

Go to the How to change your configs section in the left-hand side bar under Configurations for a detailed explanation of how to change specific mew3 configurations. For each spec, you must specify the file format keys first. A file format maps to a list of dictionaries, where each dictionary represents a processing step. During parsing, this list of steps is run sequentially. Those steps are modules. Select All available modules in the left-hand sidebar for the corresponding module to specify.

Changing configs in mew3

Step 3: Apply all changes together

Apply your changes by passing all updated specs at once to mergeSettings().

Python
c3.Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    'parserSettings': {
        'pipelineSpec': {
            'text_parser_spec': text_parser_spec
        }
    }
})

Step 4: Add new or updated specs in future updates

When you update another spec later (for example, layout_parser_spec), include all specs you want to set in the same call:

Python
c3.Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    'parserSettings': {
        'pipelineSpec': {
            'text_parser_spec': text_parser_spec,
            'image_parser_spec': image_parser_spec,
            'layout_parser_spec': layout_parser_spec
        }
    }
})

This method ensures the configuration remains consistent and avoids accidental resets.

Updating what content Mew3 parses

Mew3 supports selective parsing of text, tables, and images. By default, it processes all content types. To skip specific types, update the following configuration fields:

JSON
"skipImageParsing": false,
"skipTableParsing": false,
"skipTextParsing": false

Set a field to true to skip that content type:

  • skipImageParsing: Skips image extraction.
  • skipTableParsing: Skips table parsing.
  • skipTextParsing: Skips text extraction.

For example, to extract only tables and images:

JSON
"skipTextParsing": true,
"skipTableParsing": false,
"skipImageParsing": false

Adjust these settings to reduce processing time or focus on specific content types that are more relevant for your use case.

Steps to update

Run the following in the Application C3 AI Console, adjusting the values for your use case:

Text
```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { skipTextParsing: true, skipTableParsing: false, skipImageParsing: false } });
```

Parsing Presets

Mew3 provides built-in presets that let you select a parsing strategy suited to your document type without manually constructing a pipeline spec. Each preset configures the pipelineSpec and documentSpec together as a starting point. Any individual fields you set explicitly, such as llmClient or chunkSize, always take precedence over the preset values.

PresetBest forCPU costToken cost
defaultMost documents — balanced extraction with table and image verbalization4/52/5
table_focusedTable-heavy documents requiring full CSV and HTML extraction4/55/5
text_heavyText-dominant PDFs — uses PLUMBER for text instead of OCR2/52/5
pure_vlmEnvironments without Docling — LLM-only layout parsing1/55/5
no_verbalizationSpeed-first processing — skips all LLM calls for images and tables4/50/5

To apply a preset, run the following in the Application C3 AI Console:

Text
```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'text_heavy' } });
```

To inspect what the preset resolves to without persisting it, use Genai.ParserSettings#getParserSettings:

Python
settings = c3.Genai.ParserSettings.make({'parsingPreset': 'text_heavy'})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)
print(resolved['pipelineSpec'])
print(resolved['documentSpec'])

Example: Table-heavy documents

Use the table_focused preset for documents with complex tables, such as financial reports or manufacturing specifications. This preset runs full CSV and HTML extraction alongside LLM verbalization, giving the most complete table representation at a higher token cost.

Text
```js
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'table_focused' } });
```

Example: Simple text and PDF documents

Use the text_heavy preset for documents that are primarily text with minimal tables or images. This preset uses PLUMBER for text extraction instead of OCR, which is faster and reduces both CPU and token usage. On CPU-only nodes, combine it with disableOcrOnCpu to avoid OCR overhead entirely.

Text
```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'text_heavy' }});
```

Combining presets with manual overrides

You can use a preset as a baseline and override individual sub-specs without losing the rest of the preset. Preset merging operates per key: your value takes precedence if non-empty, otherwise the preset fills it in. This applies to both pipelineSpec and documentSpec.

Use Genai.ParserSettings#resolvePreset to see the merged result of a preset and your overrides:

Python
# Use pure_vlm preset but use a minimal table parser
settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'pure_vlm',
    'pipelineSpec': {
        'table_parser_spec': {
            'pdf': [
                {'kind': 'FILTER', 'kwargs': {'column': 'markdown_class', 'include': ['TABLE']}}
            ]
        }
        # image_parser_spec, text_parser_spec, etc. are inherited from pure_vlm
    }
})
resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

The same per-key rule applies to documentSpec. To override only text cropping while inheriting the rest from text_heavy:

Python
settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'text_heavy',
    'documentSpec': {'cropping': {'text': 'OCR'}}
    # image and table cropping are inherited from text_heavy
})
resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

Performance Tuning

Benchmarks by document type

Based on profiling across representative document sets on a GPU node:

Document typeGPU (per page)
PSR / mixed documents~8.3 seconds
Table-heavy documents~11.8 seconds

The table-heavy path is slower because table processing involves multiple sequential steps: layout detection, CSV extraction, and LLM verbalization. CPU-only nodes run approximately 25% slower than the GPU figures above.

The top bottlenecks on table-heavy documents, from profiling data, are:

Text
1) CSV_EXTRACT.DOCLING    26.8%
2) LAYOUT_MODEL.DOCLING   15.6%
3) CSV_EXTRACT.OPENAI     14.5%
4) TABLE_VERBAL.OPENAI    13.0%
5) TEXT_EXTRACTOR.DEFAULT  8.3%

Quick wins

The most impactful single change is choosing the right preset for your document type. Beyond that:

  • Lower llmTokenLimit — The default is 4096 tokens. Table verbalizations average around 350 tokens and image verbalizations around 200. Reducing this limit cuts LLM call latency by approximately 10% on table-heavy documents. Only use this when CSV_EXTRACT.DOCLING is not in your pipeline, because the same token limit applies to CSV extraction and a lower value may truncate long tables.

    JavaScript
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { llmTokenLimit: 512 } });
  • Disable OCR on CPU — Setting disableOcrOnCpu skips OCR when no GPU is available and falls back to available text extraction methods. This avoids the significant CPU overhead of running OCR and is recommended for CPU-only environments.

    JavaScript
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { disableOcrOnCpu: true } });
  • Tune LLM thread countmaxTaskThreads (default 13) controls how many LLM calls run in parallel. Increase this value if your LLM endpoint supports higher concurrency. Decrease it if you encounter rate limit errors.

    JavaScript
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { maxTaskThreads: 20 } });
  • Increase Docling parallelism with maxTaskProcesses — Controls how many parallel processes Docling uses internally. The default is 1 (no parallelism). Increasing this can speed up layout parsing on multi-core machines but raises memory consumption. Set to a negative number to let Docling manage the process count automatically.

    JavaScript
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { maxTaskProcesses: 4 } });
  • Tune PDF page batch size with pdfPageChunksSize — Controls how many pages are grouped into a single parallel task when processing PDFs. The default is 8. Increase to reduce task-dispatch overhead on large documents, or decrease for more fine-grained parallelism on documents with heavy per-page processing.

    JavaScript
    Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { pdfPageChunksSize: 4 } });

GPU vs CPU

GPU is strongly recommended when using the default Docling-based layout parsing. Layout parsing benchmarks show approximately 3.1 pages per second on an L40S GPU. T4 GPUs, which are common in standard cluster environments, achieve approximately 0.56 pages per second on the same step — still significantly faster than CPU. For cost-sensitive or CPU-only environments, the pure_vlm preset eliminates Docling entirely by using LLM-only layout parsing, which reduces CPU load at the cost of higher token usage.

On CPU-only nodes, the most reliable way to improve throughput is to scale the number of task nodes so that chunking is distributed across them. For guidance on scaling, see Troubleshoot Multimodal Parsing.

Image and Table Verbalization

Mew3 calls an LLM to generate text descriptions (verbalizations) of images and tables. The following fields control image preparation and verbalization behavior.

imageExtractionDpi

Sets the resolution used when extracting images from PDFs. Higher values produce sharper images, which improves LLM verbalization accuracy for diagrams and charts, but increases processing time and storage.

DPIQuality
72Low
144Standard (default)
170Good
300High
JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { imageExtractionDpi: 300 } });

llmMaxImageBytes

Sets the maximum byte size of an image sent to the LLM for verbalization. Images larger than this threshold are automatically downsampled before being passed to the model. The default is 7000000 (7 MB). Reduce this value if your LLM model enforces a lower image size limit.

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { llmMaxImageBytes: 3000000 } });

imageVerbalizationPrompt and tableVerbalizationPrompt

Control the prompts used for LLM-based verbalization of images and tables. The defaults reference the named prompts image_verbalization_mew3_prompt and table_verbalization_mew3_prompt. Override these to customize the verbalization instructions for your document domain.

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        imageVerbalizationPrompt: { id: 'my_custom_image_prompt' },
        tableVerbalizationPrompt: { id: 'my_custom_table_prompt' }
    }
});

Inspecting the Active Parser Configuration

Before customizing the pipeline or triggering a chunking run, you can inspect the fully resolved configuration that Mew3 will use at runtime. This is useful for verifying that your preset and any manual overrides are applied correctly.

Use Genai.ParserSettings#getParserSettings in a JupyterLab notebook with the py-mew3 runtime to retrieve the complete, effective configuration:

Python
settings = c3.Genai.ParserSettings.make({'parsingPreset': 'default'})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)

The returned object has two keys:

  • pipelineSpec — The fully hydrated pipeline configuration, organized by component (LayoutParserCfg, TextParserCfg, ImageParserCfg, TableParserCfg, TextChunkerCfg, ImageChunkerCfg, TableChunkerCfg) and then by file type (pdf, docx, and so on). All default values are expanded.
  • documentSpec — The effective document-level options such as cropping strategies, with all preset defaults resolved.

Example output:

JSON
{
  "pipelineSpec": {
    "LayoutParserCfg": {
      "pdf": {
        "pipeline": [
          "DOCLING_OR_VLM",
          {
            "kind": "FILTER_DOC",
            "kwargs": {
              "bbox_class_threshold": {
                "Caption": 0.2,
                "Picture": 0.5,
                "Table": 0.2,
                "Text": 0.2,
                "..."
              }
            }
          },
          "FILTER_EMPTY_TEXT_BBOXES",
          "UNION_BOX",
          "NO_OVERLAP",
          "ORDER_BY_COLUMN(grid_step=150)",
          "TO_MARKDOWN_CLASS",
          "MERGE_LIST",
          "..."
        ]
      },
      "docx": { "..." },
      "..."
    },
    "TextParserCfg": { "..." },
    "TableParserCfg": { "..." },
    "ImageParserCfg": { "..." },
    "TextChunkerCfg": { "..." },
    "TableChunkerCfg": { "..." },
    "ImageChunkerCfg": { "..." }
  },
  "documentSpec": {
    "image_crop": "AUTO",
    "text_crop": "AUTO",
    "table_crop": "AUTO",
    "link_crop": "AUTO",
    "page_cnt": "AUTO",
    "grid_gen": "AUTO",
    "toc": "AUTO"
  }
}

Filter by file type

To inspect the configuration for a specific file type, pass the file extension as the second argument:

Python
resolved_pdf = c3.Genai.ParserSettings.getParserSettings(settings, 'pdf')
resolved_docx = c3.Genai.ParserSettings.getParserSettings(settings, 'docx')

When filtered, each component in pipelineSpec returns the settings for that file type directly rather than a per-file-type map. Passing an unsupported extension raises a ValueError listing the supported types.

Verify preset and override resolution

If your configuration uses a parsingPreset with manual overrides, getParserSettings() shows the exact merged result. This is the recommended way to confirm your overrides took effect before running a chunking job:

Python
settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'text_heavy',
    'documentSpec': {'cropping': {'text': 'OCR'}}
})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)

# text_crop is 'OCR'           (user override wins)
# table_crop is 'PLUMBER_TEXT' (inherited from text_heavy preset)
# image_crop is 'AUTO'         (inherited from text_heavy preset)
print(resolved['documentSpec'])

Resolve preset without full inspection

If you need the resolved Genai.ParserSettings object itself rather than the introspection JSON, use Genai.ParserSettings#resolvePreset directly:

Python
resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

This returns the same type as the input with the preset merged in. The chunker calls this automatically at runtime; use it manually when you need to pass the merged settings object to another API.

Advanced Configuration

doHierarchyExtraction

When enabled, Mew3 extracts the hierarchical document structure during parsing, capturing parent-child relationships between headings and content blocks. Disabled by default.

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { doHierarchyExtraction: true } });

disableComponentOnFailure

When set to true, if a pipeline component fails (for example, a parser step throws an exception), Mew3 disables that component and continues processing rather than stopping entirely. This is useful in production environments where partial results are preferable to a complete processing failure. Disabled by default.

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { disableComponentOnFailure: true } });

airgapCacheLocalToRemote

Maps local file system paths to remote storage locations. On startup, Mew3 copies files from the specified remote locations to the corresponding local paths. This allows airgapped environments (without external internet access) to cache models and data from an internal object store such as GCS or S3.

The value is a JSON object mapping local paths to remote URIs:

JavaScript
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        airgapCacheLocalToRemote: {
            '/home/c3/nltk_data': 'gcs://my-bucket/nltk_data/'
        }
    }
});

See Also

Was this page helpful?