Multimodal Parsing

When you process different files, you often need to extract more than just text. Key information may appear in tables, images, or page layouts. To handle this, multimodal parsing combines layout detection with content segmentation. A layout parser first analyzes the visual structure of each page. It then uses bounding boxes to identify and separate different types of content. This approach enables accurate extraction of diverse content types and supports advanced indexing, retrieval, and search workflows.

Introduction to Mew3

Mew3 (Genai.SourceFile.Chunker.Mew3) is a chunking component that supports multimodal parsing. It uses the layout structure of each document to segment it into structured, meaningful units and to extract text, images, and tables. Mew3 improves the quality and consistency of parsing by integrating layout-based segmentation directly into the parsing pipeline.

By default, Mew3 uses the Docling parser to perform layout analysis. Docling is an open-source toolkit that supports various document formats, including PDF, DOCX, HTML, and images. It uses models such as DocLayNet and TableFormer to detect visual structures and identify tables. This enables Mew3 to obtain accurate bounding boxes and use them to segment content for indexing and search.

Mew3 operates as a modular pipeline. It uses producers to extract layout elements and executors to apply additional processing. This design makes Mew3 easy to configure, debug, and extend for different use cases.

Setting up Mew3

Ensure the environment and application are running with valid large language model credentials. To set the credentials, see Application Initialization.

When you run the quick start with the required model, the platform automatically enables Genai.SourceFile.Chunker.Mew3 as the default chunking component for unstructured files.

To confirm that the application is using Mew3 as the chunking component for unstructured files, run the following command in the Application C3 AI Console:

JavaScript

var chunkerConfig = Genai.SourceFile.Chunker.UniversalChunker.Config.forConfigKey('default');
var fileExtToChunkerSpecMap = C3.Map.fromJson(chunkerConfig.fileExtToChunkerSpecMap);
fileExtToChunkerSpecMap.get('.pdf').get('chunker');

The output should display Genai.SourceFile.Chunker.Mew3, confirming that the configured chunking component is active.

If the Mew3 chunking component is not set up by default, you can follow the steps below.

Enable Mew3

Run the following in the Application C3 AI Console to enable Genai.SourceFile.Chunker.Mew3 for multimodal parsing:

JavaScript

Genai.QuickStart.enableMew3();

This command sets Mew3 as the default chunking component for unstructured files and applies the default settings it needs to operate.

Build Mew3 Documentation

Run the following in the Application C3 AI Console to build the documentation for the Mew3 library. This command will take a few minutes to complete.

JavaScript

Genai.SourceFile.Chunker.Mew3.showDocs();

Node Pool Configuration

Mew3 automatically uses a GPU if one is available in your cluster. Run the following command in the Application C3 AI Console to verify GPU access on a task node:

JavaScript

C3.app().nodePool('task').config();

Review the value under the hardwareProfile.gpu field. The system sets this field to 0 when no GPU is configured on the task node.

If a GPU is not available, Mew3 automatically uses the CPU. Chunking is ~25% slower when running on a CPU, but you can increase performance by scaling the number of task nodes. For guidance on scaling task nodes and distributing chunking across nodes, refer to the Environment Sizing Guide.

Configuring Mew3

Mew3 supports more than 90 configuration options. However, the defaults should work well for most use cases, and you should only need to modify a few key settings to control model selection, update behavior, or resource usage.

Perform all configuration steps in this section using JupyterLab. To open a notebook, hover over the application card in C3 AI Studio and select Jupyter. Ensure the notebook runs with the py-mew3 runtime environment.

Access the Configuration

The following command returns the current configuration object:

Python

settings = c3.Genai.ParserSettings.make({'parsingPreset': 'default'})
c3.Genai.ParserSettings.getParserSettings(settings)

The configuration is organized into three main areas:

pipelineSpec: Controls all parser and chunker behavior for text, images, tables, and layout.
documentSpec: Specifies document processing and cropping strategies.
skipParsing: Flags that enable or skip text, table, or image parsing.

Note: As of version 8.11, Genai.SourceFile.Chunker.Mew3.Config is deprecated in favor of Genai.ParserSettings. The examples below use Genai.ParserSettings where possible to reflect the recommended long-term API.

The following sections describe how to update specific parameters to customize Mew3 behavior for your environment. Each parameter includes examples and expected values.

Updating pipeline specs

Before you update the pipeline spec, run Genai.SourceFile.Chunker.Mew3.showDocs() in the Application C3 AI Console to build the Mew3 API documentation, then read the How to change your configs section in the left-hand navigation.

Understanding the pipeline spec structure is crucial for customizing how Mew3 processes different document types. The pipelineSpec field on Genai.ParserSettings defines the complete parsing workflow, organized hierarchically:

Pipeline Specification → Component Specification (layout parser, image parser, text parser, table parser) → File Type (pdf, docx, xlsx) → Processing Steps (modules like DOCLING, FILTER_DOC, etc.)

Pipeline Specification
- Component Specification
  - Text Parser
  - Layout Parser
  - Image Parser
  - Table Parser
    - File Type
      - PDF
      - DOCX
      - XLSX
        - Processing Steps
          - DOCLING
          - FILTER_DOC
          - ...

Each component specification maps file formats to sequential processing steps. For example, the layout parser processes PDFs through steps like "DOCLING" for structure detection, "FILTER_DOC" for confidence filtering, "UNION_BOX" for merging overlapping elements, and "ORDER_BY_COLUMN" for reading order. Processing steps can be defined as simple strings like "UNION_BOX" or as dictionaries with "kind" (module name) and "kwargs" (parameters) for complex configurations.
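
The two step forms described above can be brought into one shape before you inspect or edit a spec. The following is a minimal sketch, not Mew3's own parsing logic: the helper name normalize_step is hypothetical, and it assumes that call-style strings such as "ORDER_BY_COLUMN(grid_step=150)" use Python-literal keyword arguments.

```python
import ast


def normalize_step(step):
    """Return (kind, kwargs) for a step written as a string or a dict.

    Hypothetical helper for illustration; Mew3 may parse steps differently.
    """
    if isinstance(step, dict):
        return step["kind"], dict(step.get("kwargs", {}))
    if "(" not in step:
        return step, {}  # plain module name such as "UNION_BOX"
    # Call-style string: parse it as a Python call expression and read
    # the keyword arguments as literals.
    call = ast.parse(step, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError(f"Unrecognized step: {step!r}")
    kind = step.split("(", 1)[0].strip()
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return kind, kwargs


steps = [
    "UNION_BOX",
    "ORDER_BY_COLUMN(grid_step=150)",
    {"kind": "FILTER_DOC", "kwargs": {"bbox_class_threshold": {"Table": 0.2}}},
]
for step in steps:
    print(normalize_step(step))
```

Normalizing steps this way makes it easier to compare a customized spec against the resolved configuration, because every step then has the same (kind, kwargs) shape.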

Layout Parser spec

This pipeline determines how the system detects and processes layout elements on each page, including headers, text, tables, and lists, along with their positions. It uses DOCLING under the hood.

Set bounding box thresholds

Mew3 uses a confidence threshold (through the FILTER_DOC step) to decide which layout elements to keep from the layout parser. By default, FILTER_DOC uses built-in thresholds. To customize these thresholds, replace the "FILTER_DOC" string with a dictionary that specifies bbox_class_threshold:

Python

{
    "kind": "FILTER_DOC",
    "kwargs": {
        "bbox_class_threshold": {
            "Caption": 0.2,
            "Footnote": 0.2,
            "Formula": 0.2,
            "List-item": 0.2,
            "Page-footer": 0.2,
            "Page-header": 0.3,
            "Picture": 0.5,
            "Section-header": 0.2,
            "Table": 0.2,
            "Text": 0.2,
            "Title": 0.2
        }
    }
}

These values set the minimum prediction confidence required to keep a bounding box for each layout class. You can adjust them if you observe unwanted content removal or retention:

Lower a value to keep more boxes of that type (for example, Picture: 0.5 → 0.4).
Raise a value to remove noisy or low-confidence elements (for example, Page-footer: 0.2 → 0.5).
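
To make the effect of raising or lowering a threshold concrete, here is a small sketch of per-class confidence filtering. It is an illustration only and not the FILTER_DOC implementation: the function name, the box fields label and confidence, and the use of a greater-than-or-equal comparison are all assumptions.

```python
def filter_boxes(boxes, thresholds, default_threshold=0.2):
    """Keep boxes whose confidence meets the threshold for their class.

    Hypothetical stand-in for confidence filtering; field names are assumed.
    """
    return [
        box for box in boxes
        if box["confidence"] >= thresholds.get(box["label"], default_threshold)
    ]


boxes = [
    {"label": "Picture", "confidence": 0.45},
    {"label": "Table", "confidence": 0.25},
    {"label": "Page-footer", "confidence": 0.30},
]

# Thresholds as listed above, then a tuned variant that keeps more
# pictures and removes low-confidence page footers.
listed = {"Picture": 0.5, "Table": 0.2, "Page-footer": 0.2}
tuned = {"Picture": 0.4, "Table": 0.2, "Page-footer": 0.5}

print([b["label"] for b in filter_boxes(boxes, listed)])
print([b["label"] for b in filter_boxes(boxes, tuned)])
```

With the listed thresholds the picture is dropped and the footer is kept. With the tuned thresholds the picture is kept and the footer is dropped.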

Python


layout_parser_spec = {
    "pdf": [
        {
            "kind": "DOCLING",
            "kwargs": {}
        },
        "FILTER_DOC",
        "FILTER_EMPTY_TEXT_BBOXES",
        "UNION_BOX",  # Union boxes of the same class
        "NO_OVERLAP",  # Remove overlapping boxes of the same class
        "ORDER_BY_COLUMN(grid_step=150)",  # Specify column spacing
        "TO_MARKDOWN_CLASS",  # Convert raw layout classifications to standardized markdown types
        "MERGE_LIST",  # Combine consecutive list items into unified list blocks
        "MERGE_MANY_THINGS_BY_COLUMN",  # Group related content elements within the same column layout
        "RESOLVE_CLASS_CONFLICT(iou_threshold=0.85)",  # Merge two boxes of different classes if their IoU is large
        "REMOVE_CONTAINED_BOXES(containment_threshold=0.9)",  # Remove small boxes contained in a larger box when the overlapping area is large
        "MERGE_TEXT_UNDER_HEADER",  # For a given HEADER, merge all PLAIN boxes below it. Operates per page.
        "TREAT_IMAGE_FILLED_WITH_TEXT_AS_TEXT(threshold=0.7)",
        "GROW_BOX(markdown_class=['PLAIN', 'LIST', 'HEADER', 'CAPTION', 'TABLE', 'FIGURE'], rel_to='pixel', pixel_upper=2, pixel_lower=2, pixel_left=2, pixel_right=2)",  # Grow each listed box type by 2 pixels on every side
        "RANGE(column='global_idx')"
    ]
}
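
Replacing the "FILTER_DOC" string with a dictionary can be done programmatically rather than by hand. The sketch below assumes a shortened example spec and a hypothetical helper named replace_step. It only manipulates plain Python data and does not call any Mew3 API. Whether classes omitted from bbox_class_threshold keep their built-in thresholds is not confirmed here, so the example lists every class.

```python
import copy


def replace_step(spec, file_type, kind, new_step):
    """Return a copy of spec with the step named `kind` replaced.

    Hypothetical helper; works on string steps and dictionary steps.
    """
    updated = copy.deepcopy(spec)
    steps = updated[file_type]
    for i, step in enumerate(steps):
        name = step["kind"] if isinstance(step, dict) else step.split("(", 1)[0]
        if name == kind:
            steps[i] = new_step
            return updated
    raise KeyError(f"{kind} not found in the {file_type} pipeline")


# Shortened spec for illustration only.
example_spec = {"pdf": [{"kind": "DOCLING", "kwargs": {}}, "FILTER_DOC", "UNION_BOX"]}

custom_filter = {
    "kind": "FILTER_DOC",
    "kwargs": {
        "bbox_class_threshold": {
            "Caption": 0.2, "Footnote": 0.2, "Formula": 0.2, "List-item": 0.2,
            "Page-footer": 0.5, "Page-header": 0.3, "Picture": 0.4,
            "Section-header": 0.2, "Table": 0.2, "Text": 0.2, "Title": 0.2,
        }
    },
}

custom_spec = replace_step(example_spec, "pdf", "FILTER_DOC", custom_filter)
print(custom_spec["pdf"][1]["kwargs"]["bbox_class_threshold"]["Picture"])
```

Working on a deep copy leaves the original spec untouched, which makes it easy to compare the customized pipeline against the starting point.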

Use top-level threshold fields

As an alternative to modifying the FILTER_DOC step in the pipeline spec, Mew3 provides three top-level fields that set detection thresholds directly:

Field	Default	Applies to
`layoutDetectionThresholdPicture`	`0.5`	Picture elements
`layoutDetectionThresholdTable`	`0.2`	Table elements
`layoutDetectionThresholdText`	`0.2`	Body text only (not headers, captions, or footnotes)

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        layoutDetectionThresholdPicture: 0.6,
        layoutDetectionThresholdTable: 0.3
    }
});

layoutDetectionThresholdText applies only to detected body text. To tune thresholds for other text-based layout elements such as section headers, captions, or footnotes, configure the FILTER_DOC step in the pipelineSpec directly as described above.

Text parser spec

This pipeline parses text content such as plain text, headers, lists, and captions from the files. It filters relevant blocks and assigns structural context, including header hierarchy.

Python

text_parser_spec = {
    "pdf": [
        {
            "kind": "FILTER",  # only keep text
            "kwargs": {
                "column": "markdown_class",
                "include": ["HEADER", "PLAIN", "LIST", "CAPTION"],
            },
        },
        "TEXT_EXTRACTOR.DEFAULT",  # default: use cropping method
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # get parent header for text by searching for previous bboxes
            "kwargs": {
                "extract_header_for_type": ["PLAIN", "LIST"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": True
            },
        },
        "TAG(column='parser', tag='TEXT')",  # add 'parser' field to bbox
    ]
}

This setup extracts useful text from the document, including headings, paragraphs, lists, and captions. It determines the appropriate heading for each part, removes elements that are not text, and labels the result so later steps recognize it as output from a text parser.

You usually do not need to change this unless you are creating a custom setup.
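
For intuition about the parent-header step, the following sketch shows one way a preceding header could be looked up for a text block. It is an assumption-based illustration, not the GET_PARENT_HEADER_CACHED implementation: the function name and the box fields markdown_class, text, and page are hypothetical.

```python
def find_parent_header(boxes, index, header_types=("HEADER",), cross_page=True):
    """Return the text of the nearest preceding header box, or None.

    Hypothetical sketch; boxes are assumed to be in reading order.
    """
    current_page = boxes[index]["page"]
    for box in reversed(boxes[:index]):
        if not cross_page and box["page"] != current_page:
            break  # stop at the page boundary when cross-page search is off
        if box["markdown_class"] in header_types:
            return box["text"]
    return None


boxes = [
    {"markdown_class": "HEADER", "text": "Introduction", "page": 1},
    {"markdown_class": "PLAIN", "text": "First paragraph.", "page": 1},
    {"markdown_class": "PLAIN", "text": "Continued on the next page.", "page": 2},
]

print(find_parent_header(boxes, 2))
print(find_parent_header(boxes, 2, cross_page=False))
```

In this sketch, the paragraph on page 2 inherits the header from page 1 only when cross-page search is enabled.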

Image parser spec

This pipeline detects and processes images in documents. It extracts each image, collects nearby text such as captions, and uses a language model to generate a description of the image. The image and its description are saved for further use.

Python

image_parser_spec = {
    "pdf": [
        {
            "kind": "GET_SURROUNDING_TEXT_CACHED",  # extract text around image using adjacent bboxes
            "kwargs": {
                "current_bbox_typ": "FIGURE",
                "extract_bbox_with_offsets": [-3, -2, -1, 1, 2, 3],
                "extract_bbox_with_type": ["CAPTION", "PLAIN", "HEADER", "LIST"],
                "enable_cross_page_search": False
            }
        },
        {
            "kind": "CAPTION_FINDER.LOCALITY_CACHED",  # search for caption in adjacent bboxes
            "kwargs": {
                "current_bbox_typ": ["FIGURE"],
                "search_bbox_with_offsets": [-1, 1, -2, 2, -3, 3],  # represents the relative index to search for caption in sequence. Elements at front have higher priorities.
                "target_bbox_type": "CAPTION",
                "bbox_types_to_search_for": ["CAPTION", "PLAIN", "HEADER"],
                "enable_cross_page_search": False
            }
        },
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # extract parent header information for context
            "kwargs": {
                "extract_header_for_type": ["FIGURE"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": True
            }
        },
        {
            "kind": "FILTER",  # only keep FIGURE
            "kwargs": {
                "column": "markdown_class",
                "include": ["FIGURE"]
            }
        },
        "IMAGE_EXTRACT",
        "IMAGE_VERBAL.LLM",  # disable verbalization: "IMAGE_VERBAL.NULL"
        "SAVE_IMAGE_C3",  # save image to FileSystem
        {
            "kind": "TAG",  # add 'parser' field to bbox
            "kwargs": {
                "column": "parser",
                "tag": "IMAGE"
            }
        }
    ]
}
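
The search_bbox_with_offsets list in the spec above is described as a priority order: offsets earlier in the list are tried first. The sketch below illustrates that idea on plain Python data. It is not the CAPTION_FINDER.LOCALITY_CACHED implementation, and the function name and the markdown_class and text fields are assumptions.

```python
def find_caption(boxes, index, offsets=(-1, 1, -2, 2, -3, 3), target="CAPTION"):
    """Return the first box of type `target` found at index + offset.

    Offsets are tried in the order given, so earlier entries have priority.
    Hypothetical sketch; boxes are assumed to be in reading order.
    """
    for offset in offsets:
        neighbor = index + offset
        if 0 <= neighbor < len(boxes) and boxes[neighbor]["markdown_class"] == target:
            return boxes[neighbor]
    return None


boxes = [
    {"markdown_class": "CAPTION", "text": "Figure 1: Overview"},
    {"markdown_class": "PLAIN", "text": "Some body text."},
    {"markdown_class": "FIGURE", "text": ""},
    {"markdown_class": "CAPTION", "text": "Figure 2: Details"},
]

match = find_caption(boxes, 2)
print(match["text"])
```

Here the caption directly after the figure (offset 1) is chosen over the caption two positions before it (offset -2), because offset 1 appears earlier in the list.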

Table parser spec

This pipeline finds tables in documents and uses a language model to summarize the contents of each table. It also captures surrounding text to provide context for each table and saves the output as a table image.

Python

table_parser_spec = {
    "pdf": [
        {
            "kind": "GET_SURROUNDING_TEXT_CACHED",  # extract text around table using adjacent bboxes
            "kwargs": {
                "current_bbox_typ": "TABLE",
                "extract_bbox_with_offsets": [-3, -2, -1, 1, 2, 3],
                "extract_bbox_with_type": ["CAPTION", "PLAIN", "HEADER", "LIST"],
                "enable_cross_page_search": False
            }
        },
        {
            "kind": "CAPTION_FINDER.LOCALITY_CACHED",  # search for caption in adjacent bboxes
            "kwargs": {
                "current_bbox_typ": ["TABLE"],
                "search_bbox_with_offsets": [-1, 1, -2, 2, -3, 3],  # represents the relative index to search for caption in sequence. Elements at front have higher priorities.
                "target_bbox_type": "HEADER",
                "bbox_types_to_search_for": ["HEADER", "CAPTION", "PLAIN"],
                "enable_cross_page_search": False
            }
        },
        {
            "kind": "GET_PARENT_HEADER_CACHED",  # extract parent header information for context
            "kwargs": {
                "extract_header_for_type": ["TABLE"],
                "header_parent_types": ["HEADER"],
                "enable_cross_page_search": True
            }
        },
        "RENAME(source='caption', target='title')",  # rename field on bbox
        {
            "kind": "FILTER",  # only keep TABLE
            "kwargs": {
                "column": "markdown_class",
                "include": ["TABLE"]
            }
        },
        "TABLE_IMAGE_EXTRACT",  # crop image from page
        "TABLE_VERBAL.LLM",  # non-LLM options: "TABLE_VERBAL.TITLE_AND_CONTENT", "TABLE_VERBAL.NULL"
        "SAVE_TABLE_IMAGE_C3",  # save table image to FileSystem
        "TAG(column='parser', tag='TABLE')"  # add 'parser' field to bbox
    ]
}

Text chunker spec

This pipeline splits long text blocks into smaller chunks so they can be indexed, searched, or used by language models more effectively.

Python

text_chunker_spec = {
    "pdf": [
        {
            "kind": "LONG_CHAIN",  # splits long text into overlapping chunks
            "kwargs": {
                "column_to_chunk": "text",
                "split_overlap": 200  # overlap between consecutive chunks in characters
            }
        }
    ]
}

split_overlap — Controls character overlap between chunks. Higher values (like 200) provide more context continuity but increase duplication. Lower values reduce duplication but may lose context at chunk boundaries.

Table chunker spec

This pipeline splits the LLM-generated description of a table into smaller chunks. This helps improve retrieval performance and allows language models to process long table summaries more efficiently.

Python

table_chunker_spec = {
    "pdf":[
        {
            "kind": "LONG_CHAIN",  # splits table verbalization into overlapping chunks
            "kwargs": {
                "column_to_chunk": "verbalization",
                "split_overlap": 200,  # overlap between consecutive chunks in characters
                "split_length": 1000   # maximum chunk size in characters
            }
        }
    ]
}

split_length — Set to 1000 characters to create larger chunks for table descriptions. Increase this to produce fewer, longer chunks for better performance, or decrease for more granular search precision.
split_overlap — Set to 200 characters for good context continuity between chunks. Decrease this to reduce content duplication and save storage space.
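
The trade-off between split_length and split_overlap can be seen with a simple sliding-window splitter. This is a minimal sketch of character-based chunking with overlap, not the LONG_CHAIN implementation, which may split on different boundaries. The function name is hypothetical.

```python
def chunk_with_overlap(text, split_length=1000, split_overlap=200):
    """Split text into chunks of at most split_length characters.

    Consecutive chunks share split_overlap characters. Hypothetical sketch.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + split_length])
        if start + split_length >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij"
print(chunk_with_overlap(sample, split_length=4, split_overlap=2))
print(len(chunk_with_overlap("x" * 5000)))                    # 1000 / 200
print(len(chunk_with_overlap("x" * 5000, split_overlap=0)))   # no overlap
```

A larger overlap produces more chunks for the same text, which is the duplication cost described above.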

Configure chunk overlap and merging

`chunkOverlap`

Controls the number of characters that overlap between consecutive text chunks. Higher overlap preserves context across chunk boundaries but increases the embedding index size.

Default: 200

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { chunkOverlap: 100 } });

`doMergeChunksWithSizeLimit`

When set, Mew3 merges adjacent small chunks into larger ones up to the specified character limit. This reduces fragmentation when documents contain many short paragraphs or list items. Enabled by default with a value of 1000. To change this attribute, see the code below.

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { doMergeChunksWithSizeLimit: 750 } });
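
The merging behavior can be pictured as a greedy pass over adjacent chunks. The sketch below is an assumption about the general technique rather than Mew3's actual merge logic. The function name and the separator are hypothetical.

```python
def merge_small_chunks(chunks, size_limit=1000, separator="\n"):
    """Greedily merge adjacent chunks while the result fits in size_limit.

    Hypothetical sketch of size-limited merging.
    """
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) + len(separator) + len(chunk) <= size_limit:
            merged[-1] = merged[-1] + separator + chunk
        else:
            merged.append(chunk)
    return merged


short_items = ["aa", "bb", "cccc", "d"]
print(merge_small_chunks(short_items, size_limit=5, separator=" "))
```

Only adjacent chunks are combined, so the reading order of the document is preserved.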

Update pipeline configurations

To configure Mew3, always follow this sequence:

Copy the configuration block for the section you want to change (for example, text_parser_spec, image_parser_spec, table_parser_spec, or layout_parser_spec).
Edit only the parameters you want to change.
When applying changes, always pass all updated specs together using mergeSettings(). Each update must include every previously customized spec so that earlier changes are not lost, which keeps the configuration consistent and prevents accidental resets.
After updating, restart the chunker engines to apply the new configuration.
You may need to restart your notebook kernel before the modified configuration is returned.

Step 1: Copy the parser specs to your notebook

Copy the parser spec blocks you want to change from the sections above into your Jupyter notebook.

Python

text_parser_spec = {
    # (Insert your custom text_parser pipeline here)
}

Step 2: Edit only the values you want to change

Modify parameters in the copied specs as needed for your environment or use case.

For a detailed explanation of how to change specific Mew3 configurations, go to the How to change your configs section under Configurations in the left-hand sidebar. For each spec, specify the file format keys first. Each file format maps to a list of processing steps, and each step is a module. During parsing, the steps in the list run sequentially. To find the module to specify for a step, select All available modules in the left-hand sidebar.

Step 3: Apply all changes together

Apply your changes by passing all updated specs at once to mergeSettings().

Python

c3.Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    'parserSettings': {
        'pipelineSpec': {
            'text_parser_spec': text_parser_spec
        }
    }
})

Step 4: Add new or updated specs in future updates

When you update another spec later (for example, layout_parser_spec), include all specs you want to set in the same call:

Python

c3.Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    'parserSettings': {
        'pipelineSpec': {
            'text_parser_spec': text_parser_spec,
            'image_parser_spec': image_parser_spec,
            'layout_parser_spec': layout_parser_spec
        }
    }
})

This method ensures the configuration remains consistent and avoids accidental resets.
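
One way to avoid forgetting a previously customized spec is to keep all of them in a single dictionary and build the payload from it each time. The following is a minimal sketch in plain Python. The helper name build_merge_payload is hypothetical, and the specs are shortened placeholders.

```python
def build_merge_payload(custom_specs, **updated_specs):
    """Record updated specs and return a payload containing all of them.

    Hypothetical helper; custom_specs accumulates every spec set so far.
    """
    custom_specs.update(updated_specs)
    return {"parserSettings": {"pipelineSpec": dict(custom_specs)}}


custom_specs = {}

# First update: only the text parser is customized (placeholder spec).
payload = build_merge_payload(custom_specs, text_parser_spec={"pdf": ["TEXT_EXTRACTOR.DEFAULT"]})

# Later update: the layout parser changes, and the text parser is still included.
payload = build_merge_payload(custom_specs, layout_parser_spec={"pdf": ["FILTER_DOC"]})

print(sorted(payload["parserSettings"]["pipelineSpec"]))
```

The resulting dictionary has the same shape as the argument passed to mergeSettings() in Steps 3 and 4.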

Updating what content Mew3 parses

Mew3 supports selective parsing of text, tables, and images. By default, it processes all content types. To skip specific types, update the following configuration fields:

JSON

"skipImageParsing": false,
"skipTableParsing": false,
"skipTextParsing": false

Set a field to true to skip that content type:

skipImageParsing: Skips image extraction.
skipTableParsing: Skips table parsing.
skipTextParsing: Skips text extraction.

For example, to extract only tables and images:

JSON

"skipTextParsing": true,
"skipTableParsing": false,
"skipImageParsing": false

Adjust these settings to reduce processing time or focus on specific content types that are more relevant for your use case.

Steps to update

Run the following in the Application C3 AI Console, adjusting the values for your use case:

JavaScript

```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { skipTextParsing: true, skipTableParsing: false, skipImageParsing: false } });
```

Parsing Presets

Mew3 provides built-in presets that let you select a parsing strategy suited to your document type without manually constructing a pipeline spec. Each preset configures the pipelineSpec and documentSpec together as a starting point. Any individual fields you set explicitly, such as llmClient or chunkSize, always take precedence over the preset values.

Preset	Best for	CPU cost	Token cost
`default`	Most documents — balanced extraction with table and image verbalization	4/5	2/5
`table_focused`	Table-heavy documents requiring full CSV and HTML extraction	4/5	5/5
`text_heavy`	Text-dominant PDFs — uses PLUMBER for text instead of OCR	2/5	2/5
`pure_vlm`	Environments without Docling — LLM-only layout parsing	1/5	5/5
`no_verbalization`	Speed-first processing — skips all LLM calls for images and tables	4/5	0/5

To apply a preset, run the following in the Application C3 AI Console:

JavaScript

```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'text_heavy' } });
```

To inspect what the preset resolves to without persisting it, use Genai.ParserSettings#getParserSettings:

Python

settings = c3.Genai.ParserSettings.make({'parsingPreset': 'text_heavy'})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)
print(resolved['pipelineSpec'])
print(resolved['documentSpec'])

Example: Table-heavy documents

Use the table_focused preset for documents with complex tables, such as financial reports or manufacturing specifications. This preset runs full CSV and HTML extraction alongside LLM verbalization, giving the most complete table representation at a higher token cost.

JavaScript

```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'table_focused' } });
```

table_focused runs both CSV_EXTRACT.DOCLING and CSV_EXTRACT.LLM in sequence. This produces the most accurate structured data extraction, but it has the highest token cost of all presets and is very time-consuming.

Example: Simple text and PDF documents

Use the text_heavy preset for documents that are primarily text with minimal tables or images. This preset uses PLUMBER for text extraction instead of OCR, which is faster and reduces both CPU and token usage. On CPU-only nodes, combine it with disableOcrOnCpu to avoid OCR overhead entirely.

JavaScript

```js
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { parsingPreset: 'text_heavy' }});
```

Combining presets with manual overrides

You can use a preset as a baseline and override individual sub-specs without losing the rest of the preset. Preset merging operates per key: your value takes precedence if non-empty, otherwise the preset fills it in. This applies to both pipelineSpec and documentSpec.

Use Genai.ParserSettings#resolvePreset to see the merged result of a preset and your overrides:

Python

# Use pure_vlm preset but use a minimal table parser
settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'pure_vlm',
    'pipelineSpec': {
        'table_parser_spec': {
            'pdf': [
                {'kind': 'FILTER', 'kwargs': {'column': 'markdown_class', 'include': ['TABLE']}}
            ]
        }
        # image_parser_spec, text_parser_spec, etc. are inherited from pure_vlm
    }
})
resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

The same per-key rule applies to documentSpec. To override only text cropping while inheriting the rest from text_heavy:

Python

settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'text_heavy',
    'documentSpec': {'cropping': {'text': 'OCR'}}
    # image and table cropping are inherited from text_heavy
})
resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

This per-key merge is a change from earlier behavior, where setting any value in pipelineSpec caused the entire preset section to be ignored. Partial overrides now work as expected.
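
The per-key rule can be expressed in a few lines of plain Python. This sketch is an illustration of the rule as stated above (a non-empty user value wins, otherwise the preset value is used) and is not the resolvePreset implementation. The function name and the placeholder step names are hypothetical, and the merge is assumed to apply at the level of the sub-spec keys.

```python
def merge_per_key(preset, override):
    """Merge override into preset key by key; non-empty overrides win.

    Hypothetical sketch of the per-key merge rule.
    """
    merged = dict(preset)
    for key, value in override.items():
        if value:  # empty dicts, empty lists, and None fall back to the preset
            merged[key] = value
    return merged


preset_pipeline = {
    "table_parser_spec": {"pdf": ["PRESET_TABLE_STEP"]},
    "text_parser_spec": {"pdf": ["PRESET_TEXT_STEP"]},
}
user_pipeline = {
    "table_parser_spec": {"pdf": ["CUSTOM_TABLE_STEP"]},
    "text_parser_spec": {},  # empty, so the preset value is kept
}

print(merge_per_key(preset_pipeline, user_pipeline))
```

Under the earlier behavior described above, the presence of any user value would have discarded both preset entries.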

Performance Tuning

Benchmarks by document type

Based on profiling across representative document sets on a GPU node:

Document type	GPU (per page)
PSR / mixed documents	~8.3 seconds
Table-heavy documents	~11.8 seconds

The table-heavy path is slower because table processing involves multiple sequential steps: layout detection, CSV extraction, and LLM verbalization. CPU-only nodes run approximately 25% slower than the GPU figures above.
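
These per-page figures can be turned into a rough planning estimate. The sketch below makes two assumptions that are not stated in the benchmarks: "25% slower" is read as 1.25 times the GPU time per page, and work is assumed to divide evenly across task nodes. Treat the result as an order-of-magnitude guide only.

```python
def estimate_minutes(pages, seconds_per_page, gpu=True, task_nodes=1):
    """Rough chunking time in minutes under the stated assumptions."""
    seconds = pages * seconds_per_page
    if not gpu:
        seconds *= 1.25  # assumption: 25% slower means 1.25x the time
    return seconds / task_nodes / 60  # assumption: linear scaling across nodes


# 600 pages of mixed documents on one GPU node.
print(round(estimate_minutes(600, 8.3), 1))

# 600 table-heavy pages on two CPU-only task nodes.
print(round(estimate_minutes(600, 11.8, gpu=False, task_nodes=2), 1))
```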

The top bottlenecks on table-heavy documents, from profiling data, are:

Text

1) CSV_EXTRACT.DOCLING    26.8%
2) LAYOUT_MODEL.DOCLING   15.6%
3) CSV_EXTRACT.OPENAI     14.5%
4) TABLE_VERBAL.OPENAI    13.0%
5) TEXT_EXTRACTOR.DEFAULT  8.3%

Quick wins

The most impactful single change is choosing the right preset for your document type. Beyond that:

Lower llmTokenLimit — The default is 4096 tokens. Table verbalizations average around 350 tokens and image verbalizations around 200. Reducing this limit cuts LLM call latency by approximately 10% on table-heavy documents. Only use this when CSV_EXTRACT.DOCLING is not in your pipeline, because the same token limit applies to CSV extraction and a lower value may truncate long tables.
JavaScript
```
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { llmTokenLimit: 512 } });
```
Disable OCR on CPU — Setting disableOcrOnCpu skips OCR when no GPU is available and falls back to available text extraction methods. This avoids the significant CPU overhead of running OCR and is recommended for CPU-only environments.
JavaScript
```
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { disableOcrOnCpu: true } });
```
Tune LLM thread count — maxTaskThreads (default 13) controls how many LLM calls run in parallel. Increase this value if your LLM endpoint supports higher concurrency. Decrease it if you encounter rate limit errors.
JavaScript
```
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { maxTaskThreads: 20 } });
```
Increase Docling parallelism with maxTaskProcesses — Controls how many parallel processes Docling uses internally. The default is 1 (no parallelism). Increasing this can speed up layout parsing on multi-core machines but raises memory consumption. Set to a negative number to let Docling manage the process count automatically.
JavaScript
```
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { maxTaskProcesses: 4 } });
```
Tune PDF page batch size with pdfPageChunksSize — Controls how many pages are grouped into a single parallel task when processing PDFs. The default is 8. Increase to reduce task-dispatch overhead on large documents, or decrease for more fine-grained parallelism on documents with heavy per-page processing.
JavaScript
```
Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { pdfPageChunksSize: 4 } });
```
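
To see how the page batch size affects the number of parallel tasks, the following sketch groups page indices into batches. It illustrates the arithmetic only. The function name is hypothetical, and the actual task dispatch in Mew3 may differ.

```python
def page_batches(page_count, batch_size=8):
    """Return (start, end) page ranges, end exclusive, one per task."""
    return [
        (start, min(start + batch_size, page_count))
        for start in range(0, page_count, batch_size)
    ]


print(page_batches(20, batch_size=8))
print(len(page_batches(20, batch_size=4)))
```

A 20-page PDF yields three tasks with a batch size of 8 and five tasks with a batch size of 4, which is the trade-off between dispatch overhead and fine-grained parallelism.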

GPU vs CPU

GPU is strongly recommended when using the default Docling-based layout parsing. Layout parsing benchmarks show approximately 3.1 pages per second on an L40S GPU. T4 GPUs, which are common in standard cluster environments, achieve approximately 0.56 pages per second on the same step — still significantly faster than CPU. For cost-sensitive or CPU-only environments, the pure_vlm preset eliminates Docling entirely by using LLM-only layout parsing, which reduces CPU load at the cost of higher token usage.

On CPU-only nodes, the most reliable way to improve throughput is to scale the number of task nodes so that chunking is distributed across them. For guidance on scaling, see Troubleshoot Multimodal Parsing.

Image and Table Verbalization

Mew3 calls an LLM to generate text descriptions (verbalizations) of images and tables. The following fields control image preparation and verbalization behavior.

`imageExtractionDpi`

Sets the resolution used when extracting images from PDFs. Higher values produce sharper images, which improves LLM verbalization accuracy for diagrams and charts, but increases processing time and storage.

DPI	Quality
`72`	Low
`144`	Standard (default)
`170`	Good
`300`	High

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { imageExtractionDpi: 300 } });
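
As a rough guide to the cost of a higher DPI, pixel dimensions scale linearly with DPI, so the pixel count scales with its square. The sketch below applies this standard relationship to a full US Letter page (8.5 by 11 inches). It is an illustration only, since extracted images are usually smaller regions of a page and the way Mew3 rasterizes them is not described here.

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a region of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (72, 144, 170, 300):
    width, height = pixel_size(8.5, 11, dpi)
    print(dpi, width, height, width * height)
```

Moving from 144 to 300 DPI roughly quadruples the number of pixels, which is worth weighing against the llmMaxImageBytes limit.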

`llmMaxImageBytes`

Sets the maximum byte size of an image sent to the LLM for verbalization. Images larger than this threshold are automatically downsampled before being passed to the model. The default is 7000000 (7 MB). Reduce this value if your LLM enforces a lower image size limit.

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { llmMaxImageBytes: 3000000 } });

`imageVerbalizationPrompt` and `tableVerbalizationPrompt`

Control the prompts used for LLM-based verbalization of images and tables. The defaults reference the named prompts image_verbalization_mew3_prompt and table_verbalization_mew3_prompt. Override these to customize the verbalization instructions for your document domain.

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        imageVerbalizationPrompt: { id: 'my_custom_image_prompt' },
        tableVerbalizationPrompt: { id: 'my_custom_table_prompt' }
    }
});

Inspecting the Active Parser Configuration

Before customizing the pipeline or triggering a chunking run, you can inspect the fully resolved configuration that Mew3 will use at runtime. This is useful for verifying that your preset and any manual overrides are applied correctly.

Use Genai.ParserSettings#getParserSettings in a JupyterLab notebook with the py-mew3 runtime to retrieve the complete, effective configuration:

Python

settings = c3.Genai.ParserSettings.make({'parsingPreset': 'default'})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)

The returned object has two keys:

pipelineSpec — The fully hydrated pipeline configuration, organized by component (LayoutParserCfg, TextParserCfg, ImageParserCfg, TableParserCfg, TextChunkerCfg, ImageChunkerCfg, TableChunkerCfg) and then by file type (pdf, docx, and so on). All default values are expanded.
documentSpec — The effective document-level options such as cropping strategies, with all preset defaults resolved.

Example output:

JSON

{
  "pipelineSpec": {
    "LayoutParserCfg": {
      "pdf": {
        "pipeline": [
          "DOCLING_OR_VLM",
          {
            "kind": "FILTER_DOC",
            "kwargs": {
              "bbox_class_threshold": {
                "Caption": 0.2,
                "Picture": 0.5,
                "Table": 0.2,
                "Text": 0.2,
                "..."
              }
            }
          },
          "FILTER_EMPTY_TEXT_BBOXES",
          "UNION_BOX",
          "NO_OVERLAP",
          "ORDER_BY_COLUMN(grid_step=150)",
          "TO_MARKDOWN_CLASS",
          "MERGE_LIST",
          "..."
        ]
      },
      "docx": { "..." },
      "..."
    },
    "TextParserCfg": { "..." },
    "TableParserCfg": { "..." },
    "ImageParserCfg": { "..." },
    "TextChunkerCfg": { "..." },
    "TableChunkerCfg": { "..." },
    "ImageChunkerCfg": { "..." }
  },
  "documentSpec": {
    "image_crop": "AUTO",
    "text_crop": "AUTO",
    "table_crop": "AUTO",
    "link_crop": "AUTO",
    "page_cnt": "AUTO",
    "grid_gen": "AUTO",
    "toc": "AUTO"
  }
}
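
A pipeline in this output mixes plain step names with objects that carry a `kind` and `kwargs`. The sketch below shows one way to list the steps in order, assuming the resolved result can be treated as nested Python dicts and lists. `step_names` is a hypothetical helper, and `sample_pipeline` is a trimmed stand-in for the `pdf` layout pipeline rather than real output.

```python
def step_names(pipeline):
    """Return the name of each step; object steps are identified by 'kind'."""
    names = []
    for step in pipeline:
        if isinstance(step, dict):
            names.append(step.get("kind", "<unknown>"))
        else:
            names.append(str(step))
    return names


# Trimmed stand-in for resolved["pipelineSpec"]["LayoutParserCfg"]["pdf"]["pipeline"]
sample_pipeline = [
    "DOCLING_OR_VLM",
    {"kind": "FILTER_DOC", "kwargs": {"bbox_class_threshold": {"Table": 0.2}}},
    "FILTER_EMPTY_TEXT_BBOXES",
    "UNION_BOX",
]

print(step_names(sample_pipeline))
# ['DOCLING_OR_VLM', 'FILTER_DOC', 'FILTER_EMPTY_TEXT_BBOXES', 'UNION_BOX']
```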

Filter by file type

To inspect the configuration for a specific file type, pass the file extension as the second argument:

Python

resolved_pdf = c3.Genai.ParserSettings.getParserSettings(settings, 'pdf')
resolved_docx = c3.Genai.ParserSettings.getParserSettings(settings, 'docx')

When filtered, each component in pipelineSpec returns the settings for that file type directly rather than a per-file-type map. Passing an unsupported extension raises a ValueError listing the supported types.
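
The difference in shape between the filtered and unfiltered results can be pictured with plain dictionaries. The sketch below mimics the described behavior on sample data only: `filter_by_file_type` and `SUPPORTED_TYPES` are hypothetical names, the supported list is an illustrative subset, and the sample values are placeholders rather than real defaults.

```python
SUPPORTED_TYPES = ("pdf", "docx")  # illustrative subset, not the real list


def filter_by_file_type(pipeline_spec, file_type):
    """Collapse each component's per-file-type map to a single file type."""
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(
            f"Unsupported file type {file_type!r}; "
            f"supported: {', '.join(SUPPORTED_TYPES)}"
        )
    return {name: per_type.get(file_type) for name, per_type in pipeline_spec.items()}


# Unfiltered shape: component -> file type -> settings (placeholder values).
unfiltered = {
    "LayoutParserCfg": {
        "pdf": {"pipeline": ["DOCLING_OR_VLM"]},
        "docx": {"pipeline": []},
    },
}

# Filtered shape: component -> settings for the requested file type.
print(filter_by_file_type(unfiltered, "pdf"))
# {'LayoutParserCfg': {'pipeline': ['DOCLING_OR_VLM']}}
```

Code that reads the filtered result should therefore index the component directly instead of looking up the file type a second time.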

Verify preset and override resolution

If your configuration uses a parsingPreset with manual overrides, getParserSettings() shows the exact merged result. This is the recommended way to confirm your overrides took effect before running a chunking job:

Python

settings = c3.Genai.ParserSettings.make({
    'parsingPreset': 'text_heavy',
    'documentSpec': {'cropping': {'text': 'OCR'}}
})
resolved = c3.Genai.ParserSettings.getParserSettings(settings)

# text_crop is 'OCR'           (user override wins)
# table_crop is 'PLUMBER_TEXT' (inherited from text_heavy preset)
# image_crop is 'AUTO'         (inherited from text_heavy preset)
print(resolved['documentSpec'])
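
The merge rule described here, where a user override replaces only the keys it names and everything else is inherited from the preset, can be sketched as a recursive dictionary merge. This is an illustration of the rule and not the Mew3 implementation: `deep_merge` is a hypothetical helper, and the `text` default shown for the preset is a placeholder.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override applied key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested maps
        else:
            merged[key] = copy.deepcopy(value)  # override wins on leaf values
    return merged


# Placeholder preset defaults; only 'table' and 'image' come from the text above.
preset_defaults = {"cropping": {"text": "AUTO", "table": "PLUMBER_TEXT", "image": "AUTO"}}
user_override = {"cropping": {"text": "OCR"}}

print(deep_merge(preset_defaults, user_override))
# {'cropping': {'text': 'OCR', 'table': 'PLUMBER_TEXT', 'image': 'AUTO'}}
```

Because the merge is per key, overriding one cropping strategy leaves the other preset values untouched.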

Resolve preset without full inspection

If you need the resolved Genai.ParserSettings object itself rather than the introspection JSON, use Genai.ParserSettings#resolvePreset directly:

Python

resolved_settings = c3.Genai.ParserSettings.resolvePreset(settings)

This returns the same type as the input with the preset merged in. The chunker calls this automatically at runtime; use it manually when you need to pass the merged settings object to another API.

Advanced Configuration

`doHierarchyExtraction`

When enabled, Mew3 extracts the hierarchical document structure during parsing, capturing parent-child relationships between headings and content blocks. Disabled by default.

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { doHierarchyExtraction: true } });
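
As a rough picture of what a heading hierarchy looks like, the sketch below builds a tree from a flat list of blocks, attaching each content block to the nearest open heading. It is a generic stack-based approach under simplifying assumptions, not the algorithm Mew3 uses; `build_hierarchy` and the `(level, text)` input format are hypothetical.

```python
def build_hierarchy(blocks):
    """Build a tree from (level, text) pairs; level 0 marks body content."""
    root = {"text": "ROOT", "level": 0, "children": []}
    stack = [root]  # currently open headings, outermost first
    for level, text in blocks:
        node = {"text": text, "level": level, "children": []}
        if level == 0:
            stack[-1]["children"].append(node)  # content joins the nearest heading
            continue
        # Close headings at the same or a deeper level before opening this one.
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


blocks = [(1, "Intro"), (0, "p1"), (2, "Details"), (0, "p2"), (1, "Setup")]
tree = build_hierarchy(blocks)
print([child["text"] for child in tree["children"]])  # ['Intro', 'Setup']
```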

`disableComponentOnFailure`

When set to true, Mew3 disables a pipeline component that fails (for example, a parser step that throws an exception) and continues processing instead of stopping entirely. This is useful in production environments where partial results are preferable to a complete processing failure. Disabled by default.

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({ parserSettings: { disableComponentOnFailure: true } });
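
The trade-off behind this setting can be illustrated with a small, generic pipeline runner. The sketch below is a simplified model and not Mew3 code: `run_components`, the component names, and the shared `disabled` set are all hypothetical.

```python
def run_components(components, doc, disabled, disable_on_failure=False):
    """Run each (name, step) in order, skipping components already disabled."""
    for name, step in components:
        if name in disabled:
            continue
        try:
            doc = step(doc)
        except Exception:
            if not disable_on_failure:
                raise  # default behavior: the whole run stops
            disabled.add(name)  # skip this component from now on
    return doc


def broken_table_step(doc):
    raise RuntimeError("table parser failed")


components = [("strip", str.strip), ("table", broken_table_step), ("upper", str.upper)]
disabled = set()

print(run_components(components, "  hello ", disabled, disable_on_failure=True))  # HELLO
print(disabled)  # {'table'}
```

With the flag off, the same input raises the `RuntimeError` and produces no output, which is the safer choice when silently missing tables or images would be worse than a failed run.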

`airgapCacheLocalToRemote`

Maps local file system paths to remote storage locations. On startup, Mew3 copies files from the specified remote locations to the corresponding local paths. This allows airgapped environments (without external internet access) to cache models and data from an internal object store such as GCS or S3.

The value is a JSON object mapping local paths to remote URIs:

JavaScript

Genai.UnstructuredPipeline.forId('myPipeline').mergeSettings({
    parserSettings: {
        airgapCacheLocalToRemote: {
            '/home/c3/nltk_data': 'gcs://my-bucket/nltk_data/'
        }
    }
});
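
Because a malformed entry may only surface at startup, it can help to sanity-check the mapping before applying it. The sketch below is a hypothetical pre-flight check and not part of Mew3; it assumes that local paths should be absolute and that remote locations are URIs with a scheme and a bucket.

```python
import posixpath
from urllib.parse import urlparse


def check_airgap_mapping(mapping):
    """Return a list of problems found in a local-path to remote-URI mapping."""
    problems = []
    for local_path, remote_uri in mapping.items():
        if not posixpath.isabs(local_path):
            problems.append(f"local path is not absolute: {local_path}")
        parsed = urlparse(remote_uri)
        if not parsed.scheme or not parsed.netloc:
            problems.append(f"remote URI lacks a scheme or bucket: {remote_uri}")
    return problems


mapping = {"/home/c3/nltk_data": "gcs://my-bucket/nltk_data/"}
print(check_airgap_mapping(mapping))  # []
```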
