Configure Runtime Parameters and Run a DI Pipeline
This section describes how to prepare a DI pipeline for a run: review the full pipeline on the canvas, save the configuration, and set the runtime parameters before starting the run. Runtime parameters cover file selection, CSV and chunking options, data-handling rules, error-handling behavior, and archive settings.
Confirm the End-to-End Pipeline on the Canvas
Close the preview to return to the Data Integration canvas.
Verify that the pipeline now shows a complete, connected flow:
- Source System (for example, EMRSystem using S3)
- Source Collection (PatientRecord)
- Source Schema (PatientSource)
- Transform (PatientSource-Patient)
- Target Type (Patient)
Ensure each node displays a status indicator (for example, field counts such as 10 fields, 20 fields) confirming successful configuration.
Save the Pipeline Configuration
After you save, the pipeline is persisted with:
- The selected target type
- All configured field mappings
- The validated transform logic
At this point, the pipeline definition is complete and ready for execution.
Configure Runtime Parameters
After defining the Source, Schema, and Transform, the pipeline structure is complete. However, execution behavior is not yet defined.
Runtime parameters control how the pipeline runs, not how it is built. These settings determine:
- Which files are processed in the current run
- Whether previously processed data should be reprocessed
- How parsing behaves at execution time
- How large files are handled
- How failures are tolerated
- What happens to files after processing
This step allows operators to control ingestion behavior without modifying the pipeline design. It provides flexibility for incremental loads, reprocessing scenarios, performance tuning, and operational recovery.
In short, pipeline configuration defines what to process and how data should be transformed. Runtime configuration defines how this specific execution should behave.
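One way to picture this split is as two separate records: one fixed at design time and one chosen per run. The sketch below is illustrative only and is not the product's API. The class names, field names, and default values are assumptions made up for this example; only the node names and the setting names come from this section.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PipelineDefinition:
    """What to process and how to transform it (set when the pipeline is built)."""
    source_collection: str
    source_schema: str
    transform: str
    target_type: str


@dataclass(frozen=True)
class RuntimeParameters:
    """How one specific execution should behave (set each time the pipeline runs)."""
    files: Optional[Tuple[str, ...]] = None  # None stands for "All Files"
    reset_before_executing: bool = False
    handle_existing_data: str = "keep"       # "keep" or "delete"
    chunking_enabled: bool = False
    errors_threshold: int = 0                # -1 stands for unlimited errors
    number_of_retries: int = 0


# The definition stays the same across runs.
pipeline = PipelineDefinition(
    source_collection="PatientRecord",
    source_schema="PatientSource",
    transform="PatientSource-Patient",
    target_type="Patient",
)

# Two runs of the same pipeline with different runtime behavior.
incremental_run = RuntimeParameters()
recovery_run = replace(incremental_run, reset_before_executing=True, errors_threshold=-1)
```

Both runs reuse the same `pipeline` object unchanged, which mirrors how operators can adjust ingestion behavior without modifying the pipeline design.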
Open Runtime Configuration
On the canvas, locate the Source Collection node.
Click Run (▶) or select Execute pipeline from the node menu.
In the Configure Runtime Parameters modal, review and adjust settings across the available tabs.
Select Files to Process (Basic Tab)
Use the Basic tab to choose which files are included in the pipeline run.
Select Input Files
You may choose:
- All Files: Process all detected files.
- Select Files: Manually select specific files for this run.
If you choose Select Files:
- Review the list of available files detected at the configured source path.
- Use column filters (such as File Path, Size, Last Modified, or Processing Status) to narrow results.
- Select one or more files to include in the run.
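The column filters behave like predicates applied to the detected file list. The sketch below models that idea in plain Python. It does not call the product; `SourceFile`, `select_files`, the file paths, and the status values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SourceFile:
    path: str
    size_bytes: int
    last_modified: datetime
    status: str  # hypothetical status values: "new", "processed", "failed"


def select_files(files, path_contains="", modified_after=None, status=None):
    """Return the files that match every filter that was supplied."""
    selected = []
    for f in files:
        if path_contains not in f.path:
            continue
        if modified_after is not None and f.last_modified <= modified_after:
            continue
        if status is not None and f.status != status:
            continue
        selected.append(f)
    return selected


detected = [
    SourceFile("patients/2024-01.csv", 1_200, datetime(2024, 1, 31), "processed"),
    SourceFile("patients/2024-02.csv", 1_350, datetime(2024, 2, 29), "failed"),
    SourceFile("patients/2024-03.csv", 900, datetime(2024, 3, 31), "new"),
]

# Narrow the list to failed files modified after 1 February 2024.
to_run = select_files(detected, modified_after=datetime(2024, 2, 1), status="failed")
print([f.path for f in to_run])
```

Calling `select_files(detected)` with no filters returns every file, which corresponds to the All Files choice.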
Sync Source Collection
Click Sync Source Collection to refresh the file list from the underlying storage system.
Use this after adding, modifying, or deleting files to ensure the pipeline processes the latest source content.
Reprocess Options
Reset Pipeline Before Executing
Stops the current run and clears queued work so files can be reprocessed from the beginning.
Use this when recovering from a failed run or restarting execution cleanly.
Handle Existing Data
You must choose how existing target data should be treated:
- Keep: Preserves previously ingested records. New data is appended or merged according to entity rules.
- Delete: Removes existing target data. This option supports development and testing scenarios where resetting the target state is necessary. It is intentionally disabled in production to prevent accidental data loss or interference with in-flight ingestion operations.
Behavior Details
When Delete is selected, data that is already in progress may finish processing later and is not removed immediately. To fully clear all existing target data, monitor the active run and confirm that the source queue has no pending entries. Once the queue is empty, all processed data is cleared as expected.
Configure CSV Parsing Options (CSV Configuration Tab)
If the Source Collection contains CSV files, use this tab to control how data is parsed. These settings override default parsing behavior for this execution only.
Configure CSV Settings
- Set the Delimiter (for example, comma or tab).
- Choose the Quote and Escape characters.
- Optionally specify a Header Override if incoming headers differ from the expected schema.
- Enable or disable CSV Header depending on whether the first row contains column names.
These settings apply only to the current execution unless published as part of configuration management.
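The options on this tab correspond to parameters that most CSV parsers expose. As a rough illustration, the sketch below uses Python's standard `csv` module rather than the product's parser. The function `parse_csv` and its arguments are hypothetical, and treating Header Override as a replacement for the file's own column names is an assumption.

```python
import csv
import io


def parse_csv(text, delimiter=",", quotechar='"', escapechar=None,
              has_header=True, header_override=None):
    """Parse CSV text into a list of dicts keyed by column name."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    rows = list(reader)
    if not rows:
        return []
    if has_header:
        # The first row holds column names and is not treated as data.
        file_header, data = rows[0], rows[1:]
    else:
        file_header, data = None, rows
    # Assumption: an override replaces whatever names the file provides.
    header = header_override or file_header
    if header is None:
        raise ValueError("No header row and no header override supplied")
    return [dict(zip(header, row)) for row in data]


# Tab-delimited input whose header names differ from the expected schema.
sample = 'id\tname\n1\t"Doe, Jane"\n'
records = parse_csv(sample, delimiter="\t", header_override=["patient_id", "full_name"])
print(records)
```

With the header flag disabled and no override, the sketch raises an error because it has no column names to apply.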
Override Source URLs (URL Overrides Tab)
Use this tab to override where files are archived after processing.
Configure Archive Location
- Enter a custom Archive URL to control where processed files are moved.
- Enable External if file lifecycle management is handled outside of Data Fusion.
This is useful when downstream systems manage retention or cleanup.
Configure Chunking Behavior (Chunking Control Tab)
Chunking allows large files to be split into smaller units for parallel processing.
Enable and Tune Chunking
- Enable Chunking.
- Set Chunk Size (Records) to control how many records each chunk contains.
- Set Chunk Size (MB) to cap the size of each chunk in megabytes.
- Optionally enable Clean Pending Chunks to remove incomplete chunks from prior runs.
Chunking improves throughput for large datasets and long-running pipelines.
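To make the two chunk limits concrete, the sketch below splits a sequence of records into chunks and closes a chunk when either limit is reached. This is an illustration under assumptions, not the product's chunking algorithm: this section does not state how the two limits combine, the size limit is expressed in bytes here for simplicity, and `chunk_records` is a hypothetical name.

```python
def chunk_records(records, max_records, max_bytes):
    """Yield lists of records, closing a chunk when either limit would be exceeded."""
    chunk, chunk_bytes = [], 0
    for record in records:
        size = len(record.encode("utf-8"))
        # Close the current chunk if adding this record would break a limit.
        if chunk and (len(chunk) >= max_records or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(record)
        chunk_bytes += size
    if chunk:
        yield chunk


rows = ["x" * 10 for _ in range(5)]  # five records of 10 bytes each

# Limited by record count: at most 2 records per chunk.
by_count = [len(c) for c in chunk_records(rows, max_records=2, max_bytes=1_000)]

# Limited by size: at most 25 bytes per chunk, so 2 records fit.
by_size = [len(c) for c in chunk_records(rows, max_records=100, max_bytes=25)]

print(by_count, by_size)
```

A single record larger than the size limit still gets a chunk of its own in this sketch, since a record is never split.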
Configure Error Handling (Error Handling Tab)
Control how the pipeline responds to processing failures.
Set Error Thresholds
- Errors threshold: Specify how many errors are allowed before the pipeline aborts. Use -1 to allow unlimited errors.
- Number of retries: Set how many retry attempts occur for failed write operations, such as version conflicts.
These settings help balance resilience and correctness during execution.
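The interaction between the two settings can be sketched as a retry loop nested inside an error counter. The code below is a simplified model, not the product's implementation. `VersionConflict`, `write_with_retries`, and `run` are hypothetical names, and aborting once the error count exceeds the threshold is an assumed reading of "errors allowed before the pipeline aborts".

```python
class VersionConflict(Exception):
    """Stand-in for a failed write, such as a version conflict."""


def write_with_retries(write, record, retries):
    """Try the write once, then retry up to `retries` more times."""
    for _ in range(retries + 1):
        try:
            write(record)
            return True
        except VersionConflict:
            continue
    return False


def run(records, write, errors_threshold, retries):
    """Process records and abort when errors exceed the threshold (-1 = unlimited)."""
    errors = 0
    for record in records:
        if not write_with_retries(write, record, retries):
            errors += 1
            if errors_threshold != -1 and errors > errors_threshold:
                raise RuntimeError(f"Aborted after {errors} errors")
    return errors


def always_conflict(record):
    raise VersionConflict(record)


# With an unlimited threshold the run finishes and reports every failure.
total_errors = run(["a", "b", "c"], always_conflict, errors_threshold=-1, retries=2)
print(total_errors)  # prints 3
```

A higher retry count absorbs transient write failures before they count as errors, while a lower threshold stops the run sooner when failures persist.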
Content Processing
Defines content-level preprocessing options that apply before transformation.
Available settings may vary depending on the connector type.
Serialization Options
Controls how data serialization and deserialization are handled during ingestion.
This is particularly relevant when using custom content types or specialized formats.
Metadata
Defines how metadata is captured or applied during ingestion, such as preserving source file attributes or attaching processing timestamps.
File Operations
Specifies how files are handled after processing. Depending on configuration, files may:
- Remain in place
- Be moved to an archive location
- Be deleted after successful processing
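The three outcomes can be modeled as a single post-processing step that takes an action name. The sketch below works on the local filesystem with Python's standard library purely for illustration. It is not how the product handles files in S3 or other storage, and `finalize_file` and the action names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def finalize_file(path, action, archive_dir=None):
    """Apply a post-processing action and return the file's new location, if any."""
    if action == "keep":
        return path
    if action == "archive":
        if archive_dir is None:
            raise ValueError("archive_dir is required for the archive action")
        archive_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.move(str(path), str(archive_dir / path.name)))
    if action == "delete":
        path.unlink()
        return None
    raise ValueError(f"Unknown action: {action}")


# Demonstrate the archive action inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "patients.csv"
    source.write_text("id,name\n1,Jane\n")
    archived = finalize_file(source, "archive", Path(tmp) / "archive")
    archived_exists = archived.exists()
    source_still_there = source.exists()
```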
Archiving
Configures how and where processed files are archived, including archive paths and retention handling.
Batch Settings
Controls batch-level execution parameters such as batch size limits, concurrency, and execution throttling.
Affected Targets
Displays the target entities or canonical types that will be modified during execution.
This provides visibility into the scope of data impact before running the pipeline.
Execute the Pipeline
Click Execute to start the pipeline run using the configured runtime settings.
During execution:
- File-level processing status updates in real time
- Errors are captured in run history
- File states transition based on success or failure
- A notification confirms that processing has started