Configure the Source System and Source Collection

This topic explains how to configure the source system and source collection when building a Data Integration pipeline in Data Fusion. It walks through selecting a data connector, configuring the source system, defining the source collection, and specifying the source path. It also explains how to manage files and data directly from the source path, outlines key considerations for file-based, cloud storage, and SQL sources, and describes how to synchronize source files before proceeding to the next pipeline step.

Open the Data Integration Canvas

The Data Integration canvas is the central workspace for building ingestion pipelines in Data Fusion. It provides a visual representation of how data flows from a source system through schema definition and transformation into a target entity.

Each node on the canvas represents a configuration stage (Source System, Source Collection, Schema, Transform, and Target). The canvas dynamically guides you through valid next steps, ensuring the pipeline is constructed in the correct sequence.
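
The sequencing rule described above can be pictured with a small sketch. The following Python is purely illustrative and is not part of the Data Fusion API: the STAGES list mirrors the node order named in this topic, and next_stage is a hypothetical helper that returns the one valid next step for a partially built pipeline.

```python
from typing import Optional, Sequence

# Stage order as listed in this topic; the names are illustrative labels only.
STAGES = ["Source System", "Source Collection", "Schema", "Transform", "Target"]


def next_stage(configured: Sequence[str]) -> Optional[str]:
    """Return the next stage to configure, or None when the pipeline is complete.

    Raises ValueError if the configured stages do not follow the expected order.
    """
    if list(configured) != STAGES[: len(configured)]:
        raise ValueError(f"stages out of order: {list(configured)}")
    if len(configured) == len(STAGES):
        return None
    return STAGES[len(configured)]
```

For example, next_stage(["Source System"]) returns "Source Collection", while passing ["Schema"] alone raises an error because the earlier stages are missing.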

Begin by opening the Data Integration canvas to create or manage your ingestion pipelines.

Steps

  1. From the app, navigate to Data Fusion → Data Integration.
  2. Ensure you are on the Data Integration tab.
  3. Click + Add Data Source in the upper-right corner of the canvas.

Select a Data Connector

Selecting a data connector determines how Data Fusion will communicate with the external system. Each connector encapsulates the protocol, authentication model, and data access pattern specific to that system (for example, SQL-based, file-based, or API-based access).

Choosing the correct connector ensures the platform can correctly interpret how data is exposed and apply the appropriate integration logic. An incorrect selection may result in incompatible configuration fields or unsupported connection behavior.

Steps

  1. In the Select a Data Connector window, browse or search for the desired connector.
  2. Select the source system type (for example, Amazon S3, Snowflake, BigQuery, or another supported system).
  3. Click the source system tile to proceed.

Configure the Source System

Configuring the Source System establishes a secure, authenticated connection between Data Fusion and the external data platform. This step defines how the system is identified and how Data Fusion connects to it, and it verifies that the connection is reachable. Without this configuration, Data Fusion cannot access or list datasets from the external source. Once validated, the Source System becomes a reusable connection that supports one or more Source Collections within the pipeline.

Steps

  1. In the Configure Data Connector window, provide:

    • Name – A unique, user-defined name for the source system.
    • Description – Optional context about the data or usage.
  2. Under Configuration, enter the required connection details (for example, access keys, bucket root path, region, or SSL settings, depending on the connector).

  3. Click Test Connection to validate connectivity.

  4. Click Complete to add the source system to the canvas.

  5. Once saved, the source system appears as a node on the canvas with a successful connection indicator.
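
Because the connection details in step 2 vary by connector, it can help to check that nothing is blank before clicking Test Connection. The sketch below is a hypothetical pre-check, not a Data Fusion API: the connector keys and field names in REQUIRED_FIELDS are assumptions chosen for illustration, and the real fields are whatever the Configure Data Connector window shows for your connector.

```python
from typing import Dict, List

# Hypothetical required fields per connector type (assumed for illustration).
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "amazon_s3": ["access_key", "secret_key", "bucket_root_path", "region"],
    "snowflake": ["account", "user", "password"],
}


def missing_fields(connector: str, config: Dict[str, str]) -> List[str]:
    """List required fields that are absent or blank for the given connector."""
    required = REQUIRED_FIELDS.get(connector)
    if required is None:
        raise KeyError(f"unknown connector: {connector}")
    # A field counts as missing if it is not present or contains only whitespace.
    return [name for name in required if not config.get(name, "").strip()]
```

An empty result means every assumed field has a value; it says nothing about whether the values are correct, which is what Test Connection verifies.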

Configure Source Collection

A Source Collection defines the logical dataset that will be ingested from the external system. While the Source System establishes connectivity, the Source Collection specifies what subset of data should be processed.

Choosing between a structured and an unstructured source collection determines how Data Fusion interprets the data: either as tabular records (rows and columns) or as file or document content. This choice affects schema handling, transformation capabilities, and downstream processing behavior.

Steps

  1. In the canvas, select Add the Source Collection.
  2. In the Choose type of source collection dialog:
    • Select Structured source collection for tabular data.
    • Select Unstructured source collection for files or document-based data.

Configure the Source Path

The Source Path identifies the exact location of the data within the connected system. It scopes the ingestion to a specific directory, file, table, or object prefix.

Defining the correct path ensures the pipeline reads only the intended data and prevents accidental ingestion of unrelated or sensitive content. Reusing an existing path promotes consistency across pipelines, while defining a new one allows for targeted or incremental ingestion.
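
To illustrate how a path scopes ingestion, the sketch below filters a list of object keys down to those under a chosen directory. It is a generic Python illustration under stated assumptions, not how Data Fusion resolves paths internally: scope_to_path is a hypothetical helper, and the keys in the usage note are made up.

```python
from typing import Iterable, List


def scope_to_path(object_keys: Iterable[str], source_path: str) -> List[str]:
    """Keep only the keys that sit under source_path, treated as a directory."""
    root = source_path.strip("/")
    if not root:
        # An empty or "/" path means the whole source is in scope.
        return list(object_keys)
    # The trailing slash stops "sales" from also matching "sales_archive".
    prefix = root + "/"
    return [key for key in object_keys if key.lstrip("/").startswith(prefix)]
```

With the keys sales/2024/a.csv, sales_archive/old.csv, and hr/salaries.csv, scoping to sales/ keeps only sales/2024/a.csv. The similarly named sales_archive directory and the unrelated hr directory stay out of the pipeline.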

Steps

  1. In the Configure Source Collection dialog, choose one of the following:

    • Select New Source Path to define a new path.
    • Use an existing Source Path to reuse a previously defined path.
  2. Provide a Name (file) for the source collection.

  3. Browse and select the directory or file location from the source system.

  4. Click Save.

  5. The source path node is added to the canvas and connected to the source system.

Manage Data Directly from the Source Path

When configuring a Source Collection, the Data Fusion UI provides a unified explorer that lets you review and interact with your source data—whether the underlying storage is file-based, cloud-based, or SQL. This allows you to perform essential source-management tasks without leaving the pipeline configuration workflow.

For file-based and cloud storage sources

  • Create folders to organize incoming datasets.
  • Upload files directly into the selected source path (for supported cloud/file systems).
  • Delete files that are outdated or no longer needed.
  • Right-click items to access supported CRUD operations from the context menu.

For SQL sources

While SQL sources do not support file-level operations, you can:

  • Preview table contents
  • Validate schema and field structures
  • Confirm row counts or sample rows before pipeline execution

These capabilities streamline pipeline setup and testing by reducing dependency on external storage tools. Instead of switching to the cloud provider console (for example, S3 or Azure Blob), you can:

  • Quickly prepare or upload sample datasets for development or validation (file/cloud sources).
  • Organize folder structures before ingestion to avoid clutter and errors.
  • Remove incorrect or stale data to prevent unintended processing.
  • Validate SQL tables or schemas directly from the configuration flow.
  • Maintain a clean, ready-to-ingest source environment before running a pipeline.

This is particularly useful during:

  • Initial pipeline configuration
  • Testing transform logic
  • Reprocessing specific datasets
  • Demonstrations or proof-of-concept setups

Important Considerations

Operations reflect the real storage location.

  • Deleting files removes them from the underlying file system or cloud bucket.
  • SQL sources do not support destructive operations in this panel.
  • If files are deleted, they will no longer be available for ingestion.

After adding or removing files, you may need to Sync the Source Collection to refresh the file list before execution.

Capabilities differ by source type.

  • File and cloud sources support structural organization and file transfers.
  • SQL sources support data preview and structural validation only.
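
The split in capabilities can be summarized as a lookup table. The Python below is only a sketch of that summary: the operation names are illustrative labels for the actions listed in this topic, not identifiers from the product, and is_supported is a hypothetical helper.

```python
from enum import Enum
from typing import Dict, FrozenSet


class SourceKind(Enum):
    FILE_OR_CLOUD = "file_or_cloud"
    SQL = "sql"


# Operation names are illustrative labels for the actions listed in this topic.
CAPABILITIES: Dict[SourceKind, FrozenSet[str]] = {
    SourceKind.FILE_OR_CLOUD: frozenset({"create_folder", "upload_file", "delete_file"}),
    SourceKind.SQL: frozenset({"preview_table", "validate_schema", "check_row_count"}),
}


def is_supported(kind: SourceKind, operation: str) -> bool:
    """Return True if the operation is listed for this kind of source."""
    return operation in CAPABILITIES[kind]
```

Here is_supported(SourceKind.SQL, "delete_file") is False, which matches the rule that SQL sources do not support destructive operations in this panel.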

By offering lightweight, context-aware source management across file, cloud, and SQL systems, Data Fusion reduces friction during development while letting you keep full control over how data enters your pipelines.

Next Step

After you create a FileSourceSystem → FileSourceCollection, Data Fusion expects you to define:

  • the schema,
  • fields,
  • metadata,
  • and how the pipeline should interpret the raw data.

The source schema node appears on the canvas automatically because the platform assumes that configuring the schema is the next required step after the Source Collection is defined.

Sync Source Files

After creating a Source Collection, use Sync source files to discover and register files from the configured location.

When a Source Collection is added for the first time, you must run Sync source files. Until a sync is performed, the Source Collection will not display any files.

Syncing performs the following actions:

  • Scans the configured path
  • Registers available files in the Source Collection
  • Detects newly added, modified, or removed files

After the initial sync, running this action is optional. It is typically used to refresh the file list before you execute the pipeline.
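
The detection of added, modified, and removed files can be thought of as a comparison between the files already registered and a fresh scan of the path. This topic does not describe how Data Fusion implements the sync, so the Python below is a generic sketch under that assumption. SyncResult and diff_listings are hypothetical names, and each listing is assumed to map a file path to a fingerprint such as a last-modified timestamp or a content hash.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyncResult:
    added: List[str] = field(default_factory=list)
    modified: List[str] = field(default_factory=list)
    removed: List[str] = field(default_factory=list)


def diff_listings(registered: Dict[str, str], scanned: Dict[str, str]) -> SyncResult:
    """Compare two {path: fingerprint} listings and report what changed.

    The fingerprint can be any value that changes when the file changes
    (an assumption of this sketch).
    """
    result = SyncResult()
    for path, fingerprint in scanned.items():
        if path not in registered:
            result.added.append(path)
        elif registered[path] != fingerprint:
            result.modified.append(path)
    result.removed = [path for path in registered if path not in scanned]
    # Sort each list so the report is stable regardless of scan order.
    for bucket in (result.added, result.modified, result.removed):
        bucket.sort()
    return result
```

On a first sync the registered listing is empty, so every scanned file lands in added. This mirrors the behavior described above, where a new Source Collection shows no files until a sync has run.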
