Sync and Process Files

The C3 Agentic AI Platform provides functionality to sync and process file data. Use the SourceFile Type to trigger data pipelines to integrate file data into the platform.

The platform allows you to load data from files with any of the following file extensions:

  • .csv

  • .json

  • .xml

  • .parquet

  • .avro

Sync files

You can use FileSourceSystem.fetch() and FileSourceCollection.fetch() to see the FileSourceSystem and FileSourceCollection instances, respectively, that are available in your application.
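For example, to list the available instances (the `.objs` field on the fetch result holds the returned instances, as in the processing examples later on this page):

```javascript
// List the FileSourceSystem instances available in this application
FileSourceSystem.fetch().objs;

// List the FileSourceCollection instances available in this application
FileSourceCollection.fetch().objs;
```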

To sync file metadata, you need to ensure that the FileSourceCollection that you are using to sync the file data is pointing to the intended directory on the connected file system:

JavaScript
var fscName = "<your_file_source_collection_name>";

FileSourceCollection.forName(fscName).inboxUrl();

By default, the inbox URL for a FileSourceCollection instance is a concatenation of the root URL of the associated FileSourceSystem instance and the name of the FileSourceCollection instance.

To change the inbox URL associated with a given FileSourceCollection instance, you can:

  • Override the root URL for the FileSourceSystem instance that the FileSourceCollection instance is connected to

  • Override the inbox URL for the FileSourceCollection instance directly

These URLs can be overridden on the metadata definitions in the package. See Declare Pipelines for File Sources.

Sync file metadata

You must sync file metadata before integrating file data into the platform. Syncing file metadata registers the file with the platform server.

After the inbox URL for the FileSourceCollection instance is pointing to the correct location, you can sync the file metadata in the platform using the SourceFile Type.

Before syncing files, it is recommended that you check which files, and how many, are available to sync.

JavaScript
FileSystem.countFiles(FileSourceCollection.forName(fscName).inboxUrl());
// Optionally list files
FileSystem.listFiles(FileSourceCollection.forName(fscName).inboxUrl()).files;

You can sync new files by URL or by reference.

Use the SourceFile.syncAll() method to sync new files by URL.

JavaScript
// Sync all files using a URL
SourceFile.syncAll(FileSourceCollection.forName(fscName).inboxUrl());

Files can also be re-synced by reference with the SourceFile.syncFiles() method after the associated SourceFile metadata has been created.
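As a sketch, assuming SourceFile.syncFiles() accepts a collection of previously synced SourceFile instances (check the SourceFile Type definition in your environment for the exact signature):

```javascript
// Fetch the SourceFile metadata previously synced for this collection
var sfs = SourceFile.fetch({filter: Filter.eq("sourceCollectionName", fscName).toString()}).objs;

// Re-sync the files by reference
SourceFile.syncFiles(sfs);
```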

Validate file schemas

Before processing file data, it is a good practice to validate that the file schema matches the expected source schema. The File Type provides several methods for reading the contents of a file. The following method extracts the first line of a synced file:

JavaScript
// Selecting just the first SourceFile instance
var sf = SourceFile.fetch().first();

C3.File.make(sf.contentLocation).readFirstLine();

You can compare this to the Source Type that is associated with the FileSourceCollection instance by inspecting the fields of the Source Type:

JavaScript
c3ShowType(FileSourceCollection.forName(fscName).source);

Process file data

After you have synced your files, the contents are ready to be loaded into the platform using the SourceFile Type. The SourceFile.process() method can be used to process an individual file:

JavaScript
// This example selects just the first SourceFile instance
var sf = SourceFile.fetch().first();

sf.process();

// Check the state of the SourceFile
sf.state();

It is also possible to process several files simultaneously using the SourceFile.processBatch() or SourceFile.processAll() methods:

JavaScript
// Select a subset of SourceFile instances. The limit is 100 for a given FileSourceCollection
var sfs = SourceFile.fetch({filter: Filter.eq("sourceCollectionName", fscName).toString(), limit: 100}).objs;

// Process the batch
SourceFile.processBatch(sfs);

// Process remaining SourceFile instances
SourceFile.processAll();

These methods initiate an asynchronous batch data loading activity. For more information on monitoring large-scale data loads, see Monitoring Data Loads.

Change process behavior

The sync and process APIs also accept a DataIntegSpec object as an argument, which lets you override data loading behavior at runtime. Passing in a DataIntegSpec object allows you to:

  • Specify the priority of the processing job

  • Disable source chunking for just this activity

  • Override the content extension of the files being processed

  • Specify the number of files to be processed in each batch

See the DataIntegSpec Type definition for a complete list of processing configurations that you can apply at runtime.
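As an illustrative sketch only, the field names below are hypothetical; consult the DataIntegSpec Type definition for the actual configuration fields:

```javascript
// Hypothetical field names -- verify against the DataIntegSpec Type definition
var spec = DataIntegSpec.make({
  priority: 1,           // priority of the processing job (hypothetical field)
  disableChunking: true  // disable source chunking for this activity (hypothetical field)
});

// Pass the spec to a process API call
SourceFile.processAll(spec);
```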
