Chunk Source Files
Data ingestion in C3 AI involves importing data from various sources into the platform for use in analytics, modeling, and other applications. The efficiency of this process is crucial for maintaining throughput, especially when dealing with large datasets. You can optimize data ingestion through greater parallelization by chunking source files differently.
In the context of data ingestion in C3 AI, a source refers to a file or dataset that is imported into the platform for processing, analysis, or integration into the application's data model. A source file can arrive in various formats, such as CSV, JSON, or other structured or unstructured data.
Parallel processing
This involves breaking down each source file into smaller chunks that can be processed simultaneously across multiple threads or processes. In C3 AI, you can configure the ingestion process to handle multiple data streams at once, significantly reducing the time required to ingest large volumes of data.
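The idea can be illustrated outside the platform with a small sketch in plain JavaScript. Both `ingestChunk` and `ingestAll` here are hypothetical stand-ins, not C3 AI APIs: the point is only that chunks are submitted together and processed concurrently rather than one after another.

```javascript
// Hypothetical stand-in for ingesting one chunk; resolves with a record count.
async function ingestChunk(chunk) {
  // In a real pipeline this would parse and persist the chunk's records.
  return chunk.split("\n").length;
}

// Process all chunks concurrently instead of sequentially.
async function ingestAll(chunks) {
  const counts = await Promise.all(chunks.map(ingestChunk));
  return counts.reduce((total, n) => total + n, 0);
}
```

With sequential processing, total latency is roughly the sum of per-chunk latencies; with `Promise.all`, it approaches the latency of the slowest chunk.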
Chunking data
Chunking makes processing more efficient by dividing each large source file into smaller, manageable pieces (chunks) that can be ingested in parallel. Instead of waiting for one chunk to finish processing before starting the next, the system works on multiple chunks simultaneously.
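As a rough illustration (plain JavaScript, not a C3 AI API), splitting a byte buffer into fixed-size chunks looks like this:

```javascript
// Split a buffer into fixed-size chunks; the final chunk may be smaller.
function splitIntoChunks(buffer, chunkSizeBytes) {
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkSizeBytes) {
    chunks.push(buffer.subarray(offset, offset + chunkSizeBytes));
  }
  return chunks;
}
```

Each resulting chunk can then be queued and ingested independently of the others.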
Using the chunkSizeInMb and minByteSizeToChunk parameters
The minByteSizeToChunk parameter is an important configuration option for chunking data in C3 AI. It specifies the minimum size, in bytes, that a source file must reach to be eligible for chunking during ingestion. In short, minByteSizeToChunk controls whether data is chunked at all.
The chunkSizeInMb parameter indicates the chunk size, in MiB, used when chunking the content and publishing it to the queue. It defines the granularity of data processing and can be adjusted to meet performance requirements or resource constraints.
The default chunk size is 5 MiB (5,242,880 bytes). If a source file's content length falls below the minimum byte threshold, the file is treated as a single unit. This is beneficial for smaller datasets, where chunking would introduce unnecessary overhead.
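To make the interaction concrete, here is a minimal sketch of the decision, assuming the defaults described above. The `planChunks` helper is hypothetical, not platform code: a file below the minimum byte size stays whole, and anything larger is divided into chunks of chunkSizeInMb.

```javascript
const MIB = 1024 * 1024;

// Hypothetical sketch of the chunking decision: files below
// minByteSizeToChunk are ingested as a single unit; larger files
// are split into ceil(size / chunkSizeInMb) chunks.
function planChunks(
  contentLengthBytes,
  { minByteSizeToChunk = 5 * MIB, chunkSizeInMb = 5 } = {}
) {
  if (contentLengthBytes < minByteSizeToChunk) {
    return 1; // small file: processed as a whole
  }
  return Math.ceil(contentLengthBytes / (chunkSizeInMb * MIB));
}
```

For example, a 12 MiB file with a 5 MiB chunk size yields three chunks, while a 1 KiB file is never chunked under the default threshold.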
In the following sections, you learn how to configure these parameters.
Configuring the chunk size
If the data should be chunked, configure the chunk size for data ingestion using the SourceCollection.Config type to optimize data processing and publishing. In addition to minByteSizeToChunk and chunkSizeInMb, the following fields on SourceCollection.Config control chunking behavior:
- chunkSize – Indicates the chunk size, in number of records, to use when chunking the content and publishing it to the queue.
- doNotChunk – Overrides the other chunking parameters. If doNotChunk is set to true, the minByteSizeToChunk parameter has no effect, and source files are not chunked regardless of their size. This lets you disable chunking entirely if desired.
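The precedence among these fields can be sketched as follows. This is a simplified model for illustration, not the platform's actual implementation, although the field names mirror SourceCollection.Config: doNotChunk wins over everything else, and only then is the minimum-size threshold consulted.

```javascript
// Simplified model of how the config fields interact; the logic here
// is illustrative only.
function shouldChunk(contentLengthBytes, config) {
  if (config.doNotChunk) {
    return false; // overrides all other chunking parameters
  }
  return contentLengthBytes >= config.minByteSizeToChunk;
}
```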
There are two ways to set a SourceCollection.Config for your application. You can define it in the application code itself by creating a JSON configuration file in the config subdirectory of your application package. Alternatively, to change the configuration of a deployed application, you can call the setConfigValue() method on the SourceCollection.Config in the console. This is useful when you want different settings for SNEs, QA, or production environments.
Method 1 – Define the SourceCollection.Config instance
A SourceCollection.Config instance must share the same name as the FileSourceCollection that it is being defined for. In the following example, the name of the FileSourceCollection is assumed to be "MySourceCollection".
In the following file structure, make a file called "MySourceCollection.json".
C3-AI-application
|
|____config/
|_____ SourceCollection.Config/
|_____MySourceCollection.json

Next, place the following content in "MySourceCollection.json":
{
"name": "MySourceCollection",
"minByteSizeToChunk": 724,
"chunkSize": 1000,
"chunkSizeInMb": 5
}

The JSON example provided here is for illustrative purposes only. You may not need to set all three parameters simultaneously. Additionally, the name of the file must match the name of the SourceCollection.Config.
After saving the file, you can view the new configuration in the console with the following commands:
// Get the FileSourceCollection instance
var fsc = FileSourceCollection.forName('MySourceCollection');
// Get the corresponding SourceCollection.Config
var config = fsc.config();
// Show the config in console
config

Method 2 – Load the configuration and utilize the APIs
A SourceCollection.Config can also be set or modified in the console of a deployed application with the setConfigValue method. For example, assume that chunkSizeInMb is currently set to 5 and you wish to change it to 3. The following code does that for a FileSourceCollection called 'MySourceCollection':
// Get the FileSourceCollection instance
var fsc = FileSourceCollection.forName('MySourceCollection');
// Get the corresponding SourceCollection.Config
var config = fsc.config();
// Set the new config
config.setConfigValue('chunkSizeInMb', 3);
// Retrieve the new config to validate
config.getConfig()

Summary
The minByteSizeToChunk and chunkSizeInMb parameters play a crucial role in managing data ingestion in C3 AI. By ensuring that only sufficiently large source files are chunked, they enhance performance and efficiency. Understanding and configuring these parameters correctly allows you to build optimized ingestion workflows tailored to your data's characteristics.