
Virtualization

Data virtualization in the C3 Agentic AI Platform allows you to connect to external data sources and access source data directly without loading data into the platform. This allows you to build C3 AI applications that leverage the investments your organization has made in building out a data warehouse, data lake, or other data infrastructure. When virtualization is enabled, database queries issued by the application are pushed to the source system.

The platform includes dozens of out-of-the-box connectors and a simple framework for integrating new sources to use in an enterprise AI application. Virtualizing your data allows you to:

  • Take advantage of the encapsulation benefits of the Type system
  • Keep your data centralized in a unified federated image
  • Avoid incurring the extra compute and storage costs associated with a traditional ETL

The C3 Agentic AI Platform supports virtualization capabilities for dozens of source systems out of the box, including:

  • Data lake and data warehousing technologies like Snowflake, AWS Redshift, and Delta Lake/Databricks
  • RDBMS such as PostgreSQL and Oracle
  • NoSQL databases like Apache HBase, Azure CosmosDB, and MongoDB

When to use virtualization

Whether to virtualize or persist a data source depends on the needs of your application, your data requirements, and your system architecture. Consider virtualizing data when:

  • The source system has a performant, stable data model that aligns with the needs of the C3 application.
  • The source system's data model is unlikely to change frequently.
  • Real-time or near-real-time data access is critical.
  • The data requires little or no transformation or preprocessing.
  • The data volume is low, or the query load on the source system is manageable.

Data virtualization is also useful when the source system lacks a robust mechanism for tracking data updates, which makes incremental data ingestion into C3 challenging.

Consider persisting data in C3 when:

  • The uptime of the C3 application is critical, and your application must be self-reliant.
  • Your C3 application supports specific use cases or query patterns (for example, certain data science or UI requirements) that must be optimized for performance.
  • Your C3 base application requires a specific data model for scalability, reliability, and performance.
  • The data volume is large, or the queries are complex and require significant processing power.

The following sections describe additional considerations for using virtualization on the platform.

No duplication of data

Virtualizing data means that you do not have to store multiple copies of the same data across different systems. This can significantly reduce storage costs, since you only reference the data instead of duplicating it.

Data is accessed directly from the source, so any updates made to the source data are immediately reflected in the application without the need for synchronization or replication processes.

By avoiding the complexities associated with data duplication and synchronization, managing data becomes simpler. Virtual Types abstract the intricacies of data storage and provide a straightforward interface for data access.

Organizations can establish clear governance over data access without needing to manage multiple copies of the same dataset. This can simplify compliance with data protection regulations.

Real-time data access

External Types provide real-time or near-real-time access to data. This is crucial for applications that rely on up-to-date information, such as monitoring dashboards or dynamic reporting systems. Access to real-time data can lead to faster decision-making processes, enhancing operational efficiency and responsiveness to market changes.

Disadvantages of virtualization

There are some limitations to virtualizing your data on the platform. Consider the following sections when building your data pipelines.

No stored calculations

Because External Types often do not store data within the application, any calculations or aggregations must be computed on the fly during each query. This can lead to increased computational overhead and slower response times, particularly for complex queries or large datasets.

The absence of pre-computed values means that performance can suffer during data retrieval, especially when multiple calculations are required for the data being accessed. You may consider persisting data or adopting a hybrid approach when managing your data pipelines.

No support for hierarchies

Hierarchies are generally not supported in virtualization or External Types.

  • Hierarchies require a well-defined structure that outlines relationships between different data entities. Virtualized or External Types, which often reference data from separate systems or sources, may not maintain the necessary structure or constraints to support hierarchical relationships effectively.
  • Hierarchical structures often depend on the enforcement of referential integrity to ensure that parent-child relationships remain intact. Since External Types do not enforce these relationships, it becomes difficult to manage and validate hierarchies.
  • Data in external sources can change independently of the C3 AI application. This dynamic nature can lead to inconsistencies in hierarchical relationships, making it challenging to enforce and manage hierarchies reliably.

No support for timed values

Timed values are generally not supported in External Types or virtualization within the platform. Virtualization on the platform is largely incompatible with the requirements for managing temporal data effectively.

  • Timed values typically rely on a well-defined and structured data model, which includes aspects like versioning and history tracking. External Types and virtualization do not enforce a strict schema.

  • Virtualization often focuses on providing real-time access to data without storing historical context. Timed values inherently require historical data management to track changes over time, which virtualization does not provide.

  • Implementing timed values can introduce performance overhead in terms of storage and processing. Virtualized environments prioritize speed and efficiency in data access, which may conflict with the additional complexity of managing timed values.

  • Timed values often need to maintain referential integrity with other data points over time. Virtualized Types may not support this level of relational integrity.

    For example, TimedRelations and TimedIntervalRelations are not supported for External Types. This may limit the granularity of time-bound analyses for maintenance predictions.

No support for Parametric Types

Parametric Types allow the definition of a type that can take one or more parameters, enabling more flexible and reusable data models. Unfortunately, Parametric Types are generally not supported for External Types.

Worse metric performance

Accessing data from external sources can introduce latency due to network calls or other factors. Latency or performance issues with your external data source can lead to poor performance for processing data in C3.

Create connections to external data

Use the SqlSourceSystem Type, along with a JdbcCredentials Type stored in the application's JdbcStore, to connect to external database systems from your application.

Model an external database system

A SqlSourceSystem Type instance models the external database system in an application. For example, if your table MYTABLE lives in an external database system and you have defined an External Entity Type that correctly models the schema of the table, define a SqlSourceSystem with a .json file in the ./metadata/SqlSourceSystem/ folder of your package as follows:

JSON
{
    "name": "My External Database System"
}

Connect to an external database system

JdbcCredentials is used to authorize the connection to an external database system. The platform offers JDBC connectors to many external systems, such as Databricks, Snowflake, or MS SQL Server. For an application to use this credential at runtime, it must be added to the JdbcStore of the application.

For example, to connect to an external Snowflake database, run the following code snippet:

JavaScript
var credentials = JdbcCredentials.fromServerEndpoint("<my_account>.snowflakecomputing.com", -1, DatastoreType.SNOWFLAKE,
    "<table>", null, "<username>", "<password>");
JdbcStore.forName("My External Database System").setCredentials(credentials);
JdbcStore.forName("My External Database System").setExternal();

See the DatastoreType Type for a comprehensive list of supported external data stores.

Validate the external database connection

To validate that the credential has been set, run the following:

JavaScript
SqlSourceSystem.forName("My External Database System").ping()

The connection has been configured if the function returns { "reachable": true }.
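
A setup script may want to stop early when the connection check does not pass. The following is a minimal sketch, assuming only that ping() returns an object with a boolean reachable field as described above. The helper name isConnectionConfigured is hypothetical and is not part of the platform API.

```javascript
// Hypothetical helper (not a platform API): interprets the object returned by
// SqlSourceSystem.forName(...).ping(), assuming the { "reachable": true } shape shown above.
function isConnectionConfigured(pingResult) {
    // A missing result, or a missing or false "reachable" field, counts as not configured.
    return Boolean(pingResult) && pingResult.reachable === true;
}

// In the C3 AI Console you would pass the real ping result, for example:
//   isConnectionConfigured(SqlSourceSystem.forName("My External Database System").ping())
// A stand-in object is used here so the snippet runs anywhere.
var sampleResult = { reachable: true };
console.log(isConnectionConfigured(sampleResult));
```

Checking for a strict true value means that an error object or an empty response is never mistaken for a working connection.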

Use external data in an application

After connecting to an external database system, use the External and SqlSourceCollection Types to model external data in your application.

Create an External Entity Type

Entity Types make up the operational data model of an application. When an Entity Type mixes the External Type, this indicates that the data of this type lives in an external database management system. These types still mix Persistable, so all the same APIs for data access and manipulation are still available. At runtime, the database engine of the C3 Agentic AI Platform generates SQL with a syntax that is understood by the source system, and pushes that query to the external system.

External Entity Types must declare a schema that matches the schema of the external database system using the schema name keywords. For example, imagine that you have an external SQL database table MYTABLE in a schema called MYSCHEMA with the following columns:

  • ID: varchar(55)
  • FIELD_1: datetime
  • FIELD_2: varchar(55)
  • FIELD_3: int

To model this external database table in an application, define an External Entity Type with a .c3typ file in the ./src/ directory of your package as follows:

Type
entity type MyExternalType mixes External, NoSystemCols schema name "[MYSCHEMA].[MYTABLE]" {
    id:     ~ schema name "ID"
    field1: datetime schema name "FIELD_1"
    field2: string schema name "FIELD_2"
    field3: int schema name "FIELD_3"
}
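
To make the role of the schema name keywords concrete, the following toy sketch shows how field names on the Type correspond to column names in the table when a query is built. It is an illustration only: the buildSelect function and the mapping object are hypothetical, and the SQL that the platform actually generates for a given source system may look different.

```javascript
// Hypothetical illustration only: mirrors the schema name declarations of MyExternalType.
var myExternalTypeSchema = {
    table: "[MYSCHEMA].[MYTABLE]",
    columns: { id: "ID", field1: "FIELD_1", field2: "FIELD_2", field3: "FIELD_3" }
};

// Builds a SELECT statement for the requested Type fields by translating each
// field name to its external column name. Throws if a field has no mapping.
function buildSelect(schema, fields) {
    var columns = fields.map(function (field) {
        if (!Object.prototype.hasOwnProperty.call(schema.columns, field)) {
            throw new Error("No schema name declared for field: " + field);
        }
        return schema.columns[field];
    });
    return "SELECT " + columns.join(", ") + " FROM " + schema.table;
}

console.log(buildSelect(myExternalTypeSchema, ["id", "field3"]));
// SELECT ID, FIELD_3 FROM [MYSCHEMA].[MYTABLE]
```

The point of the sketch is that application code refers only to id and field3, while the external names ID and FIELD_3 stay confined to the Type definition.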

The data fields of an External Type must map to the primitive data types of the C3 Agentic AI Platform as follows:

External Database Column Types                                 | Type System Primitive Types
TINYINT, SMALLINT, INTEGER                                     | Integer
BIGINT                                                         | LongInt
FLOAT, REAL                                                    | Float
DOUBLE                                                         | Double
NUMERIC                                                        | LongInt / Decimal
DECIMAL                                                        | Decimal
CHAR, VARCHAR, LONGVARCHAR, NCHAR, NVARCHAR, LONGNVARCHAR      | String
DATE, TIMESTAMP                                                | DateTime
BINARY, VARBINARY, LONGVARBINARY                               | Binary
BLOB                                                           | Binary
CLOB, NCLOB                                                    | String
BOOLEAN                                                        | Boolean
TIMESTAMP_WITH_TIMEZONE                                        | DateTime
BIT                                                            | Boolean

Other external database types, including the following, are not supported at this time:

  • TIME
  • NULL
  • JAVA_OBJECT
  • DISTINCT
  • STRUCT
  • ARRAY
  • REF
  • DATALINK
  • SQLXML
  • REF_CURSOR
  • TIME_WITH_TIMEZONE
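
Before writing an External Entity Type by hand, it can help to check a table description against the mapping above. The following sketch encodes the mapping as a plain lookup object. It assumes column types are given using the same names as in the mapping, so vendor-specific aliases and length qualifiers such as varchar(55) are not handled. The names PRIMITIVE_BY_COLUMN_TYPE, primitiveTypeFor, and columnsWithoutMapping are hypothetical and are not platform APIs.

```javascript
// Lookup built from the mapping above: external column type -> Type System primitive type.
var PRIMITIVE_BY_COLUMN_TYPE = {
    TINYINT: "Integer", SMALLINT: "Integer", INTEGER: "Integer",
    BIGINT: "LongInt",
    FLOAT: "Float", REAL: "Float",
    DOUBLE: "Double",
    NUMERIC: "LongInt / Decimal",
    DECIMAL: "Decimal",
    CHAR: "String", VARCHAR: "String", LONGVARCHAR: "String",
    NCHAR: "String", NVARCHAR: "String", LONGNVARCHAR: "String",
    DATE: "DateTime", TIMESTAMP: "DateTime", TIMESTAMP_WITH_TIMEZONE: "DateTime",
    BINARY: "Binary", VARBINARY: "Binary", LONGVARBINARY: "Binary", BLOB: "Binary",
    CLOB: "String", NCLOB: "String",
    BOOLEAN: "Boolean", BIT: "Boolean"
};

// Hypothetical helper: returns the primitive type name for a column type, or null
// when the column type is not listed in the mapping (for example TIME or ARRAY).
function primitiveTypeFor(columnType) {
    var key = String(columnType).trim().toUpperCase();
    return Object.prototype.hasOwnProperty.call(PRIMITIVE_BY_COLUMN_TYPE, key)
        ? PRIMITIVE_BY_COLUMN_TYPE[key]
        : null;
}

// Returns the names of the columns whose type has no entry in the mapping.
function columnsWithoutMapping(columns) {
    return columns
        .filter(function (column) { return primitiveTypeFor(column.type) === null; })
        .map(function (column) { return column.name; });
}

console.log(columnsWithoutMapping([
    { name: "ID", type: "varchar" },
    { name: "OPENED_AT", type: "time" },
    { name: "FIELD_3", type: "integer" }
]));
```

In this made-up table description, only OPENED_AT is reported, because TIME is one of the column types that is not supported.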

A SqlSourceCollection Type instance links your External Type to the SqlSourceSystem containing your table.

You can define a SqlSourceCollection with a .json file in the ./metadata/SqlSourceCollection/ folder of your package:

JSON
{
    "name": "MyExternalType",
    "source": "MyExternalType",
    "sourceSystem": {"name": "My External Database System"}
} 

Infer source schemas

The SqlSourceCollection#inferSourceType method can be used to simplify creating External Entity Types. For instance, the following code snippet generates an Entity Type definition that is inferred from the external database table. You can download the definition and save it as a .c3typ file in the ./src/ folder of your package:

JavaScript
var name = "MyExternalType";
var typeMeta = SqlSourceCollection.forName(name).inferSourceType2();
c3DL(typeMeta.toString(), "text/plain", name + ".c3typ")

This downloads a MyExternalType.c3typ file from the browser, with the following definition:

Type
entity type MyExternalType mixes External, NoSystemCols schema name "[MYSCHEMA].[MYTABLE]" {
    id:     string schema name "ID"
    field1: datetime schema name "FIELD_1"
    field2: string schema name "FIELD_2"
    field3: int schema name "FIELD_3"
}

When defining External Types, you must either map the id field to a column in the table or define a composite key. A composite key uses two or more columns together to uniquely identify each row in a table.

Composite keys and External Types

A composite key is a type of key that consists of two or more attributes (or columns) used together to uniquely identify a record in a table. When none of the individual attributes is sufficient by itself to uniquely identify records, a combination of them can be used.

For example, consider an external data source that contains a table, student.records, which stores university student enrollment records:

student_id | course_id | term   | grade
001        | 6.1903    | Fall   | A
001        | 18.05     | Fall   | B
002        | 18.05     | Fall   | B
003        | 18.06     | Spring | B
003        | 6.1020    | Fall   | A

Neither the student_id nor the course_id alone would be sufficient to uniquely identify a record since a student can be enrolled in multiple courses and a course can have multiple students. However, the combination of student_id and course_id would be unique for every record, making them suitable for a composite key.
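
The uniqueness argument above can be checked directly against the sample rows. The following sketch is plain JavaScript that does not use any platform API, and the isUniqueKey helper is a hypothetical name introduced only for this illustration.

```javascript
// Rows from the student.records example above.
var records = [
    { student_id: "001", course_id: "6.1903", term: "Fall", grade: "A" },
    { student_id: "001", course_id: "18.05", term: "Fall", grade: "B" },
    { student_id: "002", course_id: "18.05", term: "Fall", grade: "B" },
    { student_id: "003", course_id: "18.06", term: "Spring", grade: "B" },
    { student_id: "003", course_id: "6.1020", term: "Fall", grade: "A" }
];

// Returns true when the given combination of fields has a distinct value for every row.
function isUniqueKey(rows, fields) {
    var seen = new Set();
    for (var i = 0; i < rows.length; i++) {
        // Serializing the value array keeps ["a", "bc"] distinct from ["ab", "c"].
        var key = JSON.stringify(fields.map(function (field) { return rows[i][field]; }));
        if (seen.has(key)) {
            return false;
        }
        seen.add(key);
    }
    return true;
}

console.log(isUniqueKey(records, ["student_id"]));              // false
console.log(isUniqueKey(records, ["course_id"]));               // false
console.log(isUniqueKey(records, ["student_id", "course_id"])); // true
```

Running a check like this on a representative extract of the real table is a quick way to confirm that a candidate set of columns is suitable before declaring it as a composite key.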

CompositeKey definition:

Type
type MyKey mixes CompositeKey {
    student_id : string
    course_id : string
}

Type definition (arbitrary example):

Type
entity type MyCompKey mixes MyKey, External, NoSystemCols schema name 'default.student.records' {
    student_id : !string
    course_id : !string
    term : string
    grade: string
}

Run the following command in the C3 AI Console to view the results:

JavaScript
c3Grid(MyCompKey.fetch());

Notice in the table below that the id field is a concatenation of the student_id and course_id fields, separated by a # character.

row | id         | student_id | course_id | term   | grade
0   | 001#18.05  | 001        | 18.05     | Fall   | B
1   | 001#6.1903 | 001        | 6.1903    | Fall   | A
2   | 002#18.05  | 002        | 18.05     | Fall   | B
3   | 003#18.06  | 003        | 18.06     | Spring | B
4   | 003#6.1020 | 003        | 6.1020    | Fall   | A
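
The id values in this output can be reproduced outside the platform, which is handy when you need to look up a record by its composite id. The following sketch assumes, based only on the sample output above, that the id is the key field values joined with # in the order they are declared on the CompositeKey Type. The compositeId helper is hypothetical and is not a platform API.

```javascript
// Assumption: the id is the key field values joined with "#", in declaration order,
// as the sample output above suggests. This is not taken from a platform specification.
function compositeId(row, keyFields) {
    return keyFields.map(function (field) { return row[field]; }).join("#");
}

var row = { student_id: "001", course_id: "18.05", term: "Fall", grade: "B" };
console.log(compositeId(row, ["student_id", "course_id"]));
// 001#18.05
```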
