Vector Store
The C3 Agentic AI Platform supports pgvector, an extension for PostgreSQL that enables efficient storage, retrieval, and manipulation of vector data. It provides vector data management and querying capabilities within your existing database environment.
Vectors play a critical role in many AI and machine learning applications, especially in tasks involving similarity search, such as recommendation systems and image recognition. Vectors are also important in generative AI tasks, like retrieval-augmented generation, where they help source relevant text to inform content creation. Vector store integration on the C3 Agentic AI Platform caters to developers who build generative AI applications by facilitating efficient vector operations at scale.
The C3 Agentic AI Platform offers the following benefits:
- Seamless Integration — Since PostgreSQL is part of the standard platform deployment, incorporating `pgvector` utilizes familiar infrastructure and requires minimal additional configuration.
- Scalability — The C3 Agentic AI Platform is optimized for handling large-scale vector data. `pgvector` supports advanced indexing mechanisms like IVFFlat and HNSW, which are suitable for enterprise-level applications.
- Flexibility — The platform accommodates a range of vector operations, from precise k-nearest neighbor (KNN) searches to approximate nearest neighbor (ANN) searches using different indexing strategies.
Use the C3 Agentic AI Platform to access key features:
- Vector storage — Manage vectors as primary entities within your database environment.
- Advanced indexing — Enhance search speed and accuracy using sophisticated indexing techniques.
- Comprehensive search capabilities — Run precise or approximate searches to find vectors most similar to your query vectors.
Configure an Entity Type to store vector embeddings
Vectors can be integrated into any PostgreSQL table by defining an Entity Type that includes the following details:
- Vector Field Annotation — Apply a `vector` annotation directly to the vector field in your entity definition to specify necessary attributes. This annotation should include:
  - `dimension` — The dimensionality of the vector, indicating the number of elements in the vector. This parameter is crucial since it defines the size and structure of the vector data.
- Index Configuration Annotation — Use a `db` annotation to specify the index settings for the vector field. This annotation allows for flexible index configuration based on the indexing method chosen.
  - `fields` — An array that details each field to be indexed. Each entry in this array must include:
    - `name` — The name of the vector field (for example, `emb`) for which the index is created.
    - `opClass` — The distance metric used for indexing, such as `L2` for Euclidean distance.
Depending on the type of index used, additional parameters are required.

For HNSW indexes:
- `neighborsCount` — Specifies the number of nearest neighbors to consider, impacting the search's performance and accuracy.
- `efConstruction` — Defines the size of the dynamic candidate list during index construction, influencing the build speed and the quality of the index.
For IVFFlat indexes:
- `clustersCount` — Determines the number of clusters into which the vector space is divided, affecting the granularity and efficiency of the search.
If no specific index-related parameters are set within the annotation, the vector search will default to using a simple k-nearest neighbors (KNN) search.
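When that default applies, the search conceptually reduces to scanning every stored vector and keeping the closest matches. The following is a minimal plain-JavaScript sketch of such an exact KNN search; all names here are illustrative, not platform APIs:

```javascript
// Exact (index-free) KNN: compute the L2 distance from the query to
// every stored vector, then return the k closest.
function l2Distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

function knnSearch(vectors, query, k) {
  return vectors
    .map((v, idx) => ({ idx, dist: l2Distance(v, query) }))
    .sort((a, b) => a.dist - b.dist) // ascending: closest first
    .slice(0, k);
}

// Example with 2-D vectors for readability (real embeddings are e.g. 768-D).
const stored = [[0, 0], [1, 1], [5, 5], [2, 2]];
const top2 = knnSearch(stored, [1, 0], 2);
console.log(top2.map(r => r.idx)); // indices of the two nearest vectors
```

This linear scan is exact but touches every row, which is why the index types described below matter at scale.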
Here is an example Type used to store vector embeddings:
```
/**
 * This is an example Type.
 */
@db(index = [{fields:[{name:emb, opClass:L2}], neighborsCount:16, efConstruction: 40}])
entity type VectorTestType {
  /**
   * The ID of the doc.
   */
  id: ~
  /**
   * The text of the doc.
   */
  text: string
  /**
   * The vector embedding.
   */
  @vector(dimension=768)
  emb: string
}
```

For more information on creating an entity Type, see Platform Types in the C3 AI Type System.
Insert and retrieve vector data
Since the data is modeled as an Entity Type, it can be loaded using standard data integration processes.
While directly loading vector data is supported, it is often not common practice. Typically, vector data is generated as an output from complex data processing pipelines. These pipelines involve various preprocessing steps, such as normalization, tokenization, or feature extraction, followed by converting raw data into vector form using specific embedding models before storage.
Although the C3 Agentic AI Platform does not natively support all the preprocessing and transformation steps required for converting text data into vectors as a single automated process, it does facilitate the hosting of embedding models using the C3 AI Model Inference Service (MIS). Additionally, developers have the flexibility to integrate any necessary libraries to handle other preprocessing tasks within the platform's development environment. This setup allows you to create a comprehensive pipeline that includes both custom preprocessing logic and embedding, utilizing C3 AI's platform capabilities. The resulting vectors can then be inserted into the pgvector table, making full use of the platform's environment to manage the entire data preparation and embedding process. For more information on MIS, see the C3 AI Model Inference Service Administration topic.
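As a rough illustration of the embedding step in such a pipeline, the following plain-JavaScript sketch normalizes text and maps it to a fixed-dimension, L2-normalized vector. The hashing "embedding" here is a toy stand-in for a real model hosted on MIS; every name in this snippet is illustrative, not a platform API:

```javascript
// Toy embedding pipeline: tokenize text, bucket tokens into a
// fixed-dimension vector, then L2-normalize the result.
const DIMENSION = 8; // real text embeddings are typically 768-D or larger

function embedText(text) {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const vec = new Array(DIMENSION).fill(0);
  for (const tok of tokens) {
    // Simple rolling hash assigns each token to a vector bucket.
    let h = 0;
    for (const ch of tok) h = (h * 31 + ch.charCodeAt(0)) % DIMENSION;
    vec[h] += 1;
  }
  // L2-normalize so cosine and inner-product comparisons behave consistently.
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0)) || 1;
  return vec.map(x => x / norm);
}

const emb = embedText("Vector stores enable similarity search");
console.log(emb.length); // matches DIMENSION
```

In a real deployment, `embedText` would call the hosted embedding model, and the resulting vector would be written to the `emb` field of the Entity Type through standard data integration.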
After data is loaded, it can be queried using existing C3 AI APIs. For more information, see the Fetch and Filter Data From the Database notebook.
Build a vector index
To optimize the retrieval of vector data, you can build a vector index on your dataset. Indexing is crucial for improving search performance and efficiency, especially with large datasets. To trigger the indexing process on your vector data, use the following API commands in the static console of your application:
```
// Retrieve the root table name for the vector data
var tableName = VectorTestType.meta().rootC3TableName();

// Rebuild the index on the specified table
DbAdmin.rebuildIndex(tableName);
```

Building a vector index can be a time-consuming process, particularly for large datasets. The time required depends on several factors:
- Indexing strategy — The choice between IVFFlat and HNSW impacts both the build time and the search performance.
  - IVFFlat is known for faster index build times and lower memory usage, making it advantageous for databases with up to a million entries. It works by dividing vectors into clusters, enabling more targeted and resource-efficient searches.
  - HNSW uses a multi-layered graph structure that requires more time and memory to build, but offers superior speed-recall tradeoffs. It is particularly effective in scenarios with sparse initial data or when data is gradually accumulated, and it is less sensitive to dataset size changes, which reduces the need for frequent re-indexing.
- Dataset Size — The volume and dimensionality of vectors significantly impact the time required to build the index.
- System Resources — The available compute power and memory can drastically influence the speed of indexing.
Perform similarity search
To perform a similarity search, use the following API commands in the static console of your application:
```
var queryVector = '[1, 2, 3, 4, 5]'; // Define the query vector
var distanceMetric = "L2";           // Specify the distance metric to use

// Build the evaluation specification
var spec = EvalSpec.builder()
    .projection("docid, emb") // Specify fields to retrieve
    .order("vectorDistance('emb', " + queryVector + ", {'metric': '" + distanceMetric + "'})") // Order results by vector distance
    .limit(3) // Limit the number of results returned
    .build();

// Execute the evaluation with the specified spec
VectorTestType.eval(spec);
```

- `queryVector` defines the vector against which other vectors are compared.
- `distanceMetric` specifies the method used to compute distances. Available options are:
  - `"L2"` for Euclidean distance
  - `"INNER_PRODUCT"` for dot product
  - `"COSINE"` for cosine similarity
The `order` parameter in the `EvalSpec` directs the function to compute the distance between each vector in the `emb` column and the `queryVector` using the specified metric. The results are then ordered by these distances, from the most relevant match to the least relevant. The `limit` field restricts the number of results returned, effectively controlling the top 'k' results in the similarity search.
The `nprobes` parameter for the IVFFlat index and the `ef_search` parameter for the HNSW index are currently configured at the PostgreSQL instance level, with default values of 1 and 40, respectively.
Optimize pgvector usage
Several advanced strategies can enhance vector similarity searches. The following sections describe key practices for optimizing pgvector's performance in any deployment environment.
Use pg_prewarm
Use the `DbAdmin.pgPrewarm(tableName)` API to prewarm tables that are frequently accessed. This is particularly useful for reducing I/O latency at the start of heavy operations.
Schedule prewarming during periods of low demand to ensure that the data is ready in the cache when needed without impacting peak time performance.
Indexing Strategies
Refer to the table below for an overview of the key indexing strategies, their respective characteristics, and ideal use cases.
| Index Type | IVFFlat | HNSW |
|---|---|---|
| Description | IVFFlat uses a k-means clustering algorithm to create a specified number of clusters or "lists." Each vector in the database is assigned to the nearest cluster based on distance metrics like L2 or cosine similarity. | Hierarchical Navigable Small Worlds (HNSW) uses a multi-layered graph structure to optimize both the speed and accuracy of search queries. It supports dynamic data updates without the need to rebuild the entire index. |
| Setting Parameters | Lists (Centers): The number of clusters or lists defines how many centers your data is divided into. A larger number of lists increases the granularity of indexing but may lead to longer index build times and larger index sizes. Probes: This parameter defines how many lists are searched during a query. Increasing the number of probes can improve recall but may reduce query performance due to the additional computation required. | M (Max Links per Node): This parameter affects the number of bidirectional links each element has. Higher values can improve recall but increase index size and build time. ef_construction: Controls the size of the dynamic candidate list used during the index construction, influencing how thorough the construction process is. ef_search: Used during search queries to define the size of the dynamic candidate list, impacting the depth of the search and consequently recall and performance. |
| Use Cases | High Recall Requirements: Increase the number of probes to ensure more lists are checked, improving the chances of finding relevant results. Rapid Query Performance: Reduce the number of lists and probes for faster query times, suitable for environments where response time is critical. | Large, Static Datasets: HNSW is highly efficient for large datasets where the index does not need frequent updates, thanks to its deep linking and fast search capabilities. High Dimensional Data: Effective for high-dimensional vector spaces due to its hierarchical structure, which efficiently narrows down the search space. |
| Performance Considerations | Generally faster to build and less complex in structure, making it suitable for dynamic environments where indexes must be rebuilt or updated frequently. | Typically more time-consuming and resource-intensive to build, but offers superior query performance, especially in high-dimensional spaces. |
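To make the IVFFlat mechanics concrete, here is a toy plain-JavaScript sketch of cluster-probed search. In practice the cluster centers come from k-means over the stored vectors; all names in this snippet are illustrative:

```javascript
// IVFFlat-style search sketch: vectors are pre-assigned to the nearest
// cluster center, and a query scans only the `probes` closest clusters
// instead of the whole table.
function l2(a, b) {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

// Rank cluster centers by distance to the query and keep the top `probes`.
function nearestClusters(centers, query, probes) {
  return centers
    .map((c, idx) => ({ idx, dist: l2(c, query) }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, probes)
    .map(c => c.idx);
}

// Gather candidates only from the probed clusters, then rank them exactly.
function ivfflatSearch(clusters, centers, query, probes, k) {
  const candidates = nearestClusters(centers, query, probes)
    .flatMap(idx => clusters[idx]);
  return candidates
    .sort((a, b) => l2(a, query) - l2(b, query))
    .slice(0, k);
}

// Two clusters around (0,0) and (10,10).
const centers = [[0, 0], [10, 10]];
const clusters = [[[0, 1], [1, 0]], [[9, 10], [10, 9]]];
const result = ivfflatSearch(clusters, centers, [9, 9], 1, 1);
console.log(result); // a vector from the cluster near (10, 10)
```

Raising `probes` widens the candidate pool (better recall, more computation); raising the number of clusters narrows each list (finer granularity, longer build time), which mirrors the tradeoffs in the table above.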
Distance metrics
Refer to the table below for a comparison of different metrics and their applications.
| Distance metric | Description | Characteristics | When to use |
|---|---|---|---|
| Euclidean (L2) | Euclidean distance measures the straight-line distance between two vectors in a multidimensional space. It is calculated by taking the square root of the sum of the squared differences between corresponding components of the two vectors. | Sensitive to the magnitudes and direction of the vectors. Affected by the scale of vector components. | Suitable for systems where the magnitude of vector components (for example, counts or measures) is significant and could influence the similarity measurement. Often used with models that employ basic vector encoding methods like Locality Sensitive Hashing (LSH) or those not trained with a specific loss function related to vector orientation. |
| Cosine Similarity | Measures the cosine of the angle between two vectors, focusing purely on direction rather than magnitude. The result ranges from -1 (exactly opposite), 0 (orthogonal), to 1 (exactly the same direction). | Independent of vector magnitude; sensitive only to vector direction. Useful in comparing the orientation or "angle" between vectors. | Ideal for text-related applications such as semantic search and document classification where the direction of the vectors (representing word or document embeddings) is more relevant than their magnitude. Also Appropriate for recommendation systems that suggest items based on similarity in user behavior or item characteristics, particularly when the magnitude (for example, frequency of interaction) is not related to vector orientation. |
| Dot Product Similarity | Involves multiplying corresponding components of two vectors and summing the results. It can also be viewed as the product of the vectors' magnitudes and the cosine of the angle between them, providing a measure that incorporates both direction and magnitude. | Sensitive to both the magnitudes and the directions of the vectors. The sign and magnitude of the dot product convey the angle and the proportionality of vector lengths, respectively. | Frequently used in systems trained with algorithms that optimize based on dot products, such as certain types of neural networks and matrix factorization techniques in recommendation systems. Suitable for applications where both the direction and magnitude of vectors are crucial, such as collaborative filtering, where the dot product of user and item vectors can predict user preferences based on both interest (direction) and intensity of preference (magnitude). |
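The metrics in the table can be implemented directly. The following plain-JavaScript sketch shows, in particular, that cosine similarity ignores magnitude (scaling a vector leaves it unchanged), while L2 distance and dot product are sensitive to it:

```javascript
// The three metrics from the table, implemented for comparison.
function dot(a, b) {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function l2Distance(a, b) {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const u = [1, 2, 3];
const scaled = [2, 4, 6]; // same direction, double the magnitude

console.log(cosineSimilarity(u, scaled)); // ≈ 1: direction identical
console.log(l2Distance(u, scaled));       // > 0: magnitude differs
console.log(dot(u, scaled));              // sensitive to both
```

This is why cosine similarity suits normalized text embeddings, while L2 or dot product are preferable when the embedding model was trained so that magnitude carries meaning.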