Other Capabilities

Access control lists

Access control lists, or ACLs, may be enabled on SourceFiles and SourceCollections. For more information, see the access control topic.

To enable ACLs on SourceFiles, run the following:

JavaScript

Js.exec(
  "EnableAclPrivilege.forId('Genai.SourceFile').withField('enabled', true).merge({ mergeInclude: 'enabled' }); Genai.SourceFile.refreshAcls();"
);

To enable ACLs on SourceCollections, run the following:

JavaScript

Js.exec(
  "EnableAclPrivilege.forId('Genai.SourceCollection').withField('enabled', true).merge({ mergeInclude: 'enabled' }); Genai.SourceCollection.refreshAcls();"
);
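The two calls above differ only in the Type name, so the expression string can be generated rather than copied. The following is a minimal sketch under that assumption; `buildEnableAclExpression` is a hypothetical helper, not part of the Genai API, and the strings it returns would still be passed to Js.exec as shown above.

```javascript
// Hypothetical helper (not part of the Genai API): builds the expression
// string shown above for a given Type name.
function buildEnableAclExpression(typeName) {
  return (
    "EnableAclPrivilege.forId('" + typeName + "')" +
    ".withField('enabled', true)" +
    ".merge({ mergeInclude: 'enabled' }); " +
    typeName + ".refreshAcls();"
  );
}

// Build the expressions for both Types covered in this section.
const aclExpressions = ['Genai.SourceFile', 'Genai.SourceCollection'].map(
  buildEnableAclExpression
);

// Each string can then be passed to Js.exec(...) as in the examples above.
aclExpressions.forEach((expression) => console.log(expression));
```
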

Once ACLs are enabled, ensure the acl field is populated for the respective entity Type.

For SourceFiles

Retrievers

Genai.Retriever is the parent entity for all retriever types. Genai.Retriever.py has all common logic shared among retrievers.
Indexing happens asynchronously through the AsyncQueue.
Each new indexing action creates a Genai.Retriever.IndexAction, and these actions are started in order by Genai.Retriever.IndexAction.processQueues.
- This structure ensures that indexing/unindexing action A fully completes before B starts, to prevent conflicting updates to the index.
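The ordering guarantee described above can be illustrated with a small promise chain. This is a conceptual sketch only: `SerialActionQueue`, `makeAction`, and `runDemo` are hypothetical names, and the code does not reflect how Genai.Retriever.IndexAction.processQueues is actually implemented.

```javascript
// Conceptual sketch only: SerialActionQueue is a hypothetical class, not the
// Genai.Retriever.IndexAction implementation.
class SerialActionQueue {
  constructor() {
    // Tail of the promise chain; each new action is appended to it.
    this.tail = Promise.resolve();
  }

  // Enqueue an async action; it starts only after every earlier action settles.
  enqueue(action) {
    const result = this.tail.then(() => action());
    // Keep the chain alive even if an action fails, so later actions still run.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Simulated index/unindex work that records when it starts and finishes.
function makeAction(name, delayMs, log) {
  return () =>
    new Promise((resolve) => {
      log.push(name + ' start');
      setTimeout(() => {
        log.push(name + ' end');
        resolve(name);
      }, delayMs);
    });
}

async function runDemo() {
  const log = [];
  const queue = new SerialActionQueue();
  // A is slower than B, yet B does not start until A has finished.
  const a = queue.enqueue(makeAction('index A', 20, log));
  const b = queue.enqueue(makeAction('unindex B', 5, log));
  await Promise.all([a, b]);
  return log;
}

runDemo().then((log) => console.log(log.join(' -> ')));
```

Because each action is chained onto the previous one, two actions that touch the same index never overlap, which is the property the bullet above describes.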

Genai.Retriever.PgVector

Genai.Retriever.PgVector is the recommended retriever for all production deployments. All source passages are stored along with their embeddings in Genai.Vector.SourcePassage. Since Genai.Vector.SourcePassage is an entity type, Genai.Retriever.PgVector fully supports efficient filtering on similarity searches, and also supports incremental updates to the set of indexed passages.
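To make "filtering on similarity searches" and "incremental updates" concrete, here is an in-memory sketch of the idea. It is not the Genai.Retriever.PgVector implementation: the field names (`id`, `sourceFileId`, `embedding`) and the functions `cosineSimilarity` and `filteredSearch` are hypothetical, and a real deployment would presumably apply the filter and ranking in the database rather than in application code.

```javascript
// Conceptual sketch only: an in-memory stand-in, not Genai.Retriever.PgVector.
// Field names (id, sourceFileId, embedding) are hypothetical.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Apply the metadata filter first, then rank the remaining passages by similarity.
function filteredSearch(passages, queryEmbedding, filter, topK) {
  return passages
    .filter(filter)
    .map((p) => ({ id: p.id, score: cosineSimilarity(p.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const passages = [
  { id: 'p1', sourceFileId: 'fileA', embedding: [1, 0] },
  { id: 'p2', sourceFileId: 'fileB', embedding: [1, 0] },
  { id: 'p3', sourceFileId: 'fileA', embedding: [0, 1] },
];

// Incremental update: one passage is added without rebuilding the others.
passages.push({ id: 'p4', sourceFileId: 'fileA', embedding: [3, 4] });

// Only passages from fileA are considered, so p2 is excluded despite matching.
const hits = filteredSearch(passages, [1, 0], (p) => p.sourceFileId === 'fileA', 2);
console.log(hits.map((h) => h.id));
```
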

Genai.Retriever.Dense

Genai.Retriever.Dense is only recommended for experimentation with different in-memory vector stores. For production applications, use Genai.Retriever.PgVector.

All of the logic for it was authored by C3 AI, and thus it can be extended, changed, and corrected.
Genai.Retriever.Dense allows permutations of dense Embedder implementations and various vector stores for retrieval.
See Genai.Retriever.Dense.Embedder for currently supported Embedders.
- MXBAI is the default Embedder. It has a better average score on the English retrieval tasks of the MTEB leaderboard.
- E5 is the default. It takes the longest to calculate the embeddings, but it provides the best search results.
- DPR is faster to calculate than E5, slower than TAS-B, and provides results between the other two.
- TAS-B is the fastest, but it provides the poorest results.
See Genai.Retriever.Dense.RetrieverType for currently supported retrievers.

How to configure an application to be run in an air-gapped environment

To configure an application to run in an air-gapped environment, set the following configuration values so that each one points to the downloaded zip of the corresponding tokenizer or model.

JavaScript

Genai.App.AirGapConfig.inst().setConfigValues({
  attributorTokenizerPath: '<filepath>',
  nltkSentenceTokenizerModelFilePath: '<filepath>',
  msmarcoDistilbertBaseTasBFilePath: '<filepath>',
  sourceFileChunkerTokenizerPath: '<filepath>',
  tableTextSplitterEncoderPath: '<filepath>',
  tatrZipPath: '<filepath>',
  spacyModelPath: '<filepath>',
  tokenTextSplitterEncoderPath: '<filepath>',
  nougatZipPath: '<filepath>',
  dprQuestionEncoderPath: '<filepath>',
  dprContextEncoderPath: '<filepath>',
});

After setting the configurations, you can confirm that all the files are accessible using Genai.App.AirGapConfig.validateConfigs(). See Genai.App.AirGapConfig for more details.
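Before calling setConfigValues, it can help to catch entries that were left as placeholders. The sketch below is a local pre-check only, assuming the configuration is held in a plain object; `findUnsetPaths` is a hypothetical helper, the example paths are made up, and it does not replace Genai.App.AirGapConfig.validateConfigs(), which checks that the files are actually accessible.

```javascript
// Hypothetical pre-check (not part of Genai.App.AirGapConfig): returns the
// keys whose values are empty, non-string, or still the '<filepath>' placeholder.
function findUnsetPaths(config) {
  return Object.keys(config).filter((key) => {
    const value = config[key];
    return typeof value !== 'string' || value.trim() === '' || value === '<filepath>';
  });
}

// Example values only; the paths below are made up for illustration.
const airGapConfig = {
  attributorTokenizerPath: '/models/attributor-tokenizer.zip',
  spacyModelPath: '<filepath>',
  nougatZipPath: '',
};

const unset = findUnsetPaths(airGapConfig);
if (unset.length > 0) {
  console.log('Still unset: ' + unset.join(', '));
}
```
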

Engine configurations

Genai.UnstructuredQuery.Engine and Genai.Retriever.Dense both run as an Engine and may be given an Engine.DeploySpec#threadPool when starting. Any values specified in the Engine.DeploySpec#threadPool will override the defaults.

Genai.UnstructuredQuery.Engine

Genai.UnstructuredQuery.Engine may be initialized with a specific named Genai.UnstructuredQuery.Engine.Config and the Engine.DeploySpec#threadPool may be set through Genai.UnstructuredQuery.Engine.Config#deploySpec. The threads available will autoscale with demand (actions sent to the Engine).

Each new thread requires additional memory for all of the in-memory components of a Genai.UnstructuredQuery.Engine, so the memory available to Python processes is usually the limiting factor. On a leader node with 20 GB of memory available to Python processes, 25 threads (the default maxThreads) perform well. Beyond 25 threads, more memory will likely be required.
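As a rough planning aid, the reference point above (25 threads on 20 GB) can be extrapolated. The sketch below assumes memory grows linearly with thread count, which the text does not state; the function names are hypothetical and the results should be treated as a starting point for measurement, not a sizing guarantee.

```javascript
// Reference point taken from the text above: 25 threads on 20 GB.
const REFERENCE_THREADS = 25;
const REFERENCE_MEMORY_GB = 20;

// Assumption: memory grows linearly with thread count. Real usage depends on
// the in-memory components loaded, so treat the result as a starting point.
function estimateMemoryGb(threads) {
  return (threads * REFERENCE_MEMORY_GB) / REFERENCE_THREADS;
}

// Inverse estimate: how many threads a given amount of memory might support.
function estimateMaxThreads(memoryGb) {
  return Math.floor((memoryGb * REFERENCE_THREADS) / REFERENCE_MEMORY_GB);
}

console.log(estimateMemoryGb(40));   // 32
console.log(estimateMaxThreads(32)); // 40
```
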

Each leader or task node will start its own Genai.UnstructuredQuery.Engine for each named Genai.UnstructuredQuery.Engine.Config that's used. By default, all logic will use the Genai.UnstructuredQuery.Engine.Config with name default.

Genai.Retriever.Dense

Genai.Retriever.Dense starts a separate Genai.Retriever.Dense.Engine, since any Type that mixes Engine can't also mix Persistable due to a conflict in the update actions.

The default configuration for the Engine.DeploySpec may be overridden through Genai.Retriever.Dense#engineThreadPoolSpec.
