Embedding Models for the Vector Store
In C3 Generative AI, documents are parsed, chunked, and embedded into a vector store for later retrieval. For information on multimodal parsing and chunking, see Multimodal Parsing.
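To make the embed-and-retrieve flow concrete, here is an illustrative sketch (not the C3 implementation): chunks are embedded once at ingestion, and at query time the query embedding is compared against the stored vectors. A toy word-count "embedding" stands in for a real model so the example is self-contained; the `toy_embed` and `cosine` helpers are hypothetical.

```py
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Toy embedding: a word-count vector. A real embedder returns a
    dense float vector, but the retrieval logic is the same."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": embed each chunk once at ingestion time.
chunks = ["turbine maintenance schedule", "invoice payment terms"]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Retrieval: embed the query and return the most similar chunk.
query_vec = toy_embed("when is the turbine maintained")
best = max(store, key=lambda item: cosine(query_vec, item[1]))
print(best[0])  # the maintenance chunk scores highest
```

Because retrieval quality rests entirely on how well the embedder places related text near each other in vector space, the choice of embedding model directly affects answer quality.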
Depending on the model architecture, training data, intended use cases, and embedding performance and quality you need, you may want to change the embedder in your application. C3 Generative AI supports any embedder from Hugging Face, including mixedbread-ai/mxbai-embed-large-v1, as well as LLM-based embedders such as text-embedding-3-large from OpenAI. For the benefits and costs of each embedder, see the documentation from each model's source.
How to change the default embedder
The embedder is implemented in the Genai.Embedder Type, which provides a getEmbedder method. This method takes a specification of the Genai.Embedder.Spec Type; use the fields of Embedder.Spec to customize your embedder.
By default, the built-in embedder is the e5 transformer model (intfloat/multilingual-e5-large-instruct). You can change this with the following steps:
HuggingFace Embedders
You can use the direct model names from HuggingFace or the predefined model enums in the application.
To use direct model names, use the following code:
```py
huggingface_embedder_spec = c3.Genai.Embedder.Spec(
    embedderModelName='mixedbread-ai/mxbai-embed-large-v1',
    embedderType='GenaiCore.Embedder.Hf',
).withDefaults()
embedder = c3.Genai.Embedder.getEmbedder(huggingface_embedder_spec)
```
To use predefined models, use the following code:
```py
# Available predefined models:
# c3.Genai.Retriever.Dense.Embedder.MXBAI → 'mixedbread-ai/mxbai-embed-large-v1'
# c3.Genai.Retriever.Dense.Embedder.E5 → 'intfloat/e5-large-v2'
# c3.Genai.Retriever.Dense.Embedder.TASB → 'sentence-transformers/msmarco-distilbert-base-tas-b'
embedder_spec = c3.Genai.Embedder.Spec(
    embedderModelName=c3.Genai.Retriever.Dense.Embedder.MXBAI,
    embedderType='GenaiCore.Embedder.Hf',
).withDefaults()
embedder = c3.Genai.Embedder.getEmbedder(embedder_spec)
```
LLM-Based Embedders
To use an LLM-based embedder, change both the embedder type and the provider type. The following code specifies an Azure OpenAI embedder.
```py
openai_embedder_spec = c3.Genai.Embedder.Spec(
    embedderModelName='text-embedding-3-large',
    embedderType='GenaiCore.Embedder.Llm',
    providerType='GenaiCore.Llm.AzureOpenAi',
).withDefaults()
openai_embedder = c3.Genai.Embedder.getEmbedder(openai_embedder_spec)
```
Other Supported Providers
AWS Bedrock and Google Vertex AI embedders are also available. Use providerType='GenaiCore.Llm.Bedrock' or providerType='GenaiCore.Llm.VertexAi' respectively.
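As a sketch following the Azure OpenAI pattern above, a Bedrock spec might look like the following. The model name here is an illustrative assumption; substitute the embedding model enabled in your AWS account.

```py
# Sketch only: 'amazon.titan-embed-text-v2:0' is an example model name,
# not a C3-documented default. Replace it with your enabled Bedrock model.
bedrock_embedder_spec = c3.Genai.Embedder.Spec(
    embedderModelName='amazon.titan-embed-text-v2:0',
    embedderType='GenaiCore.Embedder.Llm',
    providerType='GenaiCore.Llm.Bedrock',
).withDefaults()
bedrock_embedder = c3.Genai.Embedder.getEmbedder(bedrock_embedder_spec)
```

For Google Vertex AI, the same pattern applies with providerType='GenaiCore.Llm.VertexAi' and a Vertex AI embedding model name.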