Retrieval Evaluation
Overview
Retrieval evaluation measures how well your C3 Generative AI configuration answers questions by comparing its responses against a known set of expected answers and sources. Use it to assess the retrieval pipeline, compare configurations, and catch regressions before they reach production.
The C3 Generative AI Platform provides the Agent Evaluation framework to run these evaluations end to end from Studio. Agent Evaluation supports versioned datasets, reusable experiments, per-test-case metric scores, execution traces, and side-by-side run comparison. Use it to evaluate the retrieval pipeline (for example, a RAG tool or a retrieval-backed agent) the same way you evaluate any other agent.
What Agent Evaluation provides
- Datasets: Versioned collections of test cases, each with an input question and optional expected output, context, and expected tools. Freeze a benchmark once and reuse it across every run.
- Experiments: Named configurations that combine one or more datasets with a set of metrics. Group related runs under one experiment so comparisons stay consistent.
- Runs: Individual executions of an experiment against an agent. Each run produces metric scores per test case and aggregate scores at the run level.
- Metrics: Scoring functions applied to every test-case result. Choose from DeepEval metrics, rubric metrics for open-ended outputs, or custom Python metrics for domain-specific rules.
- Trace integration: Each run captures an execution trace, so you can drill from a failing test case to the exact span that caused the failure.
- Compare Runs: Quantify regressions and improvements with per-metric score deltas between two runs.
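To make the relationship between per-test-case scores, run-level aggregates, and Compare Runs deltas concrete, the following is a minimal sketch in plain Python. It does not call the Agent Evaluation API. The data layout, the metric names, the sample scores, and the use of a simple mean as the run-level aggregate are all assumptions made for illustration; the framework may aggregate differently.

```python
from statistics import mean

# Hypothetical per-test-case scores for two runs, shaped as
# {test_case_id: {metric_name: score}}. Values are made up for illustration.
baseline_run = {
    "tc-1": {"answer_relevancy": 0.80, "faithfulness": 0.90},
    "tc-2": {"answer_relevancy": 0.60, "faithfulness": 0.70},
}
candidate_run = {
    "tc-1": {"answer_relevancy": 0.90, "faithfulness": 0.85},
    "tc-2": {"answer_relevancy": 0.70, "faithfulness": 0.65},
}


def aggregate(run):
    """Average each metric over all test cases in a run (assumed aggregate)."""
    by_metric = {}
    for scores in run.values():
        for metric, score in scores.items():
            by_metric.setdefault(metric, []).append(score)
    return {metric: mean(values) for metric, values in by_metric.items()}


def compare_runs(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    base, cand = aggregate(baseline), aggregate(candidate)
    return {m: round(cand[m] - base[m], 4) for m in base if m in cand}


deltas = compare_runs(baseline_run, candidate_run)
for metric, delta in sorted(deltas.items()):
    label = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{metric}: {delta:+.2f} ({label})")
```

A positive delta means the candidate run scored higher on that metric than the baseline; a negative delta flags a possible regression worth inspecting in the trace.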
How to run a retrieval evaluation
To evaluate retrieval for your application, follow the standard Agent Evaluation workflow:
- Create a dataset of retrieval questions with their expected answers and sources. You can create datasets from the UI or upload a CSV or JSON file from Python.
- Define the metrics that measure retrieval quality for your use case.
- Create an experiment that pairs the dataset with the metrics.
- Run the experiment against the agent or deployment that serves retrieval queries.
- Review results at the run, test case, and metric level. Inspect traces for failing test cases.
- Compare two runs side by side to quantify the impact of configuration changes.
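For the dataset step above, a retrieval dataset file could be prepared along the following lines. This is a sketch using only the Python standard library. The field names (`input`, `expected_output`, `context`, `expected_tools`) mirror the test-case description on this page but are assumptions, as are the sample questions, the file names, and the choice to JSON-encode list values inside CSV cells. Check the schema your upload expects before using it, and note that the upload call itself is not shown.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical test cases; field names and values are illustrative only.
test_cases = [
    {
        "input": "What is the recommended inspection interval for the pump?",
        "expected_output": "Every 6 months",
        "context": ["maintenance_manual.pdf"],
        "expected_tools": ["retrieval_tool"],
    },
    {
        # expected_output is optional, so this case omits it.
        "input": "Which document describes the shutdown procedure?",
        "context": ["operations_guide.pdf"],
        "expected_tools": ["retrieval_tool"],
    },
]

FIELDS = ["input", "expected_output", "context", "expected_tools"]


def write_json(cases, path):
    """Write the test cases as a JSON array."""
    path.write_text(json.dumps(cases, indent=2), encoding="utf-8")


def write_csv(cases, path):
    """Write the test cases as CSV, one row per test case."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for case in cases:
            row = dict(case)
            # Encode list-valued fields as JSON strings so each fits in one cell.
            for key in ("context", "expected_tools"):
                row[key] = json.dumps(case.get(key, []))
            writer.writerow(row)  # missing optional fields become empty cells


out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "retrieval_dataset.json"
csv_path = out_dir / "retrieval_dataset.csv"
write_json(test_cases, json_path)
write_csv(test_cases, csv_path)
print(f"Wrote {len(test_cases)} test cases to {out_dir}")
```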
For a step-by-step walkthrough, see the getting-started tutorial linked in See also.
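As an illustration of the metrics step in the workflow above, the scoring logic for a domain-specific retrieval rule might look like the following sketch. It is a plain Python function, not the Agent Evaluation custom-metric interface. The function names, the source-recall rule, the case-insensitive matching, and the 0.5 threshold are hypothetical choices for this example.

```python
def source_recall(expected_sources, retrieved_sources):
    """Fraction of expected sources that appear among the retrieved sources."""
    expected = {s.strip().lower() for s in expected_sources}
    if not expected:
        return 1.0  # nothing was expected, so nothing was missed
    retrieved = {s.strip().lower() for s in retrieved_sources}
    return len(expected & retrieved) / len(expected)


def passes(score, threshold=0.5):
    """Treat a test case as passing when its score meets the threshold."""
    return score >= threshold


# Hypothetical test case: two expected sources, one of which was retrieved.
score = source_recall(
    expected_sources=["maintenance_manual.pdf", "operations_guide.pdf"],
    retrieved_sources=["Operations_Guide.pdf", "release_notes.pdf"],
)
print(f"source recall = {score:.2f}, passed = {passes(score)}")
```

Source recall only checks whether the expected documents were retrieved. It says nothing about answer quality, so it is best paired with an answer-level metric.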