Agent Evaluation Overview
Agent Evaluation gives you a repeatable way to measure agent quality before changes reach production. It answers the question that matters most when tuning AI systems: did this change make things better or worse?
Agent Evaluation is designed for offline evaluation workflows. Use it for development, QA, and regression analysis. For live production monitoring, see Observability and Monitoring Overview.
The problem it solves
AI agents are sensitive to change. Modifying a prompt, swapping a large language model (LLM), adding a tool, or adjusting a retrieval strategy can improve some responses while silently degrading others. Without a structured evaluation process, these regressions go undetected until they surface in production.
Agent Evaluation addresses this by giving teams:
- Consistent benchmark: the same versioned dataset is used across every run, so comparisons are always apples-to-apples.
- Quantified deltas: numeric metric scores per test case and per run, with side-by-side comparison across two runs.
- Traceable debugging: every run can capture a full execution trace, so a failing test case leads directly to the span that caused the failure.
- Team-scale repeatability: datasets, experiments, metrics, and results are stored server-side and accessible to everyone on the team.
Without structured evaluation, teams rely on manual spot-checks and informal human judgment, which do not scale and cannot catch regressions systematically.
Agent Evaluation working
The framework is organized around four core entities that map to the natural evaluation lifecycle:
| Entity | What it represents | Why it matters |
|---|---|---|
| Dataset | A versioned collection of test cases — structured or unstructured. | Freeze a known-good benchmark so every comparison is reproducible. |
| Experiment | A named configuration that combines datasets, metrics, and a runner. | Group related evaluation runs under one coherent setup. |
| Run | A single execution of an experiment against an agent. | One run per dataset per execution — each run is a snapshot of agent quality at a point in time. |
| Metric | A scoring function applied to each test-case result. | Quantify quality in a way that can be sorted, filtered, and compared. |
Datasets are versioned when you upload or update them; experiments reference a specific dataset version so that every run uses the same benchmark and comparisons are reproducible.
Each run can also capture an associated execution trace (traceId), which enables span-level analysis for debugging and root-cause investigation.
Structured and unstructured evaluation
Agent Evaluation supports two test-case models:
- Structured test cases: input/expected-output style test cases (
input,expectedOutput, optionalcontext, optionalexpectedTools). Use these when your agent is answering questions or completing tasks where correct answers are known. - Unstructured test cases: flexible JSON payload test cases (
fields) plus optional metadata. Use these when your evaluation target is not a single input-to-output transformation, for example multi-turn scenarios, tool orchestration traces, or open-ended responses.
Metrics and scoring
Metrics are attached to experiments and applied to every test-case result in every run.
- Scores are stored per metric in each test-case result (
metricResults). - Run-level tables display metric scores as percentages, averaged across all test cases in the run.
- A metric pass is treated as a score of
1.0(100%). - Metric failures are highlighted in the Compare Runs view, ranked by score delta.
Supported metric patterns:
| Metric type | API type | Best for |
|---|---|---|
| DeepEval metrics | GenaiCore.Eval.Metric.DeepEval | Standard NLP/LLM quality metrics such as answer relevancy, faithfulness, and exact match. |
| Custom Python metrics | GenaiCore.Eval.Metric.NativePy | Domain-specific business rules and heuristic checks. |
| Rubric metrics | GenaiCore.Eval.Metric.Rubric | LLM-graded scoring for unstructured or open-ended test cases. |
The evaluation lifecycle
A typical evaluation cycle follows this sequence:
- Create or upload a dataset (CSV or JSON) with test cases and optional recommended metrics.
- Create an experiment that references one or more datasets plus the metric set to apply.
- Run the experiment against an agent or deployment.
- Review run outcomes at run level (aggregate scores), test-case level (individual rows), and metric level (per-metric reasons).
- Inspect execution traces for test cases that failed, using span-level breakdowns.
- Compare two runs side by side to quantify regressions and improvements, including per-metric score deltas, pass/fail changes, and which test cases regressed or improved.
To learn more about Agent Evaluations via a python notebook, see GenAI Platform Tutorials - Agent Evaluation.
To get started with Agent Evaluation, see Getting Started - Agent Evaluation.