Agent Evaluation
Comprehensive framework for testing and validating agent performance across scenarios
Agent evaluation provides systematic approaches to testing and validating agent performance, helping you move agents from prototype to production-ready AI systems.
Status
Evaluation is available in @iqai/adk
(public but still evolving). The API surface may receive additive changes; core evaluator signatures are stable.
Overview
Unlike traditional software testing, agent evaluation must account for the probabilistic nature of LLM responses and the complexity of multi-step reasoning processes. Effective evaluation encompasses multiple dimensions from tool usage patterns to response quality.
Key Features
The ADK evaluation framework provides comprehensive agent testing capabilities:
- Agent Evaluator: Automated evaluation of agent performance across test cases
- Multiple Metrics: ROUGE scoring, LLM-as-judge, tool trajectory analysis, safety evaluation
- Flexible Test Format: JSON-based test cases with support for multi-turn conversations
- Local & Cloud Evaluation: Run evaluations locally or integrate with Vertex AI
- Session State Support: Test stateful agents with conversation history
- Configurable Thresholds: Set pass/fail criteria for different evaluation metrics
Documentation Structure
📋 Evaluation Concepts
Core principles and challenges in agent evaluation
🧪 Testing Agents
Current approaches and future automated testing methods
📊 Metrics and Scoring
Measurement approaches for trajectory and response quality
🎯 Evaluation Patterns
Domain-specific evaluation strategies and best practices
Core Components
The evaluation framework includes these key components:
- AgentEvaluator: Main entry point for agent performance assessment
- TrajectoryEvaluator: Analyzes tool usage patterns and decision paths
- ResponseEvaluator: ROUGE-1 scoring and LLM-based response quality assessment
- SafetyEvaluator: Evaluates response harmlessness and safety
- EvalSet Management: Organized test cases with metadata and version control
- MetricEvaluatorRegistry: Extensible system for custom evaluation metrics
- LocalEvalService: Complete local evaluation pipeline with parallel execution
Getting Started
Workflow:
- Author Dataset: Provide either:
  - Legacy array format: `[{ "query": "...", "reference": "..." }]` (auto-migrated at runtime – a deprecation warning is emitted), or
  - New EvalSet schema (preferred):
```json
{
  "evalSetId": "calc-v1",
  "name": "Simple arithmetic",
  "creationTimestamp": 0,
  "evalCases": [
    {
      "evalId": "case-1",
      "conversation": [
        {
          "creationTimestamp": 0,
          "userContent": { "role": "user", "parts": [{ "text": "What is 2 + 2?" }] },
          "finalResponse": { "role": "model", "parts": [{ "text": "4" }] }
        }
      ]
    }
  ]
}
```
- Configure Criteria in `test_config.json` (sibling to each `*.test.json`):

```json
{ "criteria": { "response_match_score": 0.8 } }
```
- Run Evaluation – signature: `await AgentEvaluator.evaluate(agent, pathOrFile, numRuns?)`. Throws an Error if any metric fails its threshold (no return value).
- Iterate – tighten thresholds and expand cases as the agent improves.
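On disk, a typical layout looks like the sketch below (file names are illustrative; the only convention taken from the steps above is that `test_config.json` sits beside the `*.test.json` files it configures):

```txt
evaluation/
└── tests/
    ├── arithmetic.test.json   # EvalSet (or legacy array) authored above
    └── test_config.json       # criteria thresholds configured above
```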
Minimal usage example:
```ts
import { AgentBuilder, AgentEvaluator } from '@iqai/adk';

const { agent } = await AgentBuilder.create('eval_agent')
  .withModel('gemini-2.5-flash')
  .withInstruction('Answer briefly and accurately.')
  .build();

await AgentEvaluator.evaluate(agent, './evaluation/tests');
// If it does not throw, all criteria passed.
```
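Because `evaluate` signals failure by throwing rather than returning a report, CI scripts typically wrap the call. A minimal sketch, assuming `numRuns` repeats each eval case (the value 3 and the logging are illustrative):

```ts
import { AgentBuilder, AgentEvaluator } from '@iqai/adk';

const { agent } = await AgentBuilder.create('eval_agent')
  .withModel('gemini-2.5-flash')
  .withInstruction('Answer briefly and accurately.')
  .build();

try {
  // Optional third argument (numRuns) – assumed here to repeat each eval case
  // for more stable scores against LLM nondeterminism.
  await AgentEvaluator.evaluate(agent, './evaluation/tests', 3);
  console.log('All evaluation criteria passed.');
} catch (err) {
  // evaluate() throws when any metric falls below its configured threshold.
  console.error('Evaluation failed:', err);
  process.exitCode = 1;
}
```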
Programmatic (advanced) usage with an already loaded EvalSet:
```ts
// If you parsed an EvalSet JSON yourself
// import type { Evaluation } from '@iqai/adk';
// await AgentEvaluator.evaluateEvalSet(agent, evalSet, { response_match_score: 0.8 });
```
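A fuller sketch of the same pattern, assuming the EvalSet JSON authored earlier was saved as `./evaluation/calc.evalset.json` (the file path and the untyped `JSON.parse` are illustrative, not a prescribed pattern):

```ts
import { readFile } from 'node:fs/promises';
import { AgentBuilder, AgentEvaluator } from '@iqai/adk';

const { agent } = await AgentBuilder.create('eval_agent')
  .withModel('gemini-2.5-flash')
  .withInstruction('Answer briefly and accurately.')
  .build();

// Load an EvalSet you authored yourself (shape shown in the workflow above).
const evalSet = JSON.parse(
  await readFile('./evaluation/calc.evalset.json', 'utf8')
);

// Same semantics as evaluate(): throws if any case falls below the supplied criteria.
await AgentEvaluator.evaluateEvalSet(agent, evalSet, { response_match_score: 0.8 });
```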
Default criteria if `test_config.json` is absent:

```json
{ "tool_trajectory_avg_score": 1.0, "response_match_score": 0.8 }
```

Each metric key must be one of: `tool_trajectory_avg_score`, `response_evaluation_score`, `response_match_score`, `safety_v1`.
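For example, a `test_config.json` that restates these defaults explicitly is a reasonable starting point (the values below simply mirror the defaults shown above):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Add `response_evaluation_score` or `safety_v1` keys in the same way once you know the score scale you want to gate on (see Metrics and Scoring).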
Related Topics