Agent Evaluation
Automated quality assurance for your agents — test behavior, score performance, and catch regressions before they reach production
Your agent works — it answers questions, calls tools, and handles conversations. But how do you know it keeps working after you change a prompt, swap a model, or add a new tool?
Agent evaluation is automated QA for your agent. You define test scenarios with expected outcomes, run your agent against them, and get scored results. If scores drop below your thresholds, the evaluation fails — giving you a safety net before changes reach production.
Status
Evaluation is available in @iqai/adk (public but still evolving). The surface
may receive additive changes; core evaluator signatures are stable.
Key Features
- AgentEvaluator — single entry point: point it at your agent and a test directory, and it handles the rest
- Multiple Metrics — ROUGE scoring, LLM-as-judge, tool trajectory analysis, safety evaluation
- Flexible Test Format — JSON-based test cases with support for multi-turn conversations
- Local & Cloud Evaluation — run evaluations locally or integrate with Vertex AI
- Session State Support — test stateful agents with conversation history
- Configurable Thresholds — set pass/fail criteria for different evaluation metrics
Explore
📋 Evaluation Concepts
Core principles and challenges in agent evaluation
🧪 Testing Agents
Manual and automated approaches to testing agent behavior
📊 Metrics and Scoring
Measurement approaches for trajectory and response quality
🎯 Evaluation Patterns
Domain-specific evaluation strategies and best practices
Core Components
The evaluation framework includes these key components:
- AgentEvaluator — main entry point for automated agent performance assessment
- TrajectoryEvaluator — analyzes tool usage patterns and decision paths
- ResponseEvaluator — ROUGE-1 scoring and LLM-based response quality assessment
- SafetyEvaluatorV1 — evaluates response harmlessness and safety
- EvalSet Management — organized test cases with metadata and version control
- MetricEvaluatorRegistry — extensible system for custom evaluation metrics
- LocalEvalService — complete local evaluation pipeline with parallel execution
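The ResponseEvaluator above is described as using ROUGE-1 scoring. As a rough illustration of what that metric measures (a sketch, not the library's actual implementation — the function names here are hypothetical), ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate response:

```typescript
// Illustrative ROUGE-1 recall sketch; not @iqai/adk internals.
function tokenize(text: string): string[] {
  // Lowercase and split into alphanumeric unigrams
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function rouge1Recall(candidate: string, reference: string): number {
  const refTokens = tokenize(reference);
  if (refTokens.length === 0) return 0;

  // Count candidate unigrams so repeated tokens are not double-matched
  const candCounts = new Map<string, number>();
  for (const t of tokenize(candidate)) {
    candCounts.set(t, (candCounts.get(t) ?? 0) + 1);
  }

  // Matched reference unigrams / total reference unigrams
  let matched = 0;
  for (const t of refTokens) {
    const remaining = candCounts.get(t) ?? 0;
    if (remaining > 0) {
      matched++;
      candCounts.set(t, remaining - 1);
    }
  }
  return matched / refTokens.length;
}

console.log(rouge1Recall("The answer is 4", "4")); // reference fully covered → 1
```

A threshold like `response_match_score: 0.8` then means at least 80% of the reference's unigrams must appear in the agent's response.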
Getting Started
Workflow:

1. Author Dataset — provide either the legacy array format (auto-migrated at runtime; a deprecation warning is emitted):

   ```json
   [{ "query": "...", "reference": "..." }]
   ```

   or the new EvalSet schema (preferred):

   ```json
   {
     "evalSetId": "calc-v1",
     "name": "Simple arithmetic",
     "creationTimestamp": 0,
     "evalCases": [
       {
         "evalId": "case-1",
         "conversation": [
           {
             "creationTimestamp": 0,
             "userContent": {
               "role": "user",
               "parts": [{ "text": "What is 2 + 2?" }]
             },
             "finalResponse": {
               "role": "model",
               "parts": [{ "text": "4" }]
             }
           }
         ]
       }
     ]
   }
   ```

2. Configure Criteria — in `test_config.json` (sibling to each `*.test.json`):

   ```json
   {
     "criteria": {
       "response_match_score": 0.8
     }
   }
   ```

3. Run Evaluation — signature: `await AgentEvaluator.evaluate(agent, pathOrFile, numRuns?)`. Throws an `Error` if any metric falls below its threshold (there is no return value).

4. Iterate — tighten thresholds and expand cases as the agent improves.
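To make the legacy-to-EvalSet auto-migration in step 1 concrete, here is a sketch of what that mapping looks like. The types and helper name are illustrative, inferred from the schema above; they are not the library's actual migration code:

```typescript
// Shapes inferred from the EvalSet JSON schema shown above (illustrative).
interface LegacyCase { query: string; reference: string }
interface ContentPart { text: string }
interface Content { role: string; parts: ContentPart[] }
interface EvalTurn {
  creationTimestamp: number;
  userContent: Content;
  finalResponse: Content;
}
interface EvalCaseShape { evalId: string; conversation: EvalTurn[] }
interface EvalSetShape {
  evalSetId: string;
  name: string;
  creationTimestamp: number;
  evalCases: EvalCaseShape[];
}

// Hypothetical migration helper: each legacy { query, reference } pair
// becomes a single-turn conversation.
function migrateLegacy(cases: LegacyCase[], evalSetId: string): EvalSetShape {
  const now = Date.now();
  return {
    evalSetId,
    name: evalSetId,
    creationTimestamp: now,
    evalCases: cases.map((c, i) => ({
      evalId: `case-${i + 1}`,
      conversation: [
        {
          creationTimestamp: now,
          userContent: { role: "user", parts: [{ text: c.query }] },
          finalResponse: { role: "model", parts: [{ text: c.reference }] },
        },
      ],
    })),
  };
}

const set = migrateLegacy(
  [{ query: "What is 2 + 2?", reference: "4" }],
  "calc-v1",
);
console.log(set.evalCases[0].evalId); // "case-1"
```

The key point is that the legacy `reference` string becomes the `finalResponse` text, which is what `response_match_score` compares against.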
Minimal usage example — use `.build()` when you need the agent instance for evaluation or reuse:

```typescript
import { AgentBuilder, AgentEvaluator } from "@iqai/adk";

const { agent } = await AgentBuilder.create("eval_agent")
  .withModel("gemini-2.5-flash")
  .withInstruction("Answer briefly and accurately.")
  .build();

// Point at a directory — finds all *.test.json files recursively
await AgentEvaluator.evaluate(agent, "./evaluation/tests");
// Throws an Error if any metric fails its threshold
```

`.build()` vs `.ask()`

Use `.build()` when you need an agent instance — for evaluation, multi-agent composition, or reuse across multiple calls. Use `.ask()` for one-off queries where you just need the response.
Programmatic usage with an already-loaded EvalSet:

```typescript
import * as fs from "node:fs/promises";
import { AgentEvaluator } from "@iqai/adk";
import type { EvalSet } from "@iqai/adk";
import { agent } from "./my-agent"; // Your agent instance

const evalSet: EvalSet = JSON.parse(
  await fs.readFile("./tests/basic.test.json", "utf-8"),
);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  { response_match_score: 0.8 },
  2, // numRuns
  true, // printDetailedResults
);
```

Default criteria if `test_config.json` is absent:

```json
{ "tool_trajectory_avg_score": 1.0, "response_match_score": 0.8 }
```

Each metric key must be one of: `tool_trajectory_avg_score`, `response_evaluation_score`, `response_match_score`, `safety_v1`.
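Because only those four metric keys are recognized, a typo in `test_config.json` can silently miss the metric you meant to gate on. A small guard like the following (a hypothetical helper, not part of @iqai/adk) can catch such mistakes before an evaluation run:

```typescript
// The four metric keys documented above.
const KNOWN_METRICS = new Set([
  "tool_trajectory_avg_score",
  "response_evaluation_score",
  "response_match_score",
  "safety_v1",
]);

// Hypothetical pre-flight check for a criteria object.
function validateCriteria(criteria: Record<string, number>): string[] {
  const errors: string[] = [];
  for (const [key, threshold] of Object.entries(criteria)) {
    if (!KNOWN_METRICS.has(key)) {
      errors.push(`Unknown metric: ${key}`);
    } else if (threshold < 0 || threshold > 1) {
      errors.push(`Threshold out of range for ${key}: ${threshold}`);
    }
  }
  return errors;
}

console.log(validateCriteria({ response_match_score: 0.8 })); // no errors
console.log(validateCriteria({ rouge: 0.5 })); // flags the unknown key
```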