Testing Agents
Write test cases, configure evaluation criteria, and run automated evaluations with AgentEvaluator
Basic Workflow
The evaluation workflow has three steps: write test cases, set thresholds, and run.
```typescript
import { AgentEvaluator, AgentBuilder } from "@iqai/adk";

// 1. Build your agent
const { agent } = await AgentBuilder.create("weather_agent")
  .withModel("gemini-2.5-flash")
  .withTools([weatherTool]) // Assume weatherTool is a pre-defined tool.
  .withInstruction("You are a weather assistant.")
  .build();

// 2. Run evaluation against a directory of test files
await AgentEvaluator.evaluate(agent, "./evaluation/tests");

// 3. If no error is thrown, all metrics passed their thresholds
console.log("All evaluations passed!");
```

AgentEvaluator.evaluate() finds all *.test.json files recursively in the given path, loads the sibling test_config.json for thresholds, runs the agent against each test case, and throws an Error if any metric falls below its threshold.
Writing Test Cases
Test cases use the EvalSet JSON schema. Each file must end with .test.json.
Full Example
```json
{
  "evalSetId": "weather-agent-tests",
  "name": "Weather Agent Evaluation",
  "creationTimestamp": 0,
  "evalCases": [
    {
      "evalId": "case-1",
      "conversation": [
        {
          "creationTimestamp": 0,
          "userContent": {
            "role": "user",
            "parts": [{ "text": "What is the weather in London?" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "The weather in London" }]
          },
          "intermediateData": {
            "toolUses": [
              {
                "name": "get_weather",
                "args": { "city": "London" }
              }
            ],
            "intermediateResponses": []
          }
        }
      ]
    }
  ]
}
```

Response-Only Test Case
When you only care about the final response (ROUGE scoring), omit intermediateData:
```json
{
  "evalId": "greeting-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "Hello!" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Hello! How can I help you today?" }]
      }
    }
  ]
}
```

Tool Trajectory Test Case
When you care about which tools are called with what arguments:
```json
{
  "evalId": "search-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "Search for TypeScript tutorials" }]
      },
      "intermediateData": {
        "toolUses": [
          {
            "name": "web_search",
            "args": { "query": "TypeScript tutorials" }
          }
        ],
        "intermediateResponses": []
      }
    }
  ]
}
```

Multi-Turn Test Case
Test conversations with multiple turns:
```json
{
  "evalId": "multi-turn-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What is the weather in London?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "The weather in London is sunny, 22°C." }]
      },
      "intermediateData": {
        "toolUses": [{ "name": "get_weather", "args": { "city": "London" } }],
        "intermediateResponses": []
      }
    },
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What about Tokyo?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "The weather in Tokyo is cloudy, 18°C." }]
      },
      "intermediateData": {
        "toolUses": [{ "name": "get_weather", "args": { "city": "Tokyo" } }],
        "intermediateResponses": []
      }
    }
  ]
}
```

Stateful Test Case
Test agents that depend on session state using sessionInput:
```json
{
  "evalId": "stateful-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What is my account balance?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Your account balance is $1,250." }]
      }
    }
  ],
  "sessionInput": {
    "appName": "banking_app",
    "userId": "user_123",
    "state": {
      "account_balance": 1250,
      "account_type": "checking"
    }
  }
}
```

Schema Reference
EvalSet (top level):
| Field | Type | Required | Description |
|---|---|---|---|
| evalSetId | string | Yes | Unique identifier for this test set |
| name | string | No | Human-readable name |
| description | string | No | Description of what this set tests |
| evalCases | EvalCase[] | Yes | Array of test cases |
| creationTimestamp | number | Yes | Creation time (use 0 for static files) |
EvalCase (each test case):
| Field | Type | Required | Description |
|---|---|---|---|
| evalId | string | Yes | Unique identifier for this case |
| conversation | Invocation[] | Yes | One or more conversation turns |
| sessionInput | SessionInput | No | Pre-populated session state |
Invocation (each conversation turn):
| Field | Type | Required | Description |
|---|---|---|---|
| invocationId | string | No | Optional identifier for this turn |
| userContent | Content | Yes | The user message ({ role: "user", parts: [{ text: "..." }] }) |
| finalResponse | Content | No | Expected agent response (for response metrics) |
| intermediateData | IntermediateData | No | Expected tool usage (for trajectory metrics) |
| creationTimestamp | number | Yes | Timestamp (use 0 for static files) |
IntermediateData (expected tool usage):
| Field | Type | Required | Description |
|---|---|---|---|
| toolUses | FunctionCall[] | Yes | Expected tool calls in order ({ name, args }) |
| intermediateResponses | Array | Yes | Intermediate responses (use [] if none) |
SessionInput (pre-populated state):
| Field | Type | Required | Description |
|---|---|---|---|
| appName | string | Yes | Application name |
| userId | string | Yes | User identifier |
| state | Record<string, any> | Yes | Key-value state to pre-populate |
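For readers who prefer types over tables, the shapes above can be summarized roughly as the TypeScript sketch below. It mirrors the field names in this reference; the library's actual exported type names and exact definitions may differ.

```typescript
// Rough sketch of the EvalSet schema described above (not the library's literal type exports).
interface Content {
  role: "user" | "model";
  parts: Array<{ text: string }>;
}

interface IntermediateData {
  toolUses: Array<{ name: string; args: Record<string, unknown> }>;
  intermediateResponses: unknown[];
}

interface Invocation {
  invocationId?: string;
  userContent: Content;
  finalResponse?: Content;
  intermediateData?: IntermediateData;
  creationTimestamp: number;
}

interface SessionInput {
  appName: string;
  userId: string;
  state: Record<string, unknown>;
}

interface EvalCase {
  evalId: string;
  conversation: Invocation[];
  sessionInput?: SessionInput;
}

interface EvalSet {
  evalSetId: string;
  name?: string;
  description?: string;
  evalCases: EvalCase[];
  creationTimestamp: number;
}
```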
Configuring Criteria
Create a test_config.json file in the same directory as your test files:
```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Each key must be a valid metric name. The value is the minimum score required to pass.
Allowed metric keys:
- tool_trajectory_avg_score — tool call sequence matching (0-1)
- response_match_score — ROUGE-1 response similarity (0-1)
- response_evaluation_score — LLM-as-judge quality score (1-5)
- safety_v1 — safety / harmlessness check (0-1)
- final_response_match_v2 — LLM judge binary validity (0-1, experimental)
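For example, a config that also gates on the LLM-judged metrics might look like the following. The thresholds here are illustrative, not recommended defaults; note that response_evaluation_score uses a 1-5 scale while the other metrics are 0-1.

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "response_evaluation_score": 4.0,
    "safety_v1": 1.0
  }
}
```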
If no test_config.json is found, the defaults are:
```json
{
  "tool_trajectory_avg_score": 1.0,
  "response_match_score": 0.8
}
```

Match Criteria to Your Test Data
Only include criteria for data you've provided. If your test cases don't
include intermediateData, don't set tool_trajectory_avg_score. If they
don't include finalResponse, don't set response_match_score.
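For instance, a directory of response-only cases like the greeting test above (finalResponse but no intermediateData) would pair with a config that sets only the response threshold. The threshold value is illustrative.

```json
{
  "criteria": {
    "response_match_score": 0.8
  }
}
```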
Directory Structure
A typical evaluation setup:
```
evaluation/
├── tests/
│   ├── basic.test.json       # EvalSet test cases
│   ├── edge-cases.test.json  # More test cases
│   └── test_config.json      # Shared criteria for this directory
└── agent.ts                  # Agent definition
```

AgentEvaluator.evaluate() searches the directory recursively for *.test.json files. Each test file uses the test_config.json found in its own directory (or falls back to defaults).
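Because config lookup is per directory, you can keep differently scored suites side by side. A hypothetical layout (directory and file names are illustrative):

```
evaluation/
├── trajectory/
│   ├── tool-calls.test.json
│   └── test_config.json      # e.g. only tool_trajectory_avg_score
└── responses/
    ├── greetings.test.json
    └── test_config.json      # e.g. only response_match_score
```

A single AgentEvaluator.evaluate(agent, "./evaluation") call then picks up both suites, each judged against its own sibling config.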
Test Case Design Tips
- Be representative — cover scenarios your agent will encounter in production, not just the happy path. Include edge cases, ambiguous queries, and multi-turn conversations.
- Define clear expectations — for each test case, specify the expected response (for ROUGE scoring), the expected tool calls with arguments (for trajectory scoring), or both.
- Start small, expand iteratively — begin with 5-10 test cases covering core functionality. Add failure modes as you discover them in production.
- Version control your tests — test files are plain JSON. Commit them alongside your agent code.
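As a sketch of the "edge cases" tip: an ambiguous query where the reference response asks for clarification rather than answering. The evalId and wording are illustrative; because only finalResponse is provided, this case would be scored by response matching.

```json
{
  "evalId": "ambiguous-city-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What's the weather like?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Which city would you like the weather for?" }]
      }
    }
  ]
}
```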
Integration with Test Runners
Use evaluation inside Vitest, Jest, or any test runner:
```typescript
import { describe, it } from "vitest";
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "./agent";

describe("Agent Evaluation", () => {
  it("should pass all evaluation criteria", async () => {
    const { agent } = await getMyAgent();

    // Throws on failure — Vitest treats the thrown error as a test failure
    await AgentEvaluator.evaluate(agent, "./evaluation/tests", 1);
  }, 60_000); // Allow generous timeout for LLM calls
});
```

Timeouts
Agent evaluations make real LLM API calls, so they are slower than unit tests.
Set generous timeouts (30-120 seconds) depending on the number of test cases
and numRuns.
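If most of your suite is made up of evaluations, a global timeout may be simpler than per-test values. One way to do this with Vitest (the 60-second value is illustrative):

```typescript
// vitest.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Applies to every test that does not specify its own timeout
    testTimeout: 60_000,
  },
});
```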
Legacy Test Format
Deprecated
The legacy array format is auto-migrated at runtime but emits a deprecation
warning. Use the EvalSet schema for new test files. Use
migrateEvalDataToNewSchema() to convert existing files.
The older format uses a flat array:
```json
[
  {
    "query": "What is the weather in London?",
    "reference": "The weather in London",
    "expected_tool_use": [
      { "name": "get_weather", "args": { "city": "London" } }
    ]
  }
]
```

Related Topics
📊 Metrics & Scoring
Deep dive into evaluation metrics, custom scoring, and statistical analysis
📐 Evaluation Patterns
Advanced patterns for CI/CD integration, domain-specific evaluation, and production monitoring
💡 Evaluation Concepts
Understand why and how agent evaluation works
🔧 Tools
Understand tool integration — key for testing tool trajectories