Testing Agents
Write test cases, configure evaluation criteria, and run automated evaluations with AgentEvaluator
Basic Workflow
The evaluation workflow has three steps: write test cases, set thresholds, and run.
```typescript
import { AgentEvaluator, AgentBuilder } from "@iqai/adk";

// 1. Build your agent
const { agent } = await AgentBuilder.create("weather_agent")
  .withModel("gemini-2.5-flash")
  .withTools([weatherTool]) // Assume weatherTool is a pre-defined tool.
  .withInstruction("You are a weather assistant.")
  .build();

// 2. Run evaluation against a directory of test files
await AgentEvaluator.evaluate(agent, "./evaluation/tests");

// 3. If no error is thrown, all metrics passed their thresholds
console.log("All evaluations passed!");
```

AgentEvaluator.evaluate() finds all *.test.json files recursively in the given path, loads the sibling test_config.json for thresholds, runs the agent against each test case, and throws an Error if any metric falls below its threshold.
Writing Test Cases
Test cases use the EvalSet JSON schema. Each file must end with .test.json.
Full Example
```json
{
  "evalSetId": "weather-agent-tests",
  "name": "Weather Agent Evaluation",
  "creationTimestamp": 0,
  "evalCases": [
    {
      "evalId": "case-1",
      "conversation": [
        {
          "creationTimestamp": 0,
          "userContent": {
            "role": "user",
            "parts": [{ "text": "What is the weather in London?" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "The weather in London" }]
          },
          "intermediateData": {
            "toolUses": [
              {
                "name": "get_weather",
                "args": { "city": "London" }
              }
            ],
            "intermediateResponses": []
          }
        }
      ]
    }
  ]
}
```

Response-Only Test Case
When you only care about the final response (ROUGE scoring), omit intermediateData:
```json
{
  "evalId": "greeting-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "Hello!" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Hello! How can I help you today?" }]
      }
    }
  ]
}
```

Tool Trajectory Test Case
When you care about which tools are called with what arguments:
```json
{
  "evalId": "search-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "Search for TypeScript tutorials" }]
      },
      "intermediateData": {
        "toolUses": [
          {
            "name": "web_search",
            "args": { "query": "TypeScript tutorials" }
          }
        ],
        "intermediateResponses": []
      }
    }
  ]
}
```

Multi-Turn Test Case
Test conversations with multiple turns:
```json
{
  "evalId": "multi-turn-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What is the weather in London?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "The weather in London is sunny, 22°C." }]
      },
      "intermediateData": {
        "toolUses": [{ "name": "get_weather", "args": { "city": "London" } }],
        "intermediateResponses": []
      }
    },
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What about Tokyo?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "The weather in Tokyo is cloudy, 18°C." }]
      },
      "intermediateData": {
        "toolUses": [{ "name": "get_weather", "args": { "city": "Tokyo" } }],
        "intermediateResponses": []
      }
    }
  ]
}
```

Stateful Test Case
Test agents that depend on session state using sessionInput:
```json
{
  "evalId": "stateful-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What is my account balance?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Your account balance is $1,250." }]
      }
    }
  ],
  "sessionInput": {
    "appName": "banking_app",
    "userId": "user_123",
    "state": {
      "account_balance": 1250,
      "account_type": "checking"
    }
  }
}
```

Schema Reference
EvalSet (top level):
| Field | Type | Required | Description |
|---|---|---|---|
| evalSetId | string | Yes | Unique identifier for this test set |
| name | string | No | Human-readable name |
| description | string | No | Description of what this set tests |
| evalCases | EvalCase[] | Yes | Array of test cases |
| creationTimestamp | number | Yes | Creation time (use 0 for static files) |
EvalCase (each test case):
| Field | Type | Required | Description |
|---|---|---|---|
| evalId | string | Yes | Unique identifier for this case |
| conversation | Invocation[] | Yes | One or more conversation turns |
| sessionInput | SessionInput | No | Pre-populated session state |
Invocation (each conversation turn):
| Field | Type | Required | Description |
|---|---|---|---|
| invocationId | string | No | Optional identifier for this turn |
| userContent | Content | Yes | The user message ({ role: "user", parts: [{ text: "..." }] }) |
| finalResponse | Content | No | Expected agent response (for response metrics) |
| intermediateData | IntermediateData | No | Expected tool usage (for trajectory metrics) |
| creationTimestamp | number | Yes | Timestamp (use 0 for static files) |
IntermediateData (expected tool usage):
| Field | Type | Required | Description |
|---|---|---|---|
| toolUses | FunctionCall[] | Yes | Expected tool calls in order ({ name, args }) |
| intermediateResponses | Array | Yes | Intermediate responses (use [] if none) |
SessionInput (pre-populated state):
| Field | Type | Required | Description |
|---|---|---|---|
| appName | string | Yes | Application name |
| userId | string | Yes | User identifier |
| state | Record<string, any> | Yes | Key-value state to pre-populate |
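For readers who prefer types over tables, the shapes above can be summarized roughly as the TypeScript sketch below. It mirrors the field names in this reference; the library's actual exported type names and exact definitions may differ.

```typescript
// Rough sketch of the EvalSet schema described above (not the library's literal type exports).
interface Content {
  role: "user" | "model";
  parts: Array<{ text: string }>;
}

interface IntermediateData {
  toolUses: Array<{ name: string; args: Record<string, unknown> }>;
  intermediateResponses: unknown[];
}

interface Invocation {
  invocationId?: string;
  userContent: Content;
  finalResponse?: Content;
  intermediateData?: IntermediateData;
  creationTimestamp: number;
}

interface SessionInput {
  appName: string;
  userId: string;
  state: Record<string, unknown>;
}

interface EvalCase {
  evalId: string;
  conversation: Invocation[];
  sessionInput?: SessionInput;
}

interface EvalSet {
  evalSetId: string;
  name?: string;
  description?: string;
  evalCases: EvalCase[];
  creationTimestamp: number;
}
```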
Configuring Criteria
Create a test_config.json file in the same directory as your test files:
```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Each key must be a valid metric name. The value is the minimum score required to pass.
Allowed metric keys:
- tool_trajectory_avg_score — tool call sequence matching (0-1)
- response_match_score — ROUGE-1 response similarity (0-1)
- response_evaluation_score — LLM-as-judge quality score (1-5)
- safety_v1 — safety / harmlessness check (0-1)
- final_response_match_v2 — LLM judge binary validity (0-1, experimental)
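For example, a config that also gates on the LLM-judged metrics might look like the following. The thresholds here are illustrative, not recommended defaults; note that response_evaluation_score uses a 1-5 scale while the other metrics are 0-1.

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "response_evaluation_score": 4.0,
    "safety_v1": 1.0
  }
}
```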
If no test_config.json is found, the defaults are:
```json
{
  "tool_trajectory_avg_score": 1.0,
  "response_match_score": 0.8
}
```

Match Criteria to Your Test Data
Only include criteria for data you've provided. If your test cases don't
include intermediateData, don't set tool_trajectory_avg_score. If they
don't include finalResponse, don't set response_match_score.
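For instance, a directory of response-only cases like the greeting test above (finalResponse but no intermediateData) would pair with a config that sets only the response threshold. The threshold value is illustrative.

```json
{
  "criteria": {
    "response_match_score": 0.8
  }
}
```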
Directory Structure
A typical evaluation setup:
```
evaluation/
├── tests/
│   ├── basic.test.json       # EvalSet test cases
│   ├── edge-cases.test.json  # More test cases
│   └── test_config.json      # Shared criteria for this directory
└── agent.ts                  # Agent definition
```

AgentEvaluator.evaluate() searches the directory recursively for *.test.json files. Each test file uses the test_config.json found in its own directory (or falls back to defaults).
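Because config lookup is per directory, you can keep differently scored suites side by side. A hypothetical layout (directory and file names are illustrative):

```
evaluation/
├── trajectory/
│   ├── tool-calls.test.json
│   └── test_config.json      # e.g. only tool_trajectory_avg_score
└── responses/
    ├── greetings.test.json
    └── test_config.json      # e.g. only response_match_score
```

A single AgentEvaluator.evaluate(agent, "./evaluation") call then picks up both suites, each judged against its own sibling config.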
Test Case Design Tips
- Be representative — cover scenarios your agent will encounter in production, not just the happy path. Include edge cases, ambiguous queries, and multi-turn conversations.
- Define clear expectations — for each test case, specify the expected response (for ROUGE scoring), the expected tool calls with arguments (for trajectory scoring), or both.
- Start small, expand iteratively — begin with 5-10 test cases covering core functionality. Add failure modes as you discover them in production.
- Version control your tests — test files are plain JSON. Commit them alongside your agent code.
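As a sketch of the "edge cases" tip: an ambiguous query where the reference response asks for clarification rather than answering. The evalId and wording are illustrative; because only finalResponse is provided, this case would be scored by response matching.

```json
{
  "evalId": "ambiguous-city-test",
  "conversation": [
    {
      "creationTimestamp": 0,
      "userContent": {
        "role": "user",
        "parts": [{ "text": "What's the weather like?" }]
      },
      "finalResponse": {
        "role": "model",
        "parts": [{ "text": "Which city would you like the weather for?" }]
      }
    }
  ]
}
```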
Integration with Test Runners
Use evaluation inside Vitest, Jest, or any test runner:
```typescript
import { describe, it } from "vitest";
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "./agent";

describe("Agent Evaluation", () => {
  it("should pass all evaluation criteria", async () => {
    const { agent } = await getMyAgent();

    // Throws on failure — Vitest treats the thrown error as a test failure
    await AgentEvaluator.evaluate(agent, "./evaluation/tests", 1);
  }, 60_000); // Allow generous timeout for LLM calls
});
```

Timeouts
Agent evaluations make real LLM API calls, so they are slower than unit tests.
Set generous timeouts (30-120 seconds) depending on the number of test cases
and numRuns.
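If most of your suite is made up of evaluations, a global timeout may be simpler than per-test values. One way to do this with Vitest (the 60-second value is illustrative):

```typescript
// vitest.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Applies to every test that does not specify its own timeout
    testTimeout: 60_000,
  },
});
```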
Legacy Test Format
Deprecated
The legacy array format is auto-migrated at runtime but emits a deprecation
warning. Use the EvalSet schema for new test files. Use
migrateEvalDataToNewSchema() to convert existing files.
The older format uses a flat array:
```json
[
  {
    "query": "What is the weather in London?",
    "reference": "The weather in London",
    "expected_tool_use": [
      { "name": "get_weather", "args": { "city": "London" } }
    ]
  }
]
```

Related Topics
📊 Metrics & Scoring
Deep dive into evaluation metrics, custom scoring, and statistical analysis
📐 Evaluation Patterns
Advanced patterns for CI/CD integration, domain-specific evaluation, and production monitoring
💡 Evaluation Concepts
Understand why and how agent evaluation works
🔧 Tools
Understand tool integration — key for testing tool trajectories