Agent Evaluation

Automated quality assurance for your agents — test behavior, score performance, and catch regressions before they reach production

Your agent works — it answers questions, calls tools, and handles conversations. But how do you know it keeps working after you change a prompt, swap a model, or add a new tool?

Agent evaluation is automated QA for your agent. You define test scenarios with expected outcomes, run your agent against them, and get scored results. If scores drop below your thresholds, the evaluation fails — giving you a safety net before changes reach production.

Status

Evaluation is available in @iqai/adk (public but still evolving). The API surface may receive additive changes; core evaluator signatures are stable.

Key Features

  • AgentEvaluator — single entry point: point it at your agent and a test directory, and it handles the rest
  • Multiple Metrics — ROUGE scoring, LLM-as-judge, tool trajectory analysis, safety evaluation
  • Flexible Test Format — JSON-based test cases with support for multi-turn conversations
  • Local & Cloud Evaluation — run evaluations locally or integrate with Vertex AI
  • Session State Support — test stateful agents with conversation history
  • Configurable Thresholds — set pass/fail criteria for different evaluation metrics

Core Components

The evaluation framework includes these key components:

  • AgentEvaluator — main entry point for automated agent performance assessment
  • TrajectoryEvaluator — analyzes tool usage patterns and decision paths
  • ResponseEvaluator — ROUGE-1 scoring and LLM-based response quality assessment
  • SafetyEvaluatorV1 — evaluates response harmlessness and safety
  • EvalSet Management — organized test cases with metadata and version control
  • MetricEvaluatorRegistry — extensible system for custom evaluation metrics
  • LocalEvalService — complete local evaluation pipeline with parallel execution

Getting Started

Workflow:

  1. Author Dataset — provide either:
    • Legacy array format: [{ "query": "...", "reference": "..." }] (auto-migrated at runtime with a deprecation warning), or
    • New EvalSet schema (preferred):
{
  "evalSetId": "calc-v1",
  "name": "Simple arithmetic",
  "creationTimestamp": 0,
  "evalCases": [
    {
      "evalId": "case-1",
      "conversation": [
        {
          "creationTimestamp": 0,
          "userContent": {
            "role": "user",
            "parts": [{ "text": "What is 2 + 2?" }]
          },
          "finalResponse": {
            "role": "model",
            "parts": [{ "text": "4" }]
          }
        }
      ]
    }
  ]
}
  2. Configure Criteria in test_config.json (sibling to each *.test.json — see the layout sketch after this list):
{
  "criteria": {
    "response_match_score": 0.8
  }
}
  3. Run Evaluation — signature: await AgentEvaluator.evaluate(agent, pathOrFile, numRuns?).
    • Throws an Error if any metric fails its threshold (no return value).
  4. Iterate — tighten thresholds and expand cases as the agent improves.
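
A typical on-disk layout, assuming the ./evaluation/tests directory used in the example below (file names are illustrative):

evaluation/tests/
  basic.test.json      — EvalSet (or legacy array) with eval cases
  test_config.json     — criteria thresholds for this directory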

Minimal usage example — use .build() when you need the agent instance for evaluation or reuse:

import { AgentBuilder, AgentEvaluator } from "@iqai/adk";

const { agent } = await AgentBuilder.create("eval_agent")
  .withModel("gemini-2.5-flash")
  .withInstruction("Answer briefly and accurately.")
  .build();

// Point at a directory — finds all *.test.json files recursively
await AgentEvaluator.evaluate(agent, "./evaluation/tests");
// Throws an Error if any metric fails its threshold

`.build()` vs `.ask()`

Use .build() when you need an agent instance — for evaluation, multi-agent composition, or reuse across multiple calls. Use .ask() for one-off queries where you just need the response.
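
For comparison, a one-off query with .ask() — a minimal sketch (the agent name and prompt are illustrative):

import { AgentBuilder } from "@iqai/adk";

// One-shot: no agent instance is kept, so there is nothing to pass to AgentEvaluator
const answer = await AgentBuilder.create("quick_agent")
  .withModel("gemini-2.5-flash")
  .ask("What is 2 + 2?");
console.log(answer);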

Programmatic usage with an already-loaded EvalSet:

import * as fs from "node:fs/promises";
import { AgentEvaluator } from "@iqai/adk";
import type { EvalSet } from "@iqai/adk";
import { agent } from "./my-agent"; // Your agent instance

const evalSet: EvalSet = JSON.parse(
  await fs.readFile("./tests/basic.test.json", "utf-8"),
);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  { response_match_score: 0.8 },
  2, // numRuns
  true, // printDetailedResults
);
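
evaluate is documented to throw on failure; assuming evaluateEvalSet behaves the same, a try/catch is the natural place to gate CI (a sketch continuing the example above; the error message format is not guaranteed):

try {
  await AgentEvaluator.evaluateEvalSet(
    agent,
    evalSet,
    { response_match_score: 0.8 },
    2, // numRuns
    true, // printDetailedResults
  );
  console.log("All metrics passed their thresholds");
} catch (err) {
  // Failure is signaled via the thrown Error, not a return value
  console.error("Evaluation failed:", err instanceof Error ? err.message : err);
  process.exitCode = 1; // fail the CI job
}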

Default criteria if test_config.json is absent:

{ "tool_trajectory_avg_score": 1.0, "response_match_score": 0.8 }

Each metric key must be one of: tool_trajectory_avg_score, response_evaluation_score, response_match_score, safety_v1.