
Evaluation Concepts

Why agent evaluation differs from traditional testing, and the principles behind the ADK-TS evaluation framework

Why You Can't Just Unit Test an Agent

With traditional code, you write a test like expect(add(2, 2)).toBe(4) and it either passes or fails. Agents don't work that way — ask the same question twice and you might get two different (but equally valid) answers. The tools an agent calls, the reasoning path it takes, even the phrasing of its response can all vary between runs.

This means standard assertions break down. You need a different approach: instead of checking for exact matches, you score the agent's behavior across multiple dimensions and check that those scores stay above acceptable thresholds.

What Makes Agents Hard to Test

  • Non-deterministic outputs — the same input can produce different valid responses across runs
  • Variable tool usage — agents may take different paths through tool calls and reasoning to reach the same result
  • Argument sensitivity — it's not just which tools are called, but what arguments they're called with
  • Stateful behavior — performance can shift based on conversation history and session state
  • Compound effects — changes to prompts, tools, or models can interact in unexpected ways

What You Get From Evaluation

  • Regression detection — catch performance drops before they reach users
  • Confidence in changes — modify prompts, swap models, or add tools knowing you have a safety net
  • Production readiness — validate agent behavior systematically before deployment
  • Measurable progress — track improvement with concrete scores over time

Start Early

Even a handful of test cases with basic thresholds catches most regressions. You don't need a perfect evaluation suite to get value — start simple and expand as your agent matures.

Evaluation Dimensions

The evaluation framework measures agents across three dimensions, each with dedicated metrics.

Tool Trajectory

Did the agent call the right tools, in the right order, with the right arguments?

The TrajectoryEvaluator compares the actual sequence of tool calls against an expected sequence defined in your test cases. It performs an exact match on tool names and arguments.
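For a concrete picture, here is a sketch of how an expected trajectory might be described in a test case. The field names (query, expectedToolUse, toolName, args, referenceResponse) are illustrative assumptions, not the framework's confirmed schema:

```typescript
// Illustrative test case shape. The field names are assumptions,
// not the confirmed ADK-TS schema.
const testCase = {
  query: "What's the weather in Berlin tomorrow?",
  expectedToolUse: [
    // Exact match: the tool name and its arguments must both line up.
    { toolName: "get_forecast", args: { city: "Berlin", days: 1 } },
  ],
  referenceResponse: "Expect light rain in Berlin tomorrow with a high of 14°C.",
};
```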

What it catches:

  • Wrong tool selected for a task
  • Missing tool calls in a workflow
  • Incorrect arguments passed to tools
  • Unnecessary extra tool calls

Response Quality

Is the agent's final response accurate and relevant?

Two approaches are available:

  • ROUGE-1 matching — compares word overlap between the actual response and a reference response. Fast, deterministic, and good for factual answers (a minimal scoring sketch follows this list).
  • LLM-as-judge — uses a separate LLM to qualitatively assess response quality. Better for open-ended responses where multiple valid answers exist.
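To make the ROUGE-1 approach concrete, here is a minimal sketch of unigram-overlap scoring. The framework's actual implementation may differ in tokenization and in whether it reports recall, precision, or F1; this version computes F1:

```typescript
// Minimal ROUGE-1 sketch: unigram overlap between a candidate and a reference.
// Tokenization and the choice of F1 over recall are assumptions here.
function rouge1(candidate: string, reference: string): number {
  const tokens = (s: string) => s.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  const cand = tokens(candidate);
  const ref = tokens(reference);
  if (cand.length === 0 || ref.length === 0) return 0;

  // Count overlapping unigrams, clipped by how often each appears in the reference.
  const refCounts = new Map<string, number>();
  for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of cand) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap++;
      refCounts.set(t, remaining - 1);
    }
  }

  const precision = overlap / cand.length;
  const recall = overlap / ref.length;
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}
```

A high score means the candidate and reference share most of their vocabulary, which works well for short factual answers but penalizes valid paraphrases; that is exactly the case where LLM-as-judge is the better fit.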

Safety

Is the response safe and harmless?

The SafetyEvaluatorV1 evaluates responses for harmful content, producing a binary pass/fail score.

Key Principles

Threshold-Based, Not Exact-Match

Because agent outputs vary between runs, the framework uses thresholds rather than exact assertions. You define minimum acceptable scores for each metric. For example, tool calls must match exactly (threshold 1.0), but response text only needs 80% ROUGE-1 similarity (threshold 0.8).
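As a sketch, those thresholds might be expressed as a criteria map keyed by metric name. The metric keys come from this page, but the exact configuration shape the evaluator accepts is an assumption:

```typescript
// Illustrative criteria map, keyed by the metric names used on this page.
// The exact configuration object ADK-TS expects is an assumption here.
const criteria = {
  tool_trajectory_avg_score: 1.0, // tool calls must match the expected trajectory exactly
  response_match_score: 0.8,      // responses need at least 0.8 ROUGE-1 similarity
};
```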

Multiple Runs for Confidence

A single run can be misleading due to LLM variability. The evaluator defaults to 2 runs per test case and averages the scores. Increase this for higher confidence before deploying.

Fail-Fast Design

The evaluator throws an Error when any metric falls below its threshold. This makes it easy to integrate into CI pipelines and test runners — a non-zero exit code means something regressed.
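Because failures surface as thrown errors, a test runner picks them up without extra assertions. The sketch below assumes a hypothetical AgentEvaluator.evaluate() entry point, an @iqai/adk import path, and Vitest as the runner; the actual ADK-TS API may differ:

```typescript
import { describe, it } from "vitest";

// Hypothetical entry point and import path; the real ADK-TS evaluator API may differ.
import { AgentEvaluator } from "@iqai/adk";
import { weatherAgent } from "./agents/weather-agent";

describe("weather agent evaluation", () => {
  it("meets its evaluation thresholds", async () => {
    // If any metric falls below its threshold, evaluate() throws,
    // the test fails, and CI exits with a non-zero code.
    await AgentEvaluator.evaluate(weatherAgent, "./evals/weather.evalset.json");
  });
});
```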

Composable Metrics

Each metric is evaluated independently. Mix and match based on what matters for your use case:

  • Testing a tool-heavy agent? Focus on tool_trajectory_avg_score.
  • Testing a conversational agent? Focus on response_match_score or response_evaluation_score.
  • Need safety guarantees? Add safety_v1.

Evaluation Limitations

Evaluation is a proxy for real-world performance. High evaluation scores don't guarantee production success, and low scores on one metric don't necessarily mean the agent is broken. Use evaluation as one signal among many.