
Metrics and Scoring

How each built-in evaluation metric works, how scores are calculated, and how to configure thresholds

The evaluation framework ships with five prebuilt metrics. Each metric is implemented as an Evaluator subclass and registered in the MetricEvaluatorRegistry.

Overview

| Metric Key | Evaluator Class | Range | Description |
| --- | --- | --- | --- |
| tool_trajectory_avg_score | TrajectoryEvaluator | 0-1 | Exact match of tool call sequences |
| response_match_score | ResponseEvaluator | 0-1 | ROUGE-1 F-measure text similarity |
| response_evaluation_score | ResponseEvaluator (Vertex AI) | 1-5 | LLM-based qualitative scoring |
| safety_v1 | SafetyEvaluatorV1 | 0-1 | Safety / harmlessness check |
| final_response_match_v2 | FinalResponseMatchV2Evaluator | 0-1 | LLM judge binary validity (experimental) |

These keys are defined in the PrebuiltMetrics enum:

import { PrebuiltMetrics } from "@iqai/adk";

const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};

Tool Trajectory Score

Key: tool_trajectory_avg_score
Evaluator: TrajectoryEvaluator
Range: 0.0 - 1.0

Compares the actual sequence of tool calls made by the agent against the expected sequence defined in your test case's intermediateData.toolUses.

How It Works

  1. For each invocation, the evaluator compares actual vs expected FunctionCall arrays
  2. It checks exact match on both name and args (JSON equality) for every tool call
  3. The sequence must match exactly — same tools, same order, same arguments
  4. Each invocation scores 1.0 (match) or 0.0 (mismatch)
  5. The final score is the average across all invocations
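The steps above can be sketched in TypeScript. This is a simplified illustration, not the library's actual internals: the `FunctionCall` shape and helper names are assumptions, and real JSON equality should be order-insensitive rather than relying on `JSON.stringify`.

```typescript
// Assumed shape of a recorded tool call (illustrative, not the library's type).
interface FunctionCall {
  name: string;
  args: Record<string, unknown>;
}

// One invocation matches only if tool count, names, order, and args all agree.
// Note: JSON.stringify comparison is key-order sensitive; shown for brevity.
function invocationMatches(
  actual: FunctionCall[],
  expected: FunctionCall[],
): boolean {
  if (actual.length !== expected.length) return false;
  return actual.every(
    (call, i) =>
      call.name === expected[i].name &&
      JSON.stringify(call.args) === JSON.stringify(expected[i].args),
  );
}

// Final score: average of the per-invocation 1.0 / 0.0 scores.
function trajectoryScore(
  invocations: Array<{ actual: FunctionCall[]; expected: FunctionCall[] }>,
): number {
  if (invocations.length === 0) return 0;
  const matches = invocations.filter((inv) =>
    invocationMatches(inv.actual, inv.expected),
  ).length;
  return matches / invocations.length;
}
```

For example, two invocations where one matches and one does not would score 0.5.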

What Triggers a Mismatch

  • Different number of tool calls
  • Different tool names at any position
  • Different arguments (keys or values) for any tool call
  • Different ordering of tool calls

Example Test Data

{
  "intermediateData": {
    "toolUses": [
      { "name": "search_web", "args": { "query": "TypeScript generics" } },
      { "name": "summarize", "args": { "maxLength": 200 } }
    ],
    "intermediateResponses": []
  }
}

The agent must call search_web with exactly { "query": "TypeScript generics" }, then summarize with exactly { "maxLength": 200 } to score 1.0.

Typical Threshold

{ "tool_trajectory_avg_score": 1.0 }

Use 1.0 for strict validation (all tools must match). Lower to 0.5-0.8 if you expect some variability in tool usage across runs.

Response Match Score (ROUGE-1)

Key: response_match_score
Evaluator: ResponseEvaluator
Range: 0.0 - 1.0

Measures word-level overlap between the agent's actual response and the expected finalResponse using the ROUGE-1 F-measure.

How It Works

  1. Both the actual response and the reference are tokenized into lowercase unigrams (non-word characters removed)
  2. Precision = common unigrams / response unigrams
  3. Recall = common unigrams / reference unigrams
  4. F-measure = 2 x (precision x recall) / (precision + recall)
  5. Returns 0 if either text is empty
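The calculation above can be sketched as follows. This is a minimal re-implementation for illustration; the exact tokenizer behavior (lowercasing, stripping non-word characters) is an assumption based on the description above, not the library's code.

```typescript
// Lowercase unigram tokenizer: strip non-word characters, split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// ROUGE-1 F-measure between a response and a reference.
function rouge1F(response: string, reference: string): number {
  const resp = tokenize(response);
  const ref = tokenize(reference);
  if (resp.length === 0 || ref.length === 0) return 0;

  // Count overlapping unigrams (multiset intersection).
  const refCounts = new Map<string, number>();
  for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);
  let common = 0;
  for (const t of resp) {
    const n = refCounts.get(t) ?? 0;
    if (n > 0) {
      common++;
      refCounts.set(t, n - 1);
    }
  }
  if (common === 0) return 0;

  const precision = common / resp.length;
  const recall = common / ref.length;
  return (2 * precision * recall) / (precision + recall);
}
```

For instance, reference `"4"` against response `"The answer is 4"` gives precision 1/4 and recall 1/1, so F = 0.4.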

Scoring Examples

| Reference | Response | Score |
| --- | --- | --- |
| "The weather in London is sunny" | "The weather in London is sunny" | 1.0 |
| "The weather in London is sunny" | "It's sunny in London today" | ~0.57 |
| "4" | "The answer is 4" | ~0.4 |
| "Hello world" | "Goodbye universe" | 0.0 |

When to Use

  • Factual responses where specific words matter
  • Short, precise answers (calculations, lookups)
  • Cases where you have a clear "golden" reference

When to Prefer LLM-as-Judge Instead

  • Open-ended responses where phrasing varies
  • Long-form content where word overlap is a poor proxy for quality
  • Responses where meaning matters more than exact wording

Typical Threshold

{ "response_match_score": 0.7 }

Use 0.7-0.8 for factual answers. Use 0.5-0.6 for longer, more open-ended responses.

Response Evaluation Score (LLM-as-Judge)

Key: response_evaluation_score
Evaluator: ResponseEvaluator (delegates to Vertex AI)
Range: 1 - 5

Uses a separate LLM to qualitatively assess the agent's response against the reference. The judge model scores on a 1-5 scale.

How It Works

  1. A judge prompt is constructed with the user query, expected response, and actual response
  2. The judge LLM evaluates the response
  3. Multiple samples are taken for reliability
  4. The scores are aggregated
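A plausible sketch of the sampling-and-aggregation step. The real judge prompt and aggregation strategy are internal to the evaluator; averaging the 1-5 scores and the `judgeOnce` callback are assumptions made for illustration.

```typescript
// Hypothetical sketch: sample the judge model several times and average
// the 1-5 ratings. `judgeOnce` stands in for a real judge-model call.
async function judgeScore(
  judgeOnce: () => Promise<number>, // returns a single 1-5 rating
  numSamples: number,
): Promise<number> {
  const scores: number[] = [];
  for (let i = 0; i < numSamples; i++) {
    scores.push(await judgeOnce());
  }
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```

Taking multiple samples (`numSamples`) smooths out run-to-run variance in the judge model's ratings.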

Configuration

Pass judgeModelOptions when working with the evaluator directly:

import type { JudgeModelOptions } from "@iqai/adk";

const judgeOptions: JudgeModelOptions = {
  judgeModel: "gemini-2.5-flash",
  numSamples: 5,
};

Typical Threshold

{ "response_evaluation_score": 3.5 }

Scores below 3 generally indicate poor responses. Aim for 3.5-4.0 for production agents.

Safety Score

Key: safety_v1
Evaluator: SafetyEvaluatorV1
Range: 0 - 1 (binary: 0 = unsafe, 1 = safe)

Evaluates whether the agent's response is safe and harmless.

Typical Threshold

{ "safety_v1": 1.0 }

Safety is typically a hard requirement — set to 1.0 to enforce that every response must pass.

Final Response Match V2 (Experimental)

Key: final_response_match_v2
Evaluator: FinalResponseMatchV2Evaluator
Range: 0 - 1

Experimental

This metric may change in future releases. Add it explicitly to your criteria if you want to use it — it is not included in default criteria.

Uses an LLM judge to determine binary validity: is the actual response a valid answer given the reference? Returns the proportion of "valid" judgments across multiple samples.
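The "proportion of valid judgments" can be sketched like this (the judgment labels are an assumption for illustration, not the evaluator's actual types):

```typescript
// Score = fraction of judge samples that deemed the response valid.
function validityScore(judgments: Array<"valid" | "invalid">): number {
  if (judgments.length === 0) return 0;
  const valid = judgments.filter((j) => j === "valid").length;
  return valid / judgments.length;
}
```

So four "valid" judgments out of five samples would yield a score of 0.8.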

Typical Threshold

{ "final_response_match_v2": 0.8 }

How Scoring Works End-to-End

1. Inference

The LocalEvalService runs the agent against each eval case. For each Invocation in the conversation, it sends the userContent to the agent and records the actual response and tool calls.

2. Metric Evaluation

For each metric in your criteria, the corresponding evaluator compares actual vs expected invocations:

  • TrajectoryEvaluator compares intermediateData.toolUses
  • ResponseEvaluator compares finalResponse text (ROUGE-1 or LLM judge depending on metric)
  • SafetyEvaluatorV1 and FinalResponseMatchV2Evaluator use their respective comparison logic

3. Aggregation

Scores are calculated per-invocation, then averaged across all invocations within a test case, then averaged across all runs (when numRuns > 1).
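The aggregation order can be sketched as follows. This is an illustrative reduction of the description above, not the service's actual code:

```typescript
function average(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// runs[r][i] = score of invocation i in run r, for one test case.
// First average within each run, then average across runs.
function aggregateCaseScore(runs: number[][]): number {
  const perRun = runs.map(average);
  return average(perRun);
}
```

For example, with `numRuns = 2`, per-invocation scores `[1, 0]` in run one and `[1, 1]` in run two aggregate to (0.5 + 1.0) / 2 = 0.75.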

4. Threshold Check

If any metric's aggregated score falls below its threshold, AgentEvaluator throws an Error with details:

response_match_score for my_agent Failed. Expected 0.8, but got 0.65.
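The threshold check itself amounts to something like the following sketch (a simplified stand-in, assuming criteria and aggregated scores keyed by metric name; the error format mirrors the message shown above):

```typescript
// Fail fast if any metric's aggregated score falls below its threshold.
function checkThresholds(
  criteria: Record<string, number>,
  scores: Record<string, number>,
  agentName: string,
): void {
  for (const [metric, threshold] of Object.entries(criteria)) {
    const score = scores[metric] ?? 0;
    if (score < threshold) {
      throw new Error(
        `${metric} for ${agentName} Failed. Expected ${threshold}, but got ${score}.`,
      );
    }
  }
}
```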

Enable detailed output to see per-case breakdowns:

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  criteria,
  2,
  true, // printDetailedResults — outputs a table with per-case scores
);

Choosing the Right Metrics

| Scenario | Recommended Metrics |
| --- | --- |
| Tool-heavy agent (API calls, searches) | tool_trajectory_avg_score |
| Factual Q&A agent | response_match_score |
| Conversational / creative agent | response_evaluation_score |
| Safety-critical application | safety_v1 |
| General production agent | tool_trajectory_avg_score + response_match_score |