Metrics and Scoring
How each built-in evaluation metric works, how scores are calculated, and how to configure thresholds
The evaluation framework ships with five prebuilt metrics. Each metric is implemented as an Evaluator subclass and registered in the MetricEvaluatorRegistry.
Overview
| Metric Key | Evaluator Class | Range | Description |
|---|---|---|---|
| tool_trajectory_avg_score | TrajectoryEvaluator | 0-1 | Exact match of tool call sequences |
| response_match_score | ResponseEvaluator | 0-1 | ROUGE-1 F-measure text similarity |
| response_evaluation_score | ResponseEvaluator (Vertex AI) | 1-5 | LLM-based qualitative scoring |
| safety_v1 | SafetyEvaluatorV1 | 0-1 | Safety / harmlessness check |
| final_response_match_v2 | FinalResponseMatchV2Evaluator | 0-1 | LLM judge binary validity (experimental) |
These keys are defined in the PrebuiltMetrics enum:
```ts
import { PrebuiltMetrics } from "@iqai/adk";

const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};
```

Tool Trajectory Score
Key: tool_trajectory_avg_score
Evaluator: TrajectoryEvaluator
Range: 0.0 - 1.0
Compares the actual sequence of tool calls made by the agent against the expected sequence defined in your test case's intermediateData.toolUses.
How It Works
- For each invocation, the evaluator compares actual vs expected FunctionCall arrays
- It checks an exact match on both name and args (JSON equality) for every tool call
- The sequence must match exactly — same tools, same order, same arguments
- Each invocation scores 1.0 (match) or 0.0 (mismatch)
- The final score is the average across all invocations (see the sketch after this list)
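A minimal sketch of this comparison, assuming a simplified FunctionCall shape with name and args (illustrative only, not the library's actual implementation):

```ts
// Illustrative sketch of the exact-match trajectory comparison.
// Assumes a simplified FunctionCall shape; the real types live in @iqai/adk.
interface FunctionCall {
  name: string;
  args: Record<string, unknown>;
}

// Score a single invocation: 1.0 only if both sequences match exactly.
// Note: JSON.stringify comparison is key-order sensitive; the real evaluator
// may use a more forgiving deep-equality check for args.
function scoreInvocation(actual: FunctionCall[], expected: FunctionCall[]): number {
  if (actual.length !== expected.length) return 0;
  const allMatch = actual.every(
    (call, i) =>
      call.name === expected[i].name &&
      JSON.stringify(call.args) === JSON.stringify(expected[i].args),
  );
  return allMatch ? 1 : 0;
}

// The metric value is the mean of per-invocation scores.
function toolTrajectoryAvgScore(perInvocation: number[]): number {
  return perInvocation.reduce((sum, s) => sum + s, 0) / perInvocation.length;
}
```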
What Triggers a Mismatch
- Different number of tool calls
- Different tool names at any position
- Different arguments (keys or values) for any tool call
- Different ordering of tool calls
Example Test Data
```json
{
  "intermediateData": {
    "toolUses": [
      { "name": "search_web", "args": { "query": "TypeScript generics" } },
      { "name": "summarize", "args": { "maxLength": 200 } }
    ],
    "intermediateResponses": []
  }
}
```

The agent must call search_web with exactly { "query": "TypeScript generics" }, then summarize with exactly { "maxLength": 200 } to score 1.0.
Typical Threshold
{ "tool_trajectory_avg_score": 1.0 }Use 1.0 for strict validation (all tools must match). Lower to 0.5-0.8 if you expect some variability in tool usage across runs.
Response Match Score (ROUGE-1)
Key: response_match_score
Evaluator: ResponseEvaluator
Range: 0.0 - 1.0
Measures word-level overlap between the agent's actual response and the expected finalResponse using the ROUGE-1 F-measure.
How It Works
- Both the actual response and the reference are tokenized into lowercase unigrams (non-word characters removed)
- Precision = common unigrams / response unigrams
- Recall = common unigrams / reference unigrams
- F-measure = 2 × (precision × recall) / (precision + recall)
- Returns 0 if either text is empty (see the sketch after this list)
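The calculation can be sketched as follows (an illustrative reimplementation of the steps above, not the evaluator's actual code):

```ts
// Illustrative ROUGE-1 F-measure following the steps listed above.
function rouge1FMeasure(response: string, reference: string): number {
  // Lowercase unigrams with non-word characters removed.
  const tokenize = (text: string) =>
    text.toLowerCase().split(/\W+/).filter(Boolean);

  const responseTokens = tokenize(response);
  const referenceTokens = tokenize(reference);
  if (responseTokens.length === 0 || referenceTokens.length === 0) return 0;

  // Count overlapping unigrams, clipped by how often they appear in the reference.
  const refCounts = new Map<string, number>();
  for (const token of referenceTokens) {
    refCounts.set(token, (refCounts.get(token) ?? 0) + 1);
  }
  let common = 0;
  for (const token of responseTokens) {
    const remaining = refCounts.get(token) ?? 0;
    if (remaining > 0) {
      common++;
      refCounts.set(token, remaining - 1);
    }
  }

  const precision = common / responseTokens.length;
  const recall = common / referenceTokens.length;
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}
```

For instance, scoring the response "The answer is 4" against the reference "4" gives a precision of 1/4 and a recall of 1/1, which works out to the ~0.4 shown in the table below.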
Scoring Examples
| Reference | Response | Score |
|---|---|---|
"The weather in London is sunny" | "The weather in London is sunny" | 1.0 |
"The weather in London is sunny" | "It's sunny in London today" | ~0.57 |
"4" | "The answer is 4" | ~0.4 |
"Hello world" | "Goodbye universe" | 0.0 |
When to Use
- Factual responses where specific words matter
- Short, precise answers (calculations, lookups)
- Cases where you have a clear "golden" reference
When to Prefer LLM-as-Judge Instead
- Open-ended responses where phrasing varies
- Long-form content where word overlap is a poor proxy for quality
- Responses where meaning matters more than exact wording
Typical Threshold
{ "response_match_score": 0.7 }Use 0.7-0.8 for factual answers. Use 0.5-0.6 for longer, more open-ended responses.
Response Evaluation Score (LLM-as-Judge)
Key: response_evaluation_score
Evaluator: ResponseEvaluator (delegates to Vertex AI)
Range: 1 - 5
Uses a separate LLM to qualitatively assess the agent's response against the reference. The judge model scores on a 1-5 scale.
How It Works
- A judge prompt is constructed with the user query, expected response, and actual response
- The judge LLM evaluates the response
- Multiple samples are taken for reliability
- The scores are aggregated into a single score (see the sketch below)
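The exact aggregation strategy is internal to the evaluator; assuming (this is an assumption, not documented behavior) that the samples are combined with a simple mean, the result interacts with the threshold like this:

```ts
// Hypothetical scores from numSamples = 5 judge calls, each on a 1-5 scale.
const sampleScores = [4, 5, 4, 3, 4];

// Assumption: aggregation is a simple mean over samples.
const aggregated =
  sampleScores.reduce((sum, s) => sum + s, 0) / sampleScores.length; // 4.0

// The aggregated score is compared against the configured threshold.
const passes = aggregated >= 3.5; // true
```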
Configuration
Pass judgeModelOptions when working with the evaluator directly:
```ts
import type { JudgeModelOptions } from "@iqai/adk";

const judgeOptions: JudgeModelOptions = {
  judgeModel: "gemini-2.5-flash",
  numSamples: 5,
};
```

Typical Threshold
{ "response_evaluation_score": 3.5 }Scores below 3 generally indicate poor responses. Aim for 3.5-4.0 for production agents.
Safety Score
Key: safety_v1
Evaluator: SafetyEvaluatorV1
Range: 0 - 1 (binary: 0 = unsafe, 1 = safe)
Evaluates whether the agent's response is safe and harmless.
Typical Threshold
{ "safety_v1": 1.0 }Safety is typically a hard requirement — set to 1.0 to enforce that every response must pass.
Final Response Match V2 (Experimental)
Key: final_response_match_v2
Evaluator: FinalResponseMatchV2Evaluator
Range: 0 - 1
Experimental
This metric may change in future releases. Add it explicitly to your criteria if you want to use it — it is not included in default criteria.
Uses an LLM judge to determine binary validity: is the actual response a valid answer given the reference? Returns the proportion of "valid" judgments across multiple samples.
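As a sketch, the proportion calculation looks like this (the judgments themselves are illustrative):

```ts
// Hypothetical per-sample judgments from the LLM judge (true = "valid").
const judgments = [true, true, false, true, true];

// The metric is the fraction of samples judged valid.
const score = judgments.filter(Boolean).length / judgments.length; // 0.8
```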
Typical Threshold
{ "final_response_match_v2": 0.8 }How Scoring Works End-to-End
1. Inference
The LocalEvalService runs the agent against each eval case. For each Invocation in the conversation, it sends the userContent to the agent and records the actual response and tool calls.
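A simplified sketch of what one invocation in an eval case carries (field shapes are simplified assumptions here; the Testing Agents page covers the actual file format):

```ts
// Simplified sketch of one eval-case invocation; real field shapes may differ.
const invocation = {
  // Sent to the agent during inference.
  userContent: "Summarize the latest TypeScript release notes",
  // Golden reference used by the response metrics.
  finalResponse: "TypeScript 5.x adds ...",
  // Expected tool calls used by tool_trajectory_avg_score.
  intermediateData: {
    toolUses: [{ name: "search_web", args: { query: "TypeScript release notes" } }],
    intermediateResponses: [],
  },
};
```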
2. Metric Evaluation
For each metric in your criteria, the corresponding evaluator compares actual vs expected invocations:
- TrajectoryEvaluator compares intermediateData.toolUses
- ResponseEvaluator compares finalResponse text (ROUGE-1 or LLM judge, depending on the metric)
- SafetyEvaluatorV1 and FinalResponseMatchV2Evaluator use their respective comparison logic
3. Aggregation
Scores are calculated per-invocation, then averaged across all invocations within a test case, then averaged across all runs (when numRuns > 1).
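A sketch of this aggregation, assuming the per-invocation scores have already been computed:

```ts
// Illustrative aggregation: per-invocation -> per-case -> across runs.
const mean = (values: number[]) =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

// Hypothetical per-invocation scores for one eval case, over two runs (numRuns = 2).
const run1Invocations = [1.0, 0.0, 1.0];
const run2Invocations = [1.0, 1.0, 1.0];

const perRunScores = [mean(run1Invocations), mean(run2Invocations)]; // [~0.67, 1.0]
const finalScore = mean(perRunScores); // ~0.83, compared against the threshold
```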
4. Threshold Check
If any metric's aggregated score falls below its threshold, AgentEvaluator throws an Error with details:
```
response_match_score for my_agent Failed. Expected 0.8, but got 0.65.
```

Enable detailed output to see per-case breakdowns:
```ts
await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  criteria,
  2,
  true, // printDetailedResults — outputs a table with per-case scores
);
```

Choosing the Right Metrics
| Scenario | Recommended Metrics |
|---|---|
| Tool-heavy agent (API calls, searches) | tool_trajectory_avg_score |
| Factual Q&A agent | response_match_score |
| Conversational / creative agent | response_evaluation_score |
| Safety-critical application | safety_v1 |
| General production agent | tool_trajectory_avg_score + response_match_score |
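For example, a general production agent might combine trajectory and response matching in its criteria (the thresholds below are illustrative starting points, not recommendations from the library):

```ts
import { PrebuiltMetrics } from "@iqai/adk";

// Illustrative thresholds for a general production agent.
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.7,
};
```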
Related Topics
🧪 Testing Agents
Write test cases, configure criteria, and run evaluations with AgentEvaluator
📐 Evaluation Patterns
Domain-specific evaluation strategies, CI/CD gates, and production monitoring
💡 Evaluation Concepts
Core principles and challenges in agent evaluation
🔧 Tools
Tool integration — essential for trajectory scoring