# Evaluation Patterns

Practical patterns for organizing evaluations, CI/CD integration, custom evaluators, and testing multi-agent systems.
## Organizing Test Suites

### By Capability

Separate test files by what they validate:
```
evaluation/
├── tool-usage/
│   ├── search.test.json          # Search tool tests
│   ├── calculator.test.json      # Calculator tool tests
│   └── test_config.json          # Strict trajectory matching
├── response-quality/
│   ├── factual.test.json         # Factual Q&A tests
│   ├── conversational.test.json  # Open-ended response tests
│   └── test_config.json          # Response scoring only
└── safety/
    ├── harmful-inputs.test.json  # Adversarial prompts
    └── test_config.json          # Safety metric only
```

Each directory gets its own `test_config.json` with criteria appropriate to what it tests:
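For example, the tool-usage suite's `test_config.json` might look like this (a sketch; the thresholds are illustrative, but the `criteria` key matches how the config is read later in this guide):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```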
```typescript
import { AgentEvaluator } from "@iqai/adk";

// Run all test suites
await AgentEvaluator.evaluate(agent, "./evaluation");

// Or run a specific suite
await AgentEvaluator.evaluate(agent, "./evaluation/tool-usage");
```

### By Environment
Use different thresholds for different stages:
```
evaluation/
├── smoke/      # Fast, loose thresholds — run on every commit
│   ├── basic.test.json
│   └── test_config.json  # { "response_match_score": 0.5 }
├── standard/   # Full suite — run on PR
│   ├── comprehensive.test.json
│   └── test_config.json  # { "response_match_score": 0.8, "tool_trajectory_avg_score": 1.0 }
└── release/    # Strict — run before deploy
    ├── regression.test.json
    └── test_config.json  # { "response_match_score": 0.9, "safety_v1": 1.0 }
```

## CI/CD Integration
### GitHub Actions
```yaml
name: Agent Evaluation

on:
  pull_request:
    paths:
      - "src/agents/**"
      - "evaluation/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: pnpm install
      - run: pnpm build
      - name: Run agent evaluation
        run: pnpm tsx evaluation/run.ts
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
```

With a simple runner script:
```typescript
// evaluation/run.ts
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

try {
  await AgentEvaluator.evaluate(agent, "./evaluation/standard");
  console.log("Evaluation passed");
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  console.error("Evaluation failed:", message);
  process.exit(1);
}
```

### Quality Gates
Use evaluation as a deployment gate — `AgentEvaluator.evaluate()` throws on failure, so a non-zero exit code blocks the pipeline:
```typescript
// evaluation/release-gate.ts
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

// Strict evaluation for release — 5 runs for higher statistical confidence
await AgentEvaluator.evaluate(agent, "./evaluation/release", 5);
```

## Programmatic Evaluation
For dynamic test generation or custom workflows, use `evaluateEvalSet()` directly:
```typescript
import { AgentEvaluator } from "@iqai/adk";
import type { EvalSet } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

// Map each scenario to its expected response
function getExpectedResponse(scenario: string): string {
  const responses: Record<string, string> = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
    "Explain photosynthesis briefly.":
      "Photosynthesis is the process by which plants convert sunlight into energy.",
  };
  return responses[scenario] ?? "I don't have an answer for that.";
}

function generateTestCases(scenarios: string[]): EvalSet {
  return {
    evalSetId: `dynamic-${Date.now()}`,
    creationTimestamp: Date.now(),
    evalCases: scenarios.map((scenario, i) => ({
      evalId: `case-${i}`,
      conversation: [
        {
          creationTimestamp: Date.now(),
          userContent: {
            role: "user" as const,
            parts: [{ text: scenario }],
          },
          finalResponse: {
            role: "model" as const,
            parts: [{ text: getExpectedResponse(scenario) }],
          },
        },
      ],
    })),
  };
}

const evalSet = generateTestCases([
  "What is 2 + 2?",
  "What is the capital of France?",
  "Explain photosynthesis briefly.",
]);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  { response_match_score: 0.7 },
  3, // numRuns
  true, // printDetailedResults
);
```

## Custom Evaluators
Extend the `Evaluator` base class to create custom metrics:
```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type {
  EvaluationResult,
  PerInvocationResult,
  MetricInfo,
  Invocation,
} from "@iqai/adk";

class ResponseLengthEvaluator extends Evaluator {
  static override getMetricInfo(): MetricInfo {
    return {
      metricName: "response_length_score",
      description:
        "Checks that responses are within an acceptable length range",
      metricValueInfo: {
        interval: {
          minValue: 0,
          maxValue: 1,
          openAtMin: false,
          openAtMax: false,
        },
      },
    };
  }

  async evaluateInvocations(
    actualInvocations: Invocation[],
    expectedInvocations: Invocation[],
  ): Promise<EvaluationResult> {
    const perInvocationResults: PerInvocationResult[] = [];
    let totalScore = 0;

    for (let i = 0; i < actualInvocations.length; i++) {
      const actual = actualInvocations[i];
      const expected = expectedInvocations[i];
      const responseText =
        actual.finalResponse?.parts?.map(p => p.text).join("") ?? "";

      // Score based on whether response length is reasonable
      const wordCount = responseText.split(/\s+/).length;
      const score = wordCount >= 5 && wordCount <= 500 ? 1.0 : 0.0;

      perInvocationResults.push({
        actualInvocation: actual,
        expectedInvocation: expected,
        score,
        evalStatus:
          score >= this.metric.threshold
            ? EvalStatus.PASSED
            : EvalStatus.FAILED,
      });
      totalScore += score;
    }

    const overallScore = totalScore / actualInvocations.length;

    return {
      overallScore,
      overallEvalStatus:
        overallScore >= this.metric.threshold
          ? EvalStatus.PASSED
          : EvalStatus.FAILED,
      perInvocationResults,
    };
  }
}
```

## Testing Multi-Agent Systems
When testing `SequentialAgent`, `ParallelAgent`, or other composite agents, evaluate the overall system behavior rather than individual sub-agents:
```typescript
import { AgentEvaluator, SequentialAgent, LlmAgent } from "@iqai/adk";

const researcher = new LlmAgent({
  name: "researcher",
  model: "gemini-2.5-flash",
  instruction: "Research the topic thoroughly.",
  tools: [searchTool], // Assume searchTool is a pre-defined tool.
});

const writer = new LlmAgent({
  name: "writer",
  model: "gemini-2.5-flash",
  instruction: "Write a summary based on the research.",
});

const pipeline = new SequentialAgent({
  name: "research_pipeline",
  description: "Research and summarize a topic",
  subAgents: [researcher, writer],
});

// Evaluate the pipeline as a whole
await AgentEvaluator.evaluate(pipeline, "./evaluation/pipeline-tests");
```

Test cases for multi-agent systems should focus on end-to-end behavior — the final output and overall tool usage — rather than the internal handoffs between agents.
## Tuning with Multiple Runs
LLM outputs vary between runs. Use multiple runs to get stable scores:
```typescript
// Development: quick feedback
await AgentEvaluator.evaluate(agent, "./tests", 1);

// CI: balanced speed and confidence
await AgentEvaluator.evaluate(agent, "./tests", 2);

// Release gate: high confidence
await AgentEvaluator.evaluate(agent, "./tests", 5);
```

Scores are averaged across runs. If your agent scores 0.9 on one run and 0.7 on another, the final score is 0.8.
## Debugging Failed Evaluations

### Enable Detailed Output

Use `evaluateEvalSet()` with `printDetailedResults: true` to see a per-case breakdown:
```typescript
import * as fs from "node:fs/promises";
import { AgentEvaluator } from "@iqai/adk";

const evalSet = JSON.parse(
  await fs.readFile("./tests/basic.test.json", "utf-8"),
);
const config = JSON.parse(
  await fs.readFile("./tests/test_config.json", "utf-8"),
);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  config.criteria,
  1,
  true, // Prints table: prompt, expected, actual, score per case
);
```

### Common Failure Causes
| Symptom | Likely Cause | Fix |
|---|---|---|
| `tool_trajectory_avg_score` always 0 | Tool args don't match exactly | Check argument spelling, types, and casing |
| `response_match_score` low | Agent rephrases the answer | Loosen threshold or use `response_evaluation_score` instead |
| All metrics fail | Agent not calling the right model | Verify model name and API key |
| Inconsistent scores across runs | High LLM temperature | Lower temperature or increase `numRuns` |
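To see why rephrasing drives `response_match_score` down, note that word-overlap metrics only credit words shared with the reference. This ROUGE-1-style recall function is an illustrative stand-in, not the library's exact formula:

```typescript
// Fraction of reference words that also appear in the candidate response.
function unigramRecall(reference: string, candidate: string): number {
  const refWords = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const candWords = new Set(
    candidate.toLowerCase().split(/\s+/).filter(Boolean),
  );
  if (refWords.length === 0) return 0;
  const hits = refWords.filter(w => candWords.has(w)).length;
  return hits / refWords.length;
}

// An exact answer scores perfectly; a faithful rephrase loses half the credit.
console.log(
  unigramRecall("The capital of France is Paris", "The capital of France is Paris"),
); // 1
console.log(
  unigramRecall("The capital of France is Paris", "Paris is France's capital city"),
); // 0.5
```

This is why the recommended fix is loosening the threshold or switching to an LLM-judged metric, rather than rewriting the agent to parrot the reference wording.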
### Iterative Refinement

1. Start with loose thresholds to establish a baseline
2. Identify which test cases fail and why
3. Fix agent behavior (instructions, tools, model) or adjust expectations
4. Tighten thresholds as the agent improves
5. Add new test cases for failure modes discovered in production
> **Best Practice:** Treat evaluation scores as trends, not absolutes. A score dropping from 0.85 to 0.75 across releases is more meaningful than any single score.
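One way to operationalize trend tracking is to fail a gate on a meaningful drop from a stored baseline rather than on a fixed absolute score. A minimal sketch (the baseline value and tolerance are assumptions, not ADK features):

```typescript
// Flag a regression only when the current score drops more than `tolerance`
// below the stored baseline, instead of judging one absolute number.
function checkTrend(
  baseline: number,
  current: number,
  tolerance = 0.05,
): { regressed: boolean; delta: number } {
  const delta = current - baseline;
  return { regressed: delta < -tolerance, delta };
}

console.log(checkTrend(0.85, 0.84).regressed); // false: a small wobble
console.log(checkTrend(0.85, 0.75).regressed); // true: a 0.10 drop is worth investigating
```

Persist the baseline (for example, in a JSON file committed alongside the eval suite) and update it deliberately when the agent genuinely improves.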