# Evaluation Patterns

Practical patterns for organizing evaluations, CI/CD integration, custom evaluators, and testing multi-agent systems.
## Organizing Test Suites

### By Capability

Separate test files by what they validate:
```
evaluation/
├── tool-usage/
│   ├── search.test.json          # Search tool tests
│   ├── calculator.test.json      # Calculator tool tests
│   └── test_config.json          # Strict trajectory matching
├── response-quality/
│   ├── factual.test.json         # Factual Q&A tests
│   ├── conversational.test.json  # Open-ended response tests
│   └── test_config.json          # Response scoring only
└── safety/
    ├── harmful-inputs.test.json  # Adversarial prompts
    └── test_config.json          # Safety metric only
```

Each directory gets its own `test_config.json` with criteria appropriate to what it tests:
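For example, the tool-usage suite's `test_config.json` might look like this (a sketch; the thresholds are illustrative, but the `criteria` key matches how the config is read later in this guide):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```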
```typescript
import { AgentEvaluator } from "@iqai/adk";

// Run all test suites
await AgentEvaluator.evaluate(agent, "./evaluation");

// Or run a specific suite
await AgentEvaluator.evaluate(agent, "./evaluation/tool-usage");
```

### By Environment
Use different thresholds for different stages:
```
evaluation/
├── smoke/      # Fast, loose thresholds — run on every commit
│   ├── basic.test.json
│   └── test_config.json  # { "response_match_score": 0.5 }
├── standard/   # Full suite — run on PR
│   ├── comprehensive.test.json
│   └── test_config.json  # { "response_match_score": 0.8, "tool_trajectory_avg_score": 1.0 }
└── release/    # Strict — run before deploy
    ├── regression.test.json
    └── test_config.json  # { "response_match_score": 0.9, "safety_v1": 1.0 }
```

## CI/CD Integration
### GitHub Actions
```yaml
name: Agent Evaluation

on:
  pull_request:
    paths:
      - "src/agents/**"
      - "evaluation/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: pnpm install
      - run: pnpm build
      - name: Run agent evaluation
        run: pnpm tsx evaluation/run.ts
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
```

With a simple runner script:
```typescript
// evaluation/run.ts
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

try {
  await AgentEvaluator.evaluate(agent, "./evaluation/standard");
  console.log("Evaluation passed");
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  console.error("Evaluation failed:", message);
  process.exit(1);
}
```

### Quality Gates
Use evaluation as a deployment gate — `AgentEvaluator.evaluate()` throws on failure, so a non-zero exit code blocks the pipeline:
```typescript
// evaluation/release-gate.ts
import { AgentEvaluator } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

// Strict evaluation for release — 5 runs for higher statistical confidence
await AgentEvaluator.evaluate(agent, "./evaluation/release", 5);
```

## Programmatic Evaluation
For dynamic test generation or custom workflows, use `evaluateEvalSet()` directly:
```typescript
import { AgentEvaluator } from "@iqai/adk";
import type { EvalSet } from "@iqai/adk";
import { getMyAgent } from "../src/agents/my-agent";

const { agent } = await getMyAgent();

// Map each scenario to its expected response
function getExpectedResponse(scenario: string): string {
  const responses: Record<string, string> = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
    "Explain photosynthesis briefly.":
      "Photosynthesis is the process by which plants convert sunlight into energy.",
  };
  return responses[scenario] ?? "I don't have an answer for that.";
}

function generateTestCases(scenarios: string[]): EvalSet {
  return {
    evalSetId: `dynamic-${Date.now()}`,
    creationTimestamp: Date.now(),
    evalCases: scenarios.map((scenario, i) => ({
      evalId: `case-${i}`,
      conversation: [
        {
          creationTimestamp: Date.now(),
          userContent: {
            role: "user" as const,
            parts: [{ text: scenario }],
          },
          finalResponse: {
            role: "model" as const,
            parts: [{ text: getExpectedResponse(scenario) }],
          },
        },
      ],
    })),
  };
}

const evalSet = generateTestCases([
  "What is 2 + 2?",
  "What is the capital of France?",
  "Explain photosynthesis briefly.",
]);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  { response_match_score: 0.7 },
  3, // numRuns
  true, // printDetailedResults
);
```

## Custom Evaluators
Extend the `Evaluator` base class to create custom metrics:
```typescript
import { Evaluator, EvalStatus } from "@iqai/adk";
import type {
  EvaluationResult,
  PerInvocationResult,
  MetricInfo,
  Invocation,
} from "@iqai/adk";

class ResponseLengthEvaluator extends Evaluator {
  static override getMetricInfo(): MetricInfo {
    return {
      metricName: "response_length_score",
      description:
        "Checks that responses are within an acceptable length range",
      metricValueInfo: {
        interval: {
          minValue: 0,
          maxValue: 1,
          openAtMin: false,
          openAtMax: false,
        },
      },
    };
  }

  async evaluateInvocations(
    actualInvocations: Invocation[],
    expectedInvocations: Invocation[],
  ): Promise<EvaluationResult> {
    const perInvocationResults: PerInvocationResult[] = [];
    let totalScore = 0;

    for (let i = 0; i < actualInvocations.length; i++) {
      const actual = actualInvocations[i];
      const expected = expectedInvocations[i];
      const responseText =
        actual.finalResponse?.parts?.map(p => p.text).join("") ?? "";

      // Score based on whether response length is reasonable
      const wordCount = responseText.split(/\s+/).length;
      const score = wordCount >= 5 && wordCount <= 500 ? 1.0 : 0.0;

      perInvocationResults.push({
        actualInvocation: actual,
        expectedInvocation: expected,
        score,
        evalStatus:
          score >= this.metric.threshold
            ? EvalStatus.PASSED
            : EvalStatus.FAILED,
      });
      totalScore += score;
    }

    const overallScore = totalScore / actualInvocations.length;

    return {
      overallScore,
      overallEvalStatus:
        overallScore >= this.metric.threshold
          ? EvalStatus.PASSED
          : EvalStatus.FAILED,
      perInvocationResults,
    };
  }
}
```

## Testing Multi-Agent Systems
When testing `SequentialAgent`, `ParallelAgent`, or other composite agents, evaluate the overall system behavior rather than individual sub-agents:
```typescript
import { AgentEvaluator, SequentialAgent, LlmAgent } from "@iqai/adk";

const researcher = new LlmAgent({
  name: "researcher",
  model: "gemini-2.5-flash",
  instruction: "Research the topic thoroughly.",
  tools: [searchTool], // Assume searchTool is a pre-defined tool.
});

const writer = new LlmAgent({
  name: "writer",
  model: "gemini-2.5-flash",
  instruction: "Write a summary based on the research.",
});

const pipeline = new SequentialAgent({
  name: "research_pipeline",
  description: "Research and summarize a topic",
  subAgents: [researcher, writer],
});

// Evaluate the pipeline as a whole
await AgentEvaluator.evaluate(pipeline, "./evaluation/pipeline-tests");
```

Test cases for multi-agent systems should focus on end-to-end behavior — the final output and overall tool usage — rather than the internal handoffs between agents.
## Tuning with Multiple Runs
LLM outputs vary between runs. Use multiple runs to get stable scores:
```typescript
// Development: quick feedback
await AgentEvaluator.evaluate(agent, "./tests", 1);

// CI: balanced speed and confidence
await AgentEvaluator.evaluate(agent, "./tests", 2);

// Release gate: high confidence
await AgentEvaluator.evaluate(agent, "./tests", 5);
```

Scores are averaged across runs. If your agent scores 0.9 on one run and 0.7 on another, the final score is 0.8.
## Debugging Failed Evaluations

### Enable Detailed Output

Use `evaluateEvalSet()` with `printDetailedResults: true` to see a per-case breakdown:
```typescript
import * as fs from "node:fs/promises";
import { AgentEvaluator } from "@iqai/adk";

const evalSet = JSON.parse(
  await fs.readFile("./tests/basic.test.json", "utf-8"),
);
const config = JSON.parse(
  await fs.readFile("./tests/test_config.json", "utf-8"),
);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  config.criteria,
  1,
  true, // Prints table: prompt, expected, actual, score per case
);
```

### Common Failure Causes
| Symptom | Likely Cause | Fix |
|---|---|---|
| `tool_trajectory_avg_score` always 0 | Tool args don't match exactly | Check argument spelling, types, and casing |
| `response_match_score` low | Agent rephrases the answer | Loosen threshold or use `response_evaluation_score` instead |
| All metrics fail | Agent not calling the right model | Verify model name and API key |
| Inconsistent scores across runs | High LLM temperature | Lower temperature or increase `numRuns` |
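To see why rephrasing drives `response_match_score` down, note that word-overlap metrics only credit words shared with the reference. This ROUGE-1-style recall function is an illustrative stand-in, not the library's exact formula:

```typescript
// Fraction of reference words that also appear in the candidate response.
function unigramRecall(reference: string, candidate: string): number {
  const refWords = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const candWords = new Set(
    candidate.toLowerCase().split(/\s+/).filter(Boolean),
  );
  if (refWords.length === 0) return 0;
  const hits = refWords.filter(w => candWords.has(w)).length;
  return hits / refWords.length;
}

// An exact answer scores perfectly; a faithful rephrase loses half the credit.
console.log(
  unigramRecall("The capital of France is Paris", "The capital of France is Paris"),
); // 1
console.log(
  unigramRecall("The capital of France is Paris", "Paris is France's capital city"),
); // 0.5
```

This is why the recommended fix is loosening the threshold or switching to an LLM-judged metric, rather than rewriting the agent to parrot the reference wording.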
### Iterative Refinement

1. Start with loose thresholds to establish a baseline
2. Identify which test cases fail and why
3. Fix agent behavior (instructions, tools, model) or adjust expectations
4. Tighten thresholds as the agent improves
5. Add new test cases for failure modes discovered in production
> **Best Practice:** Treat evaluation scores as trends, not absolutes. A score dropping from 0.85 to 0.75 across releases is more meaningful than any single score.
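One way to operationalize trend tracking is to fail a gate on a meaningful drop from a stored baseline rather than on a fixed absolute score. A minimal sketch (the baseline value and tolerance are assumptions, not ADK features):

```typescript
// Flag a regression only when the current score drops more than `tolerance`
// below the stored baseline, instead of judging one absolute number.
function checkTrend(
  baseline: number,
  current: number,
  tolerance = 0.05,
): { regressed: boolean; delta: number } {
  const delta = current - baseline;
  return { regressed: delta < -tolerance, delta };
}

console.log(checkTrend(0.85, 0.84).regressed); // false: a small wobble
console.log(checkTrend(0.85, 0.75).regressed); // true: a 0.10 drop is worth investigating
```

Persist the baseline (for example, in a JSON file committed alongside the eval suite) and update it deliberately when the agent genuinely improves.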