Agent Evaluation
Automated quality assurance for your agents — test behavior, score performance, and catch regressions before they reach production
Your agent works — it answers questions, calls tools, and handles conversations. But how do you know it keeps working after you change a prompt, swap a model, or add a new tool?
Agent evaluation is automated QA for your agent. You define test scenarios with expected outcomes, run your agent against them, and get scored results. If scores drop below your thresholds, the evaluation fails — giving you a safety net before changes reach production.
Status
Evaluation is available in @iqai/adk (public but still evolving). The surface
may receive additive changes; core evaluator signatures are stable.
Key Features
- AgentEvaluator — single entry point: point it at your agent and a test directory, and it handles the rest
- Multiple Metrics — ROUGE scoring, LLM-as-judge, tool trajectory analysis, safety evaluation
- Flexible Test Format — JSON-based test cases with support for multi-turn conversations
- Local & Cloud Evaluation — run evaluations locally or integrate with Vertex AI
- Session State Support — test stateful agents with conversation history
- Configurable Thresholds — set pass/fail criteria for different evaluation metrics
Explore
📋 Evaluation Concepts
Core principles and challenges in agent evaluation
🧪 Testing Agents
Manual and automated approaches to testing agent behavior
📊 Metrics and Scoring
Measurement approaches for trajectory and response quality
🎯 Evaluation Patterns
Domain-specific evaluation strategies and best practices
Core Components
The evaluation framework includes these key components:
- AgentEvaluator — main entry point for automated agent performance assessment
- TrajectoryEvaluator — analyzes tool usage patterns and decision paths
- ResponseEvaluator — ROUGE-1 scoring and LLM-based response quality assessment
- SafetyEvaluatorV1 — evaluates response harmlessness and safety
- EvalSet Management — organized test cases with metadata and version control
- MetricEvaluatorRegistry — extensible system for custom evaluation metrics
- LocalEvalService — complete local evaluation pipeline with parallel execution
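The ResponseEvaluator above is described as using ROUGE-1 scoring. As a rough illustration of what that metric measures (a sketch, not the library's actual implementation — the function names here are hypothetical), ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate response:

```typescript
// Illustrative ROUGE-1 recall sketch; not @iqai/adk internals.
function tokenize(text: string): string[] {
  // Lowercase and split into alphanumeric unigrams
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function rouge1Recall(candidate: string, reference: string): number {
  const refTokens = tokenize(reference);
  if (refTokens.length === 0) return 0;

  // Count candidate unigrams so repeated tokens are not double-matched
  const candCounts = new Map<string, number>();
  for (const t of tokenize(candidate)) {
    candCounts.set(t, (candCounts.get(t) ?? 0) + 1);
  }

  // Matched reference unigrams / total reference unigrams
  let matched = 0;
  for (const t of refTokens) {
    const remaining = candCounts.get(t) ?? 0;
    if (remaining > 0) {
      matched++;
      candCounts.set(t, remaining - 1);
    }
  }
  return matched / refTokens.length;
}

console.log(rouge1Recall("The answer is 4", "4")); // reference fully covered → 1
```

A threshold like `response_match_score: 0.8` then means at least 80% of the reference's unigrams must appear in the agent's response.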
Getting Started
Workflow:

1. Author Dataset — provide either the legacy array format (auto-migrated at runtime; a deprecation warning is emitted):

   ```json
   [{ "query": "...", "reference": "..." }]
   ```

   or the new EvalSet schema (preferred):

   ```json
   {
     "evalSetId": "calc-v1",
     "name": "Simple arithmetic",
     "creationTimestamp": 0,
     "evalCases": [
       {
         "evalId": "case-1",
         "conversation": [
           {
             "creationTimestamp": 0,
             "userContent": {
               "role": "user",
               "parts": [{ "text": "What is 2 + 2?" }]
             },
             "finalResponse": {
               "role": "model",
               "parts": [{ "text": "4" }]
             }
           }
         ]
       }
     ]
   }
   ```

2. Configure Criteria — in `test_config.json` (sibling to each `*.test.json`):

   ```json
   {
     "criteria": {
       "response_match_score": 0.8
     }
   }
   ```

3. Run Evaluation — signature: `await AgentEvaluator.evaluate(agent, pathOrFile, numRuns?)`. Throws an `Error` if any metric falls below its threshold (there is no return value).

4. Iterate — tighten thresholds and expand cases as the agent improves.
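To make the legacy-to-EvalSet auto-migration in step 1 concrete, here is a sketch of what that mapping looks like. The types and helper name are illustrative, inferred from the schema above; they are not the library's actual migration code:

```typescript
// Shapes inferred from the EvalSet JSON schema shown above (illustrative).
interface LegacyCase { query: string; reference: string }
interface ContentPart { text: string }
interface Content { role: string; parts: ContentPart[] }
interface EvalTurn {
  creationTimestamp: number;
  userContent: Content;
  finalResponse: Content;
}
interface EvalCaseShape { evalId: string; conversation: EvalTurn[] }
interface EvalSetShape {
  evalSetId: string;
  name: string;
  creationTimestamp: number;
  evalCases: EvalCaseShape[];
}

// Hypothetical migration helper: each legacy { query, reference } pair
// becomes a single-turn conversation.
function migrateLegacy(cases: LegacyCase[], evalSetId: string): EvalSetShape {
  const now = Date.now();
  return {
    evalSetId,
    name: evalSetId,
    creationTimestamp: now,
    evalCases: cases.map((c, i) => ({
      evalId: `case-${i + 1}`,
      conversation: [
        {
          creationTimestamp: now,
          userContent: { role: "user", parts: [{ text: c.query }] },
          finalResponse: { role: "model", parts: [{ text: c.reference }] },
        },
      ],
    })),
  };
}

const set = migrateLegacy(
  [{ query: "What is 2 + 2?", reference: "4" }],
  "calc-v1",
);
console.log(set.evalCases[0].evalId); // "case-1"
```

The key point is that the legacy `reference` string becomes the `finalResponse` text, which is what `response_match_score` compares against.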
Minimal usage example — use `.build()` when you need the agent instance for evaluation or reuse:

```typescript
import { AgentBuilder, AgentEvaluator } from "@iqai/adk";

const { agent } = await AgentBuilder.create("eval_agent")
  .withModel("gemini-2.5-flash")
  .withInstruction("Answer briefly and accurately.")
  .build();

// Point at a directory — finds all *.test.json files recursively
await AgentEvaluator.evaluate(agent, "./evaluation/tests");
// Throws an Error if any metric fails its threshold
```

`.build()` vs `.ask()`

Use `.build()` when you need an agent instance — for evaluation, multi-agent composition, or reuse across multiple calls. Use `.ask()` for one-off queries where you just need the response.
Programmatic usage with an already-loaded EvalSet:

```typescript
import * as fs from "node:fs/promises";
import { AgentEvaluator } from "@iqai/adk";
import type { EvalSet } from "@iqai/adk";
import { agent } from "./my-agent"; // Your agent instance

const evalSet: EvalSet = JSON.parse(
  await fs.readFile("./tests/basic.test.json", "utf-8"),
);

await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  { response_match_score: 0.8 },
  2, // numRuns
  true, // printDetailedResults
);
```

Default criteria if `test_config.json` is absent:

```json
{ "tool_trajectory_avg_score": 1.0, "response_match_score": 0.8 }
```

Each metric key must be one of: `tool_trajectory_avg_score`, `response_evaluation_score`, `response_match_score`, `safety_v1`.
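Because only those four metric keys are recognized, a typo in `test_config.json` can silently miss the metric you meant to gate on. A small guard like the following (a hypothetical helper, not part of @iqai/adk) can catch such mistakes before an evaluation run:

```typescript
// The four metric keys documented above.
const KNOWN_METRICS = new Set([
  "tool_trajectory_avg_score",
  "response_evaluation_score",
  "response_match_score",
  "safety_v1",
]);

// Hypothetical pre-flight check for a criteria object.
function validateCriteria(criteria: Record<string, number>): string[] {
  const errors: string[] = [];
  for (const [key, threshold] of Object.entries(criteria)) {
    if (!KNOWN_METRICS.has(key)) {
      errors.push(`Unknown metric: ${key}`);
    } else if (threshold < 0 || threshold > 1) {
      errors.push(`Threshold out of range for ${key}: ${threshold}`);
    }
  }
  return errors;
}

console.log(validateCriteria({ response_match_score: 0.8 })); // no errors
console.log(validateCriteria({ rouge: 0.5 })); // flags the unknown key
```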