Metrics and Scoring
How each built-in evaluation metric works, how scores are calculated, and how to configure thresholds
The evaluation framework ships with five prebuilt metrics. Each metric is implemented as an Evaluator subclass and registered in the MetricEvaluatorRegistry.
Overview
| Metric Key | Evaluator Class | Range | Description |
|---|---|---|---|
| tool_trajectory_avg_score | TrajectoryEvaluator | 0-1 | Exact match of tool call sequences |
| response_match_score | ResponseEvaluator | 0-1 | ROUGE-1 F-measure text similarity |
| response_evaluation_score | ResponseEvaluator (Vertex AI) | 1-5 | LLM-based qualitative scoring |
| safety_v1 | SafetyEvaluatorV1 | 0-1 | Safety / harmlessness check |
| final_response_match_v2 | FinalResponseMatchV2Evaluator | 0-1 | LLM judge binary validity (experimental) |
These keys are defined in the PrebuiltMetrics enum:
```ts
import { PrebuiltMetrics } from "@iqai/adk";

const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.8,
};
```

Tool Trajectory Score
Key: tool_trajectory_avg_score
Evaluator: TrajectoryEvaluator
Range: 0.0 - 1.0
Compares the actual sequence of tool calls made by the agent against the expected sequence defined in your test case's intermediateData.toolUses.
How It Works
- For each invocation, the evaluator compares actual vs expected FunctionCall arrays
- It checks an exact match on both name and args (JSON equality) for every tool call
- The sequence must match exactly — same tools, same order, same arguments
- Each invocation scores 1.0 (match) or 0.0 (mismatch)
- The final score is the average across all invocations (see the sketch after this list)
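A minimal sketch of this comparison, assuming a simplified FunctionCall shape with name and args (illustrative only, not the library's actual implementation):

```ts
// Illustrative sketch of the exact-match trajectory comparison.
// Assumes a simplified FunctionCall shape; the real types live in @iqai/adk.
interface FunctionCall {
  name: string;
  args: Record<string, unknown>;
}

// Score a single invocation: 1.0 only if both sequences match exactly.
// Note: JSON.stringify comparison is key-order sensitive; the real evaluator
// may use a more forgiving deep-equality check for args.
function scoreInvocation(actual: FunctionCall[], expected: FunctionCall[]): number {
  if (actual.length !== expected.length) return 0;
  const allMatch = actual.every(
    (call, i) =>
      call.name === expected[i].name &&
      JSON.stringify(call.args) === JSON.stringify(expected[i].args),
  );
  return allMatch ? 1 : 0;
}

// The metric value is the mean of per-invocation scores.
function toolTrajectoryAvgScore(perInvocation: number[]): number {
  return perInvocation.reduce((sum, s) => sum + s, 0) / perInvocation.length;
}
```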
What Triggers a Mismatch
- Different number of tool calls
- Different tool names at any position
- Different arguments (keys or values) for any tool call
- Different ordering of tool calls
Example Test Data
```json
{
  "intermediateData": {
    "toolUses": [
      { "name": "search_web", "args": { "query": "TypeScript generics" } },
      { "name": "summarize", "args": { "maxLength": 200 } }
    ],
    "intermediateResponses": []
  }
}
```

The agent must call search_web with exactly { "query": "TypeScript generics" }, then summarize with exactly { "maxLength": 200 } to score 1.0.
Typical Threshold
{ "tool_trajectory_avg_score": 1.0 }Use 1.0 for strict validation (all tools must match). Lower to 0.5-0.8 if you expect some variability in tool usage across runs.
Response Match Score (ROUGE-1)
Key: response_match_score
Evaluator: ResponseEvaluator
Range: 0.0 - 1.0
Measures word-level overlap between the agent's actual response and the expected finalResponse using the ROUGE-1 F-measure.
How It Works
- Both the actual response and the reference are tokenized into lowercase unigrams (non-word characters removed)
- Precision = common unigrams / response unigrams
- Recall = common unigrams / reference unigrams
- F-measure = 2 × (precision × recall) / (precision + recall)
- Returns 0 if either text is empty (see the sketch after this list)
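The calculation can be sketched as follows (an illustrative reimplementation of the steps above, not the evaluator's actual code):

```ts
// Illustrative ROUGE-1 F-measure following the steps listed above.
function rouge1FMeasure(response: string, reference: string): number {
  // Lowercase unigrams with non-word characters removed.
  const tokenize = (text: string) =>
    text.toLowerCase().split(/\W+/).filter(Boolean);

  const responseTokens = tokenize(response);
  const referenceTokens = tokenize(reference);
  if (responseTokens.length === 0 || referenceTokens.length === 0) return 0;

  // Count overlapping unigrams, clipped by how often they appear in the reference.
  const refCounts = new Map<string, number>();
  for (const token of referenceTokens) {
    refCounts.set(token, (refCounts.get(token) ?? 0) + 1);
  }
  let common = 0;
  for (const token of responseTokens) {
    const remaining = refCounts.get(token) ?? 0;
    if (remaining > 0) {
      common++;
      refCounts.set(token, remaining - 1);
    }
  }

  const precision = common / responseTokens.length;
  const recall = common / referenceTokens.length;
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}
```

For instance, scoring the response "The answer is 4" against the reference "4" gives a precision of 1/4 and a recall of 1/1, which works out to the ~0.4 shown in the table below.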
Scoring Examples
| Reference | Response | Score |
|---|---|---|
"The weather in London is sunny" | "The weather in London is sunny" | 1.0 |
"The weather in London is sunny" | "It's sunny in London today" | ~0.57 |
"4" | "The answer is 4" | ~0.4 |
"Hello world" | "Goodbye universe" | 0.0 |
When to Use
- Factual responses where specific words matter
- Short, precise answers (calculations, lookups)
- Cases where you have a clear "golden" reference
When to Prefer LLM-as-Judge Instead
- Open-ended responses where phrasing varies
- Long-form content where word overlap is a poor proxy for quality
- Responses where meaning matters more than exact wording
Typical Threshold
{ "response_match_score": 0.7 }Use 0.7-0.8 for factual answers. Use 0.5-0.6 for longer, more open-ended responses.
Response Evaluation Score (LLM-as-Judge)
Key: response_evaluation_score
Evaluator: ResponseEvaluator (delegates to Vertex AI)
Range: 1 - 5
Uses a separate LLM to qualitatively assess the agent's response against the reference. The judge model scores on a 1-5 scale.
How It Works
- A judge prompt is constructed with the user query, expected response, and actual response
- The judge LLM evaluates the response
- Multiple samples are taken for reliability
- The scores are aggregated into a single score (see the sketch below)
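The exact aggregation strategy is internal to the evaluator; assuming (this is an assumption, not documented behavior) that the samples are combined with a simple mean, the result interacts with the threshold like this:

```ts
// Hypothetical scores from numSamples = 5 judge calls, each on a 1-5 scale.
const sampleScores = [4, 5, 4, 3, 4];

// Assumption: aggregation is a simple mean over samples.
const aggregated =
  sampleScores.reduce((sum, s) => sum + s, 0) / sampleScores.length; // 4.0

// The aggregated score is compared against the configured threshold.
const passes = aggregated >= 3.5; // true
```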
Configuration
Pass judgeModelOptions when working with the evaluator directly:
```ts
import type { JudgeModelOptions } from "@iqai/adk";

const judgeOptions: JudgeModelOptions = {
  judgeModel: "gemini-2.5-flash",
  numSamples: 5,
};
```

Typical Threshold
{ "response_evaluation_score": 3.5 }Scores below 3 generally indicate poor responses. Aim for 3.5-4.0 for production agents.
Safety Score
Key: safety_v1
Evaluator: SafetyEvaluatorV1
Range: 0 - 1 (binary: 0 = unsafe, 1 = safe)
Evaluates whether the agent's response is safe and harmless.
Typical Threshold
{ "safety_v1": 1.0 }Safety is typically a hard requirement — set to 1.0 to enforce that every response must pass.
Final Response Match V2 (Experimental)
Key: final_response_match_v2
Evaluator: FinalResponseMatchV2Evaluator
Range: 0 - 1
Experimental
This metric may change in future releases. Add it explicitly to your criteria if you want to use it — it is not included in default criteria.
Uses an LLM judge to determine binary validity: is the actual response a valid answer given the reference? Returns the proportion of "valid" judgments across multiple samples.
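As a sketch, the proportion calculation looks like this (the judgments themselves are illustrative):

```ts
// Hypothetical per-sample judgments from the LLM judge (true = "valid").
const judgments = [true, true, false, true, true];

// The metric is the fraction of samples judged valid.
const score = judgments.filter(Boolean).length / judgments.length; // 0.8
```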
Typical Threshold
{ "final_response_match_v2": 0.8 }How Scoring Works End-to-End
1. Inference
The LocalEvalService runs the agent against each eval case. For each Invocation in the conversation, it sends the userContent to the agent and records the actual response and tool calls.
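A simplified sketch of what one invocation in an eval case carries (field shapes are simplified assumptions here; the Testing Agents page covers the actual file format):

```ts
// Simplified sketch of one eval-case invocation; real field shapes may differ.
const invocation = {
  // Sent to the agent during inference.
  userContent: "Summarize the latest TypeScript release notes",
  // Golden reference used by the response metrics.
  finalResponse: "TypeScript 5.x adds ...",
  // Expected tool calls used by tool_trajectory_avg_score.
  intermediateData: {
    toolUses: [{ name: "search_web", args: { query: "TypeScript release notes" } }],
    intermediateResponses: [],
  },
};
```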
2. Metric Evaluation
For each metric in your criteria, the corresponding evaluator compares actual vs expected invocations:
- TrajectoryEvaluator compares intermediateData.toolUses
- ResponseEvaluator compares finalResponse text (ROUGE-1 or LLM judge, depending on the metric)
- SafetyEvaluatorV1 and FinalResponseMatchV2Evaluator use their respective comparison logic
3. Aggregation
Scores are calculated per-invocation, then averaged across all invocations within a test case, then averaged across all runs (when numRuns > 1).
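A sketch of this aggregation, assuming the per-invocation scores have already been computed:

```ts
// Illustrative aggregation: per-invocation -> per-case -> across runs.
const mean = (values: number[]) =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

// Hypothetical per-invocation scores for one eval case, over two runs (numRuns = 2).
const run1Invocations = [1.0, 0.0, 1.0];
const run2Invocations = [1.0, 1.0, 1.0];

const perRunScores = [mean(run1Invocations), mean(run2Invocations)]; // [~0.67, 1.0]
const finalScore = mean(perRunScores); // ~0.83, compared against the threshold
```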
4. Threshold Check
If any metric's aggregated score falls below its threshold, AgentEvaluator throws an Error with details:
```
response_match_score for my_agent Failed. Expected 0.8, but got 0.65.
```

Enable detailed output to see per-case breakdowns:
```ts
await AgentEvaluator.evaluateEvalSet(
  agent,
  evalSet,
  criteria,
  2,
  true, // printDetailedResults — outputs a table with per-case scores
);
```

Choosing the Right Metrics
| Scenario | Recommended Metrics |
|---|---|
| Tool-heavy agent (API calls, searches) | tool_trajectory_avg_score |
| Factual Q&A agent | response_match_score |
| Conversational / creative agent | response_evaluation_score |
| Safety-critical application | safety_v1 |
| General production agent | tool_trajectory_avg_score + response_match_score |
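For example, a general production agent might combine trajectory and response matching in its criteria (the thresholds below are illustrative starting points, not recommendations from the library):

```ts
import { PrebuiltMetrics } from "@iqai/adk";

// Illustrative thresholds for a general production agent.
const criteria = {
  [PrebuiltMetrics.TOOL_TRAJECTORY_AVG_SCORE]: 1.0,
  [PrebuiltMetrics.RESPONSE_MATCH_SCORE]: 0.7,
};
```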
Related Topics
🧪 Testing Agents
Write test cases, configure criteria, and run evaluations with AgentEvaluator
📐 Evaluation Patterns
Domain-specific evaluation strategies, CI/CD gates, and production monitoring
💡 Evaluation Concepts
Core principles and challenges in agent evaluation
🔧 Tools
Tool integration — essential for trajectory scoring