
Metrics and Scoring

Measurement approaches for trajectory and response quality

Effective agent evaluation requires quantifiable metrics that capture both the quality of agent decisions and the value of their outputs.

Implemented Metrics

The prebuilt metrics currently implemented are tool_trajectory_avg_score, response_match_score (ROUGE‑1), response_evaluation_score (LLM judge, 1–5), safety_v1, and the experimental final_response_match_v2 (binary LLM judge). Configure them via thresholds in the criteria object.

Trajectory Metrics

Trajectory metrics measure the quality of the decision-making process and tool usage patterns.

Tool Usage Analysis

Tool Selection Accuracy:

  • Metric: Percentage of appropriate tool choices for given tasks
  • Calculation: (Correct tool selections / Total tool selections) * 100
  • Threshold: Typically 85%+ for production agents

Tool Trajectory Average Score:

  • Purpose: Compares actual vs expected tool usage sequences
  • Scoring: Binary (1 for matches, 0 for mismatches)
  • Patterns:
    • Exact match: Perfect sequence alignment
    • In-order match: Correct order, allowing extra steps
    • Any-order match: Required tools used regardless of order

Current Implementation Example:

function calculateToolAccuracy(events: Event[], expectedTools: string[]) {
  // Collect the names of every tool the agent actually called.
  const actualTools = events
    .flatMap(e => e.getFunctionCalls())
    .map(fc => fc.name);

  // Count how many expected tools appear among the actual calls.
  const correctSelections = expectedTools.filter(tool =>
    actualTools.includes(tool)
  ).length;

  // Guard against division by zero: an empty expectation list is trivially met.
  const accuracy = expectedTools.length > 0
    ? (correctSelections / expectedTools.length) * 100
    : 100;
  console.log(`Tool selection accuracy: ${accuracy}%`);

  return {
    accuracy,
    expected: expectedTools,
    actual: actualTools,
    correct: correctSelections
  };
}
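
The three matching patterns described above can be sketched as follows; this is an illustrative comparison over tool-name sequences, not the framework's internal matcher:

type MatchMode = 'exact' | 'in-order' | 'any-order';

function matchTrajectory(
  actual: string[],
  expected: string[],
  mode: MatchMode
): boolean {
  switch (mode) {
    case 'exact':
      // Perfect sequence alignment: same tools, same order, no extras.
      return (
        actual.length === expected.length &&
        actual.every((tool, i) => tool === expected[i])
      );
    case 'in-order': {
      // Expected tools appear in order; extra steps in between are allowed.
      let next = 0;
      for (const tool of actual) {
        if (next < expected.length && tool === expected[next]) next++;
      }
      return next === expected.length;
    }
    case 'any-order':
      // Every expected tool was used at least once, order ignored.
      return expected.every(tool => actual.includes(tool));
  }
}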

Decision Quality Metrics

Step Efficiency:

  • Definition: Ratio of necessary steps to total steps taken
  • Formula: Necessary steps / Total steps
  • Target: Minimize unnecessary actions while maintaining effectiveness

Error Recovery Rate:

  • Measurement: Percentage of errors successfully recovered from
  • Calculation: (Successful recoveries / Total errors) * 100
  • Assessment: Track how agents handle tool failures and incorrect responses

Strategy Consistency:

  • Evaluation: Consistent approach across similar scenarios
  • Measurement: Variance in tool usage patterns for equivalent tasks
  • Goal: Minimize unnecessary strategy variation
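
A minimal sketch of the step-efficiency and error-recovery ratios defined above, assuming a hypothetical RunRecord shape whose steps and errors are labeled during review:

interface RunRecord {
  totalSteps: number;
  necessarySteps: number;   // steps labeled as necessary during review
  errors: number;
  recoveredErrors: number;  // errors the agent handled gracefully
}

function decisionQualityMetrics(run: RunRecord) {
  return {
    // Necessary steps / total steps; 1 means no wasted actions.
    stepEfficiency: run.totalSteps > 0 ? run.necessarySteps / run.totalSteps : 1,
    // Percentage of errors the agent recovered from.
    errorRecoveryRate: run.errors > 0 ? (run.recoveredErrors / run.errors) * 100 : 100
  };
}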

Performance Metrics

Response Time Analysis:

async function measureResponseTime(agent: LlmAgent, context: InvocationContext) {
  const startTime = Date.now();

  let responseComplete = false;
  const events: Event[] = [];

  // Stream events until the agent emits its final response.
  for await (const event of agent.runAsync(context)) {
    events.push(event);
    if (event.isFinalResponse()) {
      responseComplete = true;
      break;
    }
  }

  const totalTime = Date.now() - startTime;
  const toolCalls = events.flatMap(e => e.getFunctionCalls()).length;

  return {
    totalTime,
    toolCallCount: toolCalls,
    avgTimePerTool: toolCalls > 0 ? totalTime / toolCalls : 0,
    responseComplete
  };
}

Response Quality Metrics

Response quality metrics assess the value and appropriateness of agent outputs.

Semantic Similarity Scoring

ROUGE‑1 (response_match_score):

  • Key: response_match_score
  • Range: 0.0–1.0 (F1 overlap of unigrams)
  • Typical threshold: 0.7–0.9
  • Applied automatically when response_match_score is present in criteria.
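
For intuition, unigram-overlap F1 can be sketched as below. This is a simplified illustration, not the framework's exact ROUGE‑1 implementation:

function rouge1F1(candidate: string, reference: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const candTokens = tokenize(candidate);
  const refTokens = tokenize(reference);
  if (candTokens.length === 0 || refTokens.length === 0) return 0;

  // Clipped unigram overlap: each reference token can match at most once.
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of candTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap++;
      refCounts.set(t, remaining - 1);
    }
  }

  const precision = overlap / candTokens.length;
  const recall = overlap / refTokens.length;
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}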

LLM Judge (response_evaluation_score):

  • Key: response_evaluation_score
  • Range: 1–5 (higher is better)
  • Provide a threshold (e.g. 3.5); optional judgeModelOptions (model and numSamples) can be supplied when extending the internals.

Tool Trajectory (tool_trajectory_avg_score):

  • Binary per case; average across runs.
  • Requires test samples that include expected_tool_use (legacy path) or embedded tool expectations (planned expansion).

Safety (safety_v1):

  • Evaluates safety/harmlessness; returns a pass/fail-style score (0 or 1) that is compared against the threshold.

Experimental Final Response Match V2 (final_response_match_v2):

  • Binary LLM-judge check of validity against a reference; score in [0, 1].
  • Subject to change; not included in default criteria (add it explicitly if needed).

Current Manual Assessment:

// Note: the responseAddressesQuery, responseIsComplete, responseIsClear, and
// responseIsAccurate predicates used below are placeholders to implement per domain.
function assessResponseQuality(response: string, query: string) {
  const quality = {
    relevance: 0,
    completeness: 0,
    clarity: 0,
    accuracy: 0
  };

  // Relevance assessment
  if (responseAddressesQuery(response, query)) {
    quality.relevance = 1;
  }

  // Completeness check
  if (responseIsComplete(response, query)) {
    quality.completeness = 1;
  }

  // Clarity evaluation
  if (responseIsClear(response)) {
    quality.clarity = 1;
  }

  // Accuracy validation
  if (responseIsAccurate(response, query)) {
    quality.accuracy = 1;
  }

  const overallScore = Object.values(quality).reduce((a, b) => a + b) / 4;
  return { ...quality, overall: overallScore };
}

Content Analysis

Accuracy Assessment:

  • Factual Correctness: Verification against known ground truth
  • Consistency: Alignment with previous responses and established facts
  • Source Attribution: Proper citation when using external information

Relevance Measurement:

  • Query Alignment: How well response addresses the user question
  • Context Appropriateness: Suitable for the conversation context
  • User Intent: Fulfillment of underlying user needs

Completeness Evaluation:

  • Information Coverage: All necessary information provided
  • Follow-up Handling: Addressing natural follow-up questions
  • Detail Level: Appropriate depth for the query complexity

Tone and Style Analysis

Communication Quality:

// detectTone is defined below; assessClarity, checkAppropriatenessForContext,
// and measureEngagementLevel are placeholders to implement for your use case.
function analyzeCommunicationStyle(response: string) {
  return {
    tone: detectTone(response), // friendly, professional, helpful
    clarity: assessClarity(response), // clear, confusing, verbose
    appropriateness: checkAppropriatenessForContext(response),
    engagement: measureEngagementLevel(response)
  };
}

function detectTone(text: string): string {
  // Simple keyword-based analysis
  const friendlyWords = ['please', 'thank you', 'happy to help'];
  const professionalWords = ['regarding', 'furthermore', 'consequently'];

  if (friendlyWords.some(word => text.toLowerCase().includes(word))) {
    return 'friendly';
  }
  if (professionalWords.some(word => text.toLowerCase().includes(word))) {
    return 'professional';
  }
  return 'neutral';
}

Composite Scoring

Weighted Evaluation

Combine multiple metrics for holistic assessment:

interface EvaluationWeights {
  trajectory: number;      // 0.4 - How agents reach solutions
  accuracy: number;        // 0.3 - Correctness of responses
  relevance: number;       // 0.2 - Appropriateness to query
  efficiency: number;      // 0.1 - Speed and resource usage
}

function calculateCompositeScore(
  trajectoryScore: number,
  accuracyScore: number,
  relevanceScore: number,
  efficiencyScore: number,
  weights: EvaluationWeights = {
    trajectory: 0.4,
    accuracy: 0.3,
    relevance: 0.2,
    efficiency: 0.1
  }
): number {
  return (
    trajectoryScore * weights.trajectory +
    accuracyScore * weights.accuracy +
    relevanceScore * weights.relevance +
    efficiencyScore * weights.efficiency
  );
}
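
Usage with the default weights (which should sum to 1.0 so the composite stays on the same 0–1 scale as its inputs):

const composite = calculateCompositeScore(0.9, 0.85, 0.8, 0.7);
// 0.9 * 0.4 + 0.85 * 0.3 + 0.8 * 0.2 + 0.7 * 0.1 = 0.845
console.log(composite.toFixed(3)); // "0.845"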

Pass/Fail Criteria

Threshold-Based Evaluation:

interface QualityThresholds {
  trajectory: number;      // Minimum 0.85
  accuracy: number;        // Minimum 0.90
  relevance: number;       // Minimum 0.75
  overall: number;         // Minimum 0.80
}

function evaluateAgentPerformance(
  scores: Record<string, number>,
  thresholds: QualityThresholds
): { pass: boolean; details: Record<string, boolean> } {
  const details = {
    trajectory: scores.trajectory >= thresholds.trajectory,
    accuracy: scores.accuracy >= thresholds.accuracy,
    relevance: scores.relevance >= thresholds.relevance,
    overall: scores.overall >= thresholds.overall
  };

  const pass = Object.values(details).every(Boolean);

  return { pass, details };
}

Statistical Analysis

Confidence Intervals

Agent responses vary between runs, so repeated evaluations give a more reliable estimate of performance:

// evaluateSingleRun is a placeholder for your per-run evaluation logic.
async function runMultipleEvaluations(
  agent: LlmAgent,
  query: string,
  runs: number = 5
) {
  const scores: number[] = [];

  for (let i = 0; i < runs; i++) {
    const result = await evaluateSingleRun(agent, query);
    scores.push(result.overallScore);
  }

  const mean = scores.reduce((a, b) => a + b) / scores.length;
  const variance = scores.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / scores.length;
  const stdDev = Math.sqrt(variance);
  // The 95% CI for the mean uses the standard error (stdDev / sqrt(n)), not stdDev.
  // For small n, a t-distribution multiplier is more accurate than 1.96.
  const stdErr = stdDev / Math.sqrt(scores.length);

  return {
    mean,
    stdDev,
    confidence95: [mean - 1.96 * stdErr, mean + 1.96 * stdErr],
    allScores: scores
  };
}

Trend Analysis

Track performance over time:

interface PerformanceRecord {
  timestamp: Date;
  version: string;
  scores: Record<string, number>;
  testScenario: string;
}

function analyzePerformanceTrend(records: PerformanceRecord[]) {
  const byVersion = records.reduce((acc, record) => {
    if (!acc[record.version]) acc[record.version] = [];
    acc[record.version].push(record.scores.overall);
    return acc;
  }, {} as Record<string, number[]>);

  return Object.entries(byVersion).map(([version, scores]) => ({
    version,
    avgScore: scores.reduce((a, b) => a + b) / scores.length,
    minScore: Math.min(...scores),
    maxScore: Math.max(...scores),
    sampleSize: scores.length
  }));
}

Domain-Specific Metrics

Customer Support Agents

Resolution Rate:

  • Metric: Percentage of issues successfully resolved
  • Measurement: Track escalation vs resolution outcomes
  • Target: 80%+ first-contact resolution
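
A simple tally over labeled conversation outcomes; the SupportOutcome labels here are hypothetical:

type SupportOutcome = 'resolved' | 'escalated' | 'abandoned';

function resolutionRate(outcomes: SupportOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const resolved = outcomes.filter(o => o === 'resolved').length;
  return (resolved / outcomes.length) * 100;
}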

User Satisfaction Proxy:

  • Indicators: Response helpfulness, clarity, completeness
  • Measurement: Structured evaluation of response quality
  • Benchmarking: Compare against human agent performance

Information Retrieval Agents

Search Accuracy:

  • Precision: Relevant results / Total results returned
  • Recall: Relevant results found / Total relevant results available
  • F1 Score: Harmonic mean of precision and recall
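
A sketch of these three scores over result IDs, assuming ground-truth relevance labels are available:

function searchAccuracy(returned: string[], relevant: Set<string>) {
  // Results that are both returned and labeled relevant.
  const relevantReturned = returned.filter(id => relevant.has(id)).length;
  const precision = returned.length > 0 ? relevantReturned / returned.length : 0;
  const recall = relevant.size > 0 ? relevantReturned / relevant.size : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}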

Source Attribution:

  • Metric: Percentage of claims with proper citations
  • Quality: Accuracy and reliability of cited sources
  • Compliance: Adherence to attribution requirements

Task Automation Agents

Completion Rate:

  • Success: Tasks completed successfully without intervention
  • Efficiency: Time to completion vs baseline
  • Error Handling: Graceful handling of unexpected conditions

Integration Quality:

  • API Usage: Proper integration with external systems
  • Data Handling: Correct processing and transformation
  • Error Recovery: Resilience to system failures

Available Evaluation Metrics

Current Implementation

The following metrics are available in the ADK evaluation framework. Additional advanced analytics are planned for future releases.

Built-in Metric Keys (use in criteria):

| Key | Description | Range | Typical Threshold |
| --- | --- | --- | --- |
| tool_trajectory_avg_score | Average binary match of expected vs actual tool sequences | 0–1 | 1.0 (strict) |
| response_match_score | ROUGE‑1 similarity (F1) | 0–1 | 0.7–0.9 |
| response_evaluation_score | LLM judge qualitative score | 1–5 | 3–4 |
| safety_v1 | Safety / harmlessness (pass/fail mapped) | 0–1 | 1.0 |
| final_response_match_v2 (experimental) | LLM-judged exactness vs reference | 0–1 | 0.8–0.95 |

Criteria Configuration Example (test_config.json):

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "safety_v1": 1.0
  }
}
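
Each criteria entry is a per-metric minimum threshold: a run passes when every configured metric meets its threshold. The following is a simplified sketch of that gating logic, not the framework's internal implementation:

function meetsCriteria(
  computed: Record<string, number>,
  criteria: Record<string, number>
): boolean {
  // Every configured metric must reach its minimum threshold.
  return Object.entries(criteria).every(
    ([metric, threshold]) => (computed[metric] ?? 0) >= threshold
  );
}

// With the config above and hypothetical computed scores:
meetsCriteria(
  { tool_trajectory_avg_score: 1.0, response_match_score: 0.85, safety_v1: 1.0 },
  { tool_trajectory_avg_score: 1.0, response_match_score: 0.8, safety_v1: 1.0 }
); // true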


Future Capabilities:

  • Real-time Analytics: Live performance dashboards
  • Comparative Analysis: Benchmarks against other agents
  • Predictive Metrics: Early warning signals for performance degradation

Integration Capabilities:

  • CI/CD Pipeline Integration: Automated quality gates
  • Performance Monitoring: Production quality tracking
  • Alert Systems: Threshold-based notifications
  • Trend Analysis: Long-term performance patterns

Best Practices

Metric Selection

  • Relevance: Choose metrics that align with business objectives
  • Balance: Combine objective measurements with subjective assessment
  • Actionability: Select metrics that guide improvement efforts
  • Consistency: Maintain consistent measurement approaches over time

Threshold Setting

  • Baseline Establishment: Use initial performance as the baseline
  • Gradual Improvement: Incrementally raise quality standards
  • Context Sensitivity: Adjust thresholds for different use cases
  • Stakeholder Alignment: Ensure thresholds match business requirements

Performance Monitoring

  • Regular Assessment: Continuously evaluate key metrics
  • Trend Analysis: Monitor performance patterns over time
  • Root Cause Analysis: Investigate performance degradations
  • Improvement Tracking: Measure the impact of optimization efforts
