
Evaluation Patterns

Domain-specific evaluation strategies and best practices

Different types of agents require evaluation approaches tailored to their specific use cases and performance requirements.

Customer Support Agents

Customer support agents need evaluation that focuses on issue resolution, user satisfaction, and escalation handling.

Key Evaluation Areas

Issue Resolution Effectiveness:

  • First Contact Resolution: Percentage of issues resolved without escalation
  • Resolution Accuracy: Correctness of solutions provided
  • Resolution Time: Speed of problem solving
  • Follow-up Requirements: Need for additional assistance

Communication Quality:

  • Tone Appropriateness: Professional yet empathetic communication
  • Clarity: Easy-to-understand explanations
  • Completeness: Comprehensive response coverage
  • Personalization: Appropriate use of user context

Knowledge Base Usage:

  • Search Accuracy: Finding relevant information efficiently
  • Information Synthesis: Combining multiple sources effectively
  • Source Attribution: Proper citation of knowledge sources
  • Gap Identification: Recognizing when information is unavailable

Evaluation Implementation

// These examples assume LlmAgent, Runner, InMemorySessionService, and the other
// ADK-TS types are imported from your ADK-TS installation.
async function evaluateCustomerSupportAgent(
  agent: LlmAgent,
  supportScenarios: CustomerSupportScenario[]
) {
  const results = [];

  for (const scenario of supportScenarios) {
    const sessionService = new InMemorySessionService();
    // Use the same app name for the session and the runner so the runner can
    // locate the session it is asked to run against.
    const session = await sessionService.createSession("support_app", "customer");

    const runner = new Runner({
      appName: "support_app",
      agent,
      sessionService,
    });

    // Run the support scenario
    const events = [];
    for await (const event of runner.runAsync({
      userId: "customer",
      sessionId: session.id,
      newMessage: { parts: [{ text: scenario.userIssue }] },
    })) {
      events.push(event);
    }

    // Evaluate the interaction
    const evaluation = {
      scenario: scenario.type,
      resolved: checkIssueResolution(events, scenario),
      escalated: checkEscalation(events),
      tone: analyzeTone(events),
      accuracy: validateAccuracy(events, scenario.expectedSolution),
      toolsUsed: extractToolUsage(events)
    };

    results.push(evaluation);
  }

  return analyzeSupportResults(results);
}

interface CustomerSupportScenario {
  type: 'billing' | 'technical' | 'account' | 'general';
  userIssue: string;
  expectedSolution: string;
  shouldEscalate: boolean;
  requiredTools: string[];
}
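
As a usage illustration, a scenario set might look like the following. The issue text, expected solutions, and tool names here are hypothetical placeholders, and an existing agent instance is assumed.

const supportScenarios: CustomerSupportScenario[] = [
  {
    type: 'billing',
    userIssue: "I was charged twice for my subscription this month.",
    expectedSolution: "Identify the duplicate charge and initiate a refund.",
    shouldEscalate: false,
    requiredTools: ['lookup_invoice', 'issue_refund'] // hypothetical tool names
  },
  {
    type: 'technical',
    userIssue: "The mobile app crashes whenever I open the settings screen.",
    expectedSolution: "Collect diagnostics and escalate to the engineering queue.",
    shouldEscalate: true,
    requiredTools: ['collect_diagnostics'] // hypothetical tool name
  }
];

const supportReport = await evaluateCustomerSupportAgent(agent, supportScenarios);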

Success Metrics

Resolution Metrics:

  • Target: 85%+ first-contact resolution for standard issues
  • Escalation Rate: Less than 15% for routine inquiries
  • Accuracy Rate: 95%+ for factual information

User Experience Metrics:

  • Response Clarity: Subjective assessment of explanation quality
  • Tone Appropriateness: Professional and empathetic communication
  • Completeness: All aspects of inquiry addressed
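
The analyzeSupportResults helper used above is not defined in this example. A minimal sketch, assuming each per-scenario evaluation has the shape produced by evaluateCustomerSupportAgent and that validateAccuracy returns a 0-1 score, could aggregate against these targets:

function analyzeSupportResults(
  results: Array<{ resolved: boolean; escalated: boolean; accuracy: number }>
) {
  const total = results.length;
  const firstContactResolution = results.filter(r => r.resolved && !r.escalated).length / total;
  const escalationRate = results.filter(r => r.escalated).length / total;
  const averageAccuracy = results.reduce((sum, r) => sum + r.accuracy, 0) / total;

  return {
    firstContactResolution,
    escalationRate,
    averageAccuracy,
    // Targets taken from the success metrics above
    meetsTargets:
      firstContactResolution >= 0.85 &&
      escalationRate < 0.15 &&
      averageAccuracy >= 0.95
  };
}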

Task Automation Agents

Task automation agents require evaluation focused on workflow completion, integration quality, and error handling.

Key Evaluation Areas

Workflow Completion:

  • End-to-End Success: Complete task execution without intervention
  • Step Accuracy: Correct execution of individual workflow steps
  • State Management: Proper handling of workflow state
  • Dependency Handling: Managing task dependencies and prerequisites

Integration Quality:

  • API Usage: Proper integration with external systems
  • Data Transformation: Accurate processing and formatting
  • Error Handling: Graceful handling of system failures
  • Authentication: Proper credential and access management

Efficiency Metrics:

  • Execution Time: Speed of task completion
  • Resource Usage: Optimal use of system resources
  • Retry Logic: Intelligent handling of transient failures
  • Parallel Processing: Effective use of concurrent operations

Evaluation Implementation

interface AutomationTask {
  name: string;
  steps: TaskStep[];
  expectedDuration: number; // in milliseconds; compared against measured execution time
  requiredTools: string[];
  successCriteria: SuccessCriteria;
}

interface TaskStep {
  action: string;
  expectedTool: string;
  expectedData: any;
  dependencies: string[];
}

async function evaluateAutomationAgent(
  agent: LlmAgent,
  automationTasks: AutomationTask[]
) {
  const results = [];

  for (const task of automationTasks) {
    const startTime = Date.now();

    try {
      const events = await executeAutomationTask(agent, task);
      const executionTime = Date.now() - startTime;

      const evaluation = {
        taskName: task.name,
        completed: checkTaskCompletion(events, task),
        accuracy: validateStepAccuracy(events, task.steps),
        efficiency: executionTime <= task.expectedDuration * 1.2, // within 120% of expected time (see success metrics below)
        toolUsage: analyzeToolUsage(events, task.requiredTools),
        errorHandling: evaluateErrorRecovery(events),
        executionTime
      };

      results.push(evaluation);
    } catch (error) {
      results.push({
        taskName: task.name,
        completed: false,
        error: error instanceof Error ? error.message : String(error),
        executionTime: Date.now() - startTime
      });
    }
  }

  return analyzeAutomationResults(results);
}
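
The executeAutomationTask helper is not shown above. A minimal sketch, reusing the same Runner and session setup as the customer support example (the prompt construction is a placeholder), might look like this:

async function executeAutomationTask(agent: LlmAgent, task: AutomationTask) {
  const sessionService = new InMemorySessionService();
  const session = await sessionService.createSession("automation_app", "operator");

  const runner = new Runner({
    appName: "automation_app",
    agent,
    sessionService,
  });

  const events = [];
  for await (const event of runner.runAsync({
    userId: "operator",
    sessionId: session.id,
    // Placeholder prompt; real tasks would carry structured instructions
    newMessage: { parts: [{ text: `Execute task: ${task.name}` }] },
  })) {
    events.push(event);
  }

  return events;
}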

Success Metrics

Completion Metrics:

  • Target: 95%+ successful task completion
  • Efficiency: Within 120% of expected execution time
  • Error Recovery: 90%+ recovery from transient failures

Quality Metrics:

  • Step Accuracy: 98%+ correct step execution
  • Data Integrity: 100% accurate data transformation
  • Integration: Zero authentication or API errors
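
The analyzeAutomationResults helper referenced above can be sketched the same way, aggregating per-task results against these thresholds. This assumes the evaluation objects produced by evaluateAutomationAgent; error-recovery scoring is omitted because the return shape of evaluateErrorRecovery is not defined here:

function analyzeAutomationResults(
  results: Array<{ completed: boolean; efficiency?: boolean }>
) {
  const total = results.length;
  const completionRate = results.filter(r => r.completed).length / total;
  const withinTimeBudget = results.filter(r => r.efficiency).length / total;

  return {
    completionRate,
    withinTimeBudget,
    // 95%+ completion target from the metrics above
    meetsCompletionTarget: completionRate >= 0.95
  };
}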

Information Retrieval Agents

Information retrieval agents need evaluation focused on search accuracy, source quality, and information synthesis.

Key Evaluation Areas

Search Effectiveness:

  • Query Understanding: Interpretation of user information needs
  • Search Strategy: Effective use of search tools and techniques
  • Result Relevance: Quality of retrieved information
  • Coverage: Comprehensive information gathering

Source Quality:

  • Reliability: Use of trustworthy information sources
  • Currency: Access to up-to-date information
  • Attribution: Proper citation and source tracking
  • Diversity: Multiple perspectives and sources

Information Synthesis:

  • Accuracy: Correct interpretation of source material
  • Coherence: Logical organization of information
  • Completeness: Comprehensive coverage of the topic
  • Bias Awareness: Recognition and mitigation of source bias

Evaluation Implementation

interface InformationQuery {
  query: string;
  domain: string;
  expectedSources: string[];
  requiredInformation: string[];
  qualityThreshold: number;
}

async function evaluateRetrievalAgent(
  agent: LlmAgent,
  queries: InformationQuery[]
) {
  const results = [];

  for (const query of queries) {
    const events = await executeInformationQuery(agent, query.query);

    const evaluation = {
      query: query.query,
      domain: query.domain,
      searchAccuracy: evaluateSearchAccuracy(events, query),
      sourceQuality: assessSourceQuality(events, query.expectedSources),
      informationCompleteness: checkInformationCoverage(events, query.requiredInformation),
      synthesisQuality: evaluateSynthesis(events),
      citations: validateCitations(events)
    };

    results.push(evaluation);
  }

  return analyzeRetrievalResults(results);
}

function evaluateSearchAccuracy(events: Event[], query: InformationQuery) {
  const searchEvents = events.filter(e =>
    e.getFunctionCalls().some(fc => fc.name.includes('search'))
  );

  const relevantResults = searchEvents.filter(e =>
    containsRelevantInformation(e, query.requiredInformation)
  );

  return {
    precision: searchEvents.length > 0 ? relevantResults.length / searchEvents.length : 0,
    recall: calculateRecall(relevantResults, query.requiredInformation),
    f1Score: calculateF1Score(relevantResults, query)
  };
}
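
The recall and F1 helpers referenced above are left to the implementation. As a sketch: recall is the fraction of required information items found in the relevant results, and F1 is the harmonic mean of precision and recall. harmonicF1 below is a renamed helper that just shows the formula, and containsRelevantInformation is the same hypothetical matcher used above:

function calculateRecall(relevantResults: Event[], requiredInformation: string[]) {
  if (requiredInformation.length === 0) return 1;
  const found = requiredInformation.filter(item =>
    relevantResults.some(event => containsRelevantInformation(event, [item]))
  );
  return found.length / requiredInformation.length;
}

// F1 = 2 * precision * recall / (precision + recall)
function harmonicF1(precision: number, recall: number) {
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}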

Success Metrics

Accuracy Metrics:

  • Search Precision: 80%+ relevant results
  • Information Recall: 90%+ required information found
  • Source Reliability: 95%+ authoritative sources

Quality Metrics:

  • Citation Accuracy: 100% proper attribution
  • Synthesis Coherence: Subjective assessment of organization
  • Bias Mitigation: Recognition of perspective limitations

Multi-Agent System Evaluation

Multi-agent systems require evaluation of coordination, communication, and collective performance.

Key Evaluation Areas

Agent Coordination:

  • Task Distribution: Effective allocation of work across agents
  • Communication: Clear inter-agent information exchange
  • Synchronization: Proper timing of collaborative activities
  • Conflict Resolution: Handling of contradictory agent outputs

System Performance:

  • Collective Accuracy: Combined performance of the system versus individual agents
  • Efficiency Gains: Benefits of parallelization and specialization
  • Redundancy Management: Optimal use of multiple agents
  • Fault Tolerance: System resilience to individual agent failures

Evaluation Implementation

async function evaluateMultiAgentSystem(
  agents: BaseAgent[],
  collaborativeScenarios: CollaborativeScenario[]
) {
  const results = [];

  for (const scenario of collaborativeScenarios) {
    // Create coordinating agent that manages the sub-agents
    const coordinator = new SequentialAgent({
      name: "coordinator",
      description: "Coordinates multiple specialized agents",
      subAgents: agents
    });

    const events = await executeCollaborativeTask(coordinator, scenario);

    const evaluation = {
      scenario: scenario.name,
      overallSuccess: checkCollaborativeSuccess(events, scenario),
      agentContributions: analyzeAgentContributions(events, agents),
      coordination: evaluateCoordination(events),
      efficiency: compareToBaselinePerformance(events, scenario),
      conflictResolution: assessConflictHandling(events)
    };

    results.push(evaluation);
  }

  return analyzeMultiAgentResults(results);
}
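
The analyzeAgentContributions helper can be sketched as a simple grouping of events by their producing agent. This assumes each agent exposes a name property and each event records its author, as in the monitoring example below:

function analyzeAgentContributions(events: Event[], agents: BaseAgent[]) {
  return agents.map(agent => {
    const agentEvents = events.filter(e => e.author === agent.name);
    return {
      agent: agent.name,
      eventCount: agentEvents.length,
      functionCallCount: agentEvents.flatMap(e => e.getFunctionCalls()).length,
      shareOfEvents: events.length > 0 ? agentEvents.length / events.length : 0
    };
  });
}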

Production Evaluation Patterns

Continuous Monitoring

Real-time Quality Assessment:

class ProductionEvaluationMonitor {
  constructor(
    private qualityThresholds: QualityThresholds,
    private alertCallbacks: ((issue: QualityIssue) => void)[] = []
  ) {}

  async monitorAgentInteraction(
    agentResponse: Event,
    userFeedback?: UserFeedback
  ) {
    const quality = await this.assessResponseQuality(agentResponse);

    if (quality.overall < this.qualityThresholds.overall) {
      await this.triggerQualityAlert({
        type: 'quality_degradation',
        agent: agentResponse.author,
        score: quality.overall,
        threshold: this.qualityThresholds.overall,
        timestamp: new Date()
      });
    }

    if (userFeedback) {
      await this.incorporateUserFeedback(agentResponse, userFeedback);
    }
  }

  private async assessResponseQuality(response: Event) {
    const relevance = await this.calculateRelevance(response);
    const accuracy = await this.validateAccuracy(response);
    const helpfulness = await this.assessHelpfulness(response);

    return {
      relevance,
      accuracy,
      helpfulness,
      // Simple composite; weight the components to suit your domain
      overall: (relevance + accuracy + helpfulness) / 3
    };
  }

  private async triggerQualityAlert(issue: QualityIssue) {
    for (const callback of this.alertCallbacks) {
      callback(issue);
    }
  }

  // calculateRelevance, validateAccuracy, assessHelpfulness, and
  // incorporateUserFeedback are implementation-specific and omitted here.
}
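
Wiring the monitor into a serving path might look like this, assuming the constructor shown above and that QualityThresholds has an overall field:

const monitor = new ProductionEvaluationMonitor(
  { overall: 0.7 }, // hypothetical threshold value
  [issue => console.warn("Quality alert:", issue)]
);

// After each production response event:
// await monitor.monitorAgentInteraction(responseEvent, optionalUserFeedback);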

A/B Testing

Comparative Agent Evaluation:

async function runAgentABTest(
  agentA: LlmAgent,
  agentB: LlmAgent,
  testScenarios: TestScenario[],
  trafficSplit: number = 0.5
) {
  const resultsA = [];
  const resultsB = [];

  for (const scenario of testScenarios) {
    const useAgentA = Math.random() < trafficSplit;
    const agent = useAgentA ? agentA : agentB;

    const result = await evaluateScenario(agent, scenario);

    if (useAgentA) {
      resultsA.push(result);
    } else {
      resultsB.push(result);
    }
  }

  return {
    agentA: calculateAggregateMetrics(resultsA),
    agentB: calculateAggregateMetrics(resultsB),
    significance: calculateStatisticalSignificance(resultsA, resultsB),
    recommendation: determineWinningAgent(resultsA, resultsB)
  };
}
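
The calculateStatisticalSignificance helper is left abstract above. One simple approach is a two-proportion z-test on scenario success rates; this sketch assumes each result exposes a boolean success field, which is not defined in this example:

function calculateStatisticalSignificance(
  resultsA: Array<{ success: boolean }>,
  resultsB: Array<{ success: boolean }>
) {
  const n1 = resultsA.length;
  const n2 = resultsB.length;
  if (n1 === 0 || n2 === 0) return { zScore: 0, significantAt95: false };

  const p1 = resultsA.filter(r => r.success).length / n1;
  const p2 = resultsB.filter(r => r.success).length / n2;
  const pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  const zScore = standardError === 0 ? 0 : (p1 - p2) / standardError;

  // |z| > 1.96 corresponds to p < 0.05 for a two-sided test
  return { zScore, significantAt95: Math.abs(zScore) > 1.96 };
}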

Best Practices

Evaluation Design Principles

Domain Alignment:

  • Relevant Metrics: Choose metrics that matter for your specific use case
  • Realistic Scenarios: Test with scenarios that reflect real usage
  • Stakeholder Input: Include business requirements in evaluation criteria
  • User Perspective: Prioritize metrics that impact user experience

Comprehensive Coverage:

  • Happy Path: Test normal, expected interactions
  • Edge Cases: Include unusual and challenging scenarios
  • Error Conditions: Test resilience to failures and invalid inputs
  • Scale Testing: Evaluate performance under various load conditions

Implementation Guidelines

Automation Balance:

  • Quick Feedback: Automated tests for rapid development iteration
  • Human Judgment: Manual evaluation for subjective quality aspects
  • Hybrid Approach: Combine automated metrics with human validation
  • Progressive Enhancement: Start simple, add complexity gradually

Data Management:

  • Test Data Quality: Ensure test scenarios are accurate and representative
  • Version Control: Track evaluation data and criteria changes
  • Reproducibility: Maintain consistent testing environments
  • Privacy Compliance: Handle sensitive data appropriately in testing

Iterative Improvement

Start with basic evaluation patterns for your domain, then gradually add sophistication as you learn what metrics best predict real-world performance.

Integration Strategies

Development Workflow:

  • CI/CD Integration: Automated evaluation in deployment pipelines
  • Feature Development: Evaluation-driven feature development
  • Release Gates: Quality thresholds for production deployment
  • Performance Tracking: Continuous monitoring of key metrics
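
As an illustration of a release gate, a small script run in the pipeline can fail the deployment when aggregate scores fall below your thresholds. This sketch reuses the hypothetical analyzeSupportResults report from earlier and assumes an existing agent and scenario set:

async function releaseGate() {
  const report = await evaluateCustomerSupportAgent(agent, supportScenarios);

  if (!report.meetsTargets) {
    console.error("Release gate failed: metrics below thresholds", report);
    process.exit(1); // non-zero exit fails the pipeline step
  }

  console.log("Release gate passed", report);
}

releaseGate();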

Team Collaboration:

  • Shared Understanding: Clear communication of evaluation criteria
  • Regular Review: Periodic assessment of evaluation effectiveness
  • Knowledge Sharing: Document lessons learned and best practices
  • Stakeholder Engagement: Regular updates on agent performance

The key to successful agent evaluation is choosing the right combination of metrics and approaches for your specific domain while maintaining focus on real-world performance and user satisfaction.