Evaluation Patterns
Domain-specific evaluation strategies and best practices
Different types of agents require evaluation approaches tailored to their use cases and performance requirements.
Customer Support Agents
Customer support agents need evaluation that focuses on issue resolution, user satisfaction, and escalation handling.
Key Evaluation Areas
Issue Resolution Effectiveness:
- First Contact Resolution: Percentage of issues resolved without escalation
- Resolution Accuracy: Correctness of solutions provided
- Resolution Time: Speed of problem solving
- Follow-up Requirements: Need for additional assistance
Communication Quality:
- Tone Appropriateness: Professional yet empathetic communication
- Clarity: Easy-to-understand explanations
- Completeness: Comprehensive response coverage
- Personalization: Appropriate use of user context
Knowledge Base Usage:
- Search Accuracy: Finding relevant information efficiently
- Information Synthesis: Combining multiple sources effectively
- Source Attribution: Proper citation of knowledge sources
- Gap Identification: Recognizing when information is unavailable
Evaluation Implementation
async function evaluateCustomerSupportAgent(
  agent: LlmAgent,
  supportScenarios: CustomerSupportScenario[]
) {
  const results = [];
  for (const scenario of supportScenarios) {
    const sessionService = new InMemorySessionService();
    // The session must be created under the same app name the runner uses
    const session = await sessionService.createSession("support_app", "customer");
    const runner = new Runner({
      appName: "support_app",
      agent,
      sessionService,
    });

    // Run the support scenario
    const events = [];
    for await (const event of runner.runAsync({
      userId: "customer",
      sessionId: session.id,
      newMessage: { role: "user", parts: [{ text: scenario.userIssue }] },
    })) {
      events.push(event);
    }

    // Evaluate the interaction
    const evaluation = {
      scenario: scenario.type,
      resolved: checkIssueResolution(events, scenario),
      escalated: checkEscalation(events),
      tone: analyzeTone(events),
      accuracy: validateAccuracy(events, scenario.expectedSolution),
      toolsUsed: extractToolUsage(events),
    };
    results.push(evaluation);
  }
  return analyzeSupportResults(results);
}
interface CustomerSupportScenario {
  type: 'billing' | 'technical' | 'account' | 'general';
  userIssue: string;
  expectedSolution: string;
  shouldEscalate: boolean;
  requiredTools: string[];
}
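The helper functions above (checkIssueResolution, analyzeTone, and so on) are domain-specific and left to your implementation. As one illustration, a minimal checkEscalation might scan the event stream for calls to an escalation tool; the tool name here is an assumption and should match whatever your support agent actually exposes:

function checkEscalation(events: Event[]): boolean {
  // Assumption: the agent escalates by calling a tool named "escalate_to_human";
  // replace with the actual escalation tool name in your deployment.
  return events.some(event =>
    event.getFunctionCalls().some(fc => fc.name === "escalate_to_human")
  );
}

Comparing this result against scenario.shouldEscalate catches both missed escalations and unnecessary ones.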
Success Metrics
Resolution Metrics:
- Target: 85%+ first-contact resolution for standard issues
- Escalation Rate: Less than 15% for routine inquiries
- Accuracy Rate: 95%+ for factual information
User Experience Metrics:
- Response Clarity: Subjective assessment of explanation quality
- Tone Appropriateness: Professional and empathetic communication
- Completeness: All aspects of inquiry addressed
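One way to implement the analyzeSupportResults aggregator called earlier is to score each run against these targets. A minimal sketch, assuming validateAccuracy returns a 0–1 score:

function analyzeSupportResults(results: Array<{
  resolved: boolean;
  escalated: boolean;
  accuracy: number;
}>) {
  const total = results.length;
  const resolutionRate = results.filter(r => r.resolved && !r.escalated).length / total;
  const escalationRate = results.filter(r => r.escalated).length / total;
  const avgAccuracy = results.reduce((sum, r) => sum + r.accuracy, 0) / total;

  return {
    resolutionRate,
    escalationRate,
    avgAccuracy,
    // Pass/fail against the targets listed above
    meetsTargets: resolutionRate >= 0.85 && escalationRate < 0.15 && avgAccuracy >= 0.95,
  };
}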
Task Automation Agents
Task automation agents require evaluation focused on workflow completion, integration quality, and error handling.
Key Evaluation Areas
Workflow Completion:
- End-to-End Success: Complete task execution without intervention
- Step Accuracy: Correct execution of individual workflow steps
- State Management: Proper handling of workflow state
- Dependency Handling: Managing task dependencies and prerequisites
Integration Quality:
- API Usage: Proper integration with external systems
- Data Transformation: Accurate processing and formatting
- Error Handling: Graceful handling of system failures
- Authentication: Proper credential and access management
Efficiency Metrics:
- Execution Time: Speed of task completion
- Resource Usage: Optimal use of system resources
- Retry Logic: Intelligent handling of transient failures
- Parallel Processing: Effective use of concurrent operations
Evaluation Implementation
interface AutomationTask {
  name: string;
  steps: TaskStep[];
  expectedDuration: number; // expected execution time in milliseconds
  requiredTools: string[];
  successCriteria: SuccessCriteria;
}

interface TaskStep {
  action: string;
  expectedTool: string;
  expectedData: any;
  dependencies: string[];
}

async function evaluateAutomationAgent(
  agent: LlmAgent,
  automationTasks: AutomationTask[]
) {
  const results = [];
  for (const task of automationTasks) {
    const startTime = Date.now();
    try {
      const events = await executeAutomationTask(agent, task);
      const executionTime = Date.now() - startTime;

      const evaluation = {
        taskName: task.name,
        completed: checkTaskCompletion(events, task),
        accuracy: validateStepAccuracy(events, task.steps),
        // Within 120% of the expected duration, matching the target below
        efficiency: executionTime <= task.expectedDuration * 1.2,
        toolUsage: analyzeToolUsage(events, task.requiredTools),
        errorHandling: evaluateErrorRecovery(events),
        executionTime,
      };
      results.push(evaluation);
    } catch (error) {
      results.push({
        taskName: task.name,
        completed: false,
        error: error instanceof Error ? error.message : String(error),
        executionTime: Date.now() - startTime,
      });
    }
  }
  return analyzeAutomationResults(results);
}
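The executeAutomationTask helper is assumed above. A minimal version mirrors the runner setup from the customer support example; how the task is presented to the agent (here, a plain-text instruction built from the task name) is an assumption to adapt to your prompting scheme:

async function executeAutomationTask(
  agent: LlmAgent,
  task: AutomationTask
): Promise<Event[]> {
  const sessionService = new InMemorySessionService();
  const session = await sessionService.createSession("automation_app", "automation_user");
  const runner = new Runner({ appName: "automation_app", agent, sessionService });

  const events: Event[] = [];
  for await (const event of runner.runAsync({
    userId: "automation_user",
    sessionId: session.id,
    // Assumption: the agent accepts a plain-text task instruction
    newMessage: { role: "user", parts: [{ text: `Execute task: ${task.name}` }] },
  })) {
    events.push(event);
  }
  return events;
}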
Success Metrics
Completion Metrics:
- Target: 95%+ successful task completion
- Efficiency: Within 120% of expected execution time
- Error Recovery: 90%+ recovery from transient failures
Quality Metrics:
- Step Accuracy: 98%+ correct step execution
- Data Integrity: 100% accurate data transformation
- Integration: Zero authentication or API errors
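Measuring the error-recovery target means detecting retries in the event stream. A rough sketch of the evaluateErrorRecovery helper used earlier, under two labeled assumptions: that the Event API exposes getFunctionResponses() alongside getFunctionCalls(), and that failed tool invocations carry an error field in their response payload:

function evaluateErrorRecovery(events: Event[]) {
  // Assumption: failed tool calls surface as function responses with an
  // `error` field — adapt this predicate to how your tools report failures.
  const responses = events.flatMap(e => e.getFunctionResponses());
  const failed = responses.filter(r => (r.response as any)?.error !== undefined);

  // A failure counts as recovered if the same tool also succeeded
  // (event ordering is ignored here for brevity)
  const recovered = failed.filter(f =>
    responses.some(r => r.name === f.name && (r.response as any)?.error === undefined)
  );

  return {
    failures: failed.length,
    recovered: recovered.length,
    recoveryRate: failed.length > 0 ? recovered.length / failed.length : 1,
  };
}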
Information Retrieval Agents
Information retrieval agents need evaluation focused on search accuracy, source quality, and information synthesis.
Key Evaluation Areas
Search Effectiveness:
- Query Understanding: Interpretation of user information needs
- Search Strategy: Effective use of search tools and techniques
- Result Relevance: Quality of retrieved information
- Coverage: Comprehensive information gathering
Source Quality:
- Reliability: Use of trustworthy information sources
- Currency: Access to up-to-date information
- Attribution: Proper citation and source tracking
- Diversity: Multiple perspectives and sources
Information Synthesis:
- Accuracy: Correct interpretation of source material
- Coherence: Logical organization of information
- Completeness: Comprehensive coverage of the topic
- Bias Awareness: Recognition and mitigation of source bias
Evaluation Implementation
interface InformationQuery {
  query: string;
  domain: string;
  expectedSources: string[];
  requiredInformation: string[];
  qualityThreshold: number;
}

async function evaluateRetrievalAgent(
  agent: LlmAgent,
  queries: InformationQuery[]
) {
  const results = [];
  for (const query of queries) {
    const events = await executeInformationQuery(agent, query.query);

    const evaluation = {
      query: query.query,
      domain: query.domain,
      searchAccuracy: evaluateSearchAccuracy(events, query),
      sourceQuality: assessSourceQuality(events, query.expectedSources),
      informationCompleteness: checkInformationCoverage(events, query.requiredInformation),
      synthesisQuality: evaluateSynthesis(events),
      citations: validateCitations(events),
    };
    results.push(evaluation);
  }
  return analyzeRetrievalResults(results);
}

function evaluateSearchAccuracy(events: Event[], query: InformationQuery) {
  const searchEvents = events.filter(e =>
    e.getFunctionCalls().some(fc => fc.name.includes('search'))
  );
  const relevantResults = searchEvents.filter(e =>
    containsRelevantInformation(e, query.requiredInformation)
  );

  // Guard against division by zero when no search calls were made
  const precision = searchEvents.length > 0
    ? relevantResults.length / searchEvents.length
    : 0;
  const recall = calculateRecall(relevantResults, query.requiredInformation);

  return {
    precision,
    recall,
    f1Score: calculateF1Score(precision, recall),
  };
}
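The recall and F1 helpers follow the standard information-retrieval definitions. A sketch, assuming containsRelevantInformation also works when passed a single required item:

function calculateRecall(
  relevantResults: Event[],
  requiredInformation: string[]
): number {
  // Fraction of required items covered by at least one retrieved result
  const found = requiredInformation.filter(item =>
    relevantResults.some(event => containsRelevantInformation(event, [item]))
  );
  return requiredInformation.length > 0
    ? found.length / requiredInformation.length
    : 1;
}

function calculateF1Score(precision: number, recall: number): number {
  // Harmonic mean of precision and recall
  return precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
}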
Success Metrics
Accuracy Metrics:
- Search Precision: 80%+ relevant results
- Information Recall: 90%+ required information found
- Source Reliability: 95%+ authoritative sources
Quality Metrics:
- Citation Accuracy: 100% proper attribution
- Synthesis Coherence: Subjective assessment of organization
- Bias Mitigation: Recognition of perspective limitations
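The validateCitations check called in the evaluation loop can start out very simple: verify that each textual response carries at least one source reference. The inline-URL convention here is an assumption; adapt the pattern to however your agent formats citations:

function validateCitations(events: Event[]) {
  const textEvents = events.filter(e =>
    e.content?.parts?.some(p => p.text !== undefined)
  );
  // Assumption: sources are cited inline as URLs in the response text
  const cited = textEvents.filter(e =>
    e.content!.parts!.some(p => /https?:\/\/\S+/.test(p.text ?? ""))
  );
  return {
    totalResponses: textEvents.length,
    citedResponses: cited.length,
    citationRate: textEvents.length > 0 ? cited.length / textEvents.length : 0,
  };
}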
Multi-Agent System Evaluation
Multi-agent systems require evaluation of coordination, communication, and collective performance.
Key Evaluation Areas
Agent Coordination:
- Task Distribution: Effective allocation of work across agents
- Communication: Clear inter-agent information exchange
- Synchronization: Proper timing of collaborative activities
- Conflict Resolution: Handling of contradictory agent outputs
System Performance:
- Collective Accuracy: Combined agent performance versus individual agents
- Efficiency Gains: Benefits of parallelization and specialization
- Redundancy Management: Optimal use of multiple agents
- Fault Tolerance: System resilience to individual agent failures
Evaluation Implementation
async function evaluateMultiAgentSystem(
  agents: BaseAgent[],
  collaborativeScenarios: CollaborativeScenario[]
) {
  // Create the coordinating agent once, outside the loop,
  // so the same sub-agent hierarchy is reused across scenarios
  const coordinator = new SequentialAgent({
    name: "coordinator",
    description: "Coordinates multiple specialized agents",
    subAgents: agents,
  });

  const results = [];
  for (const scenario of collaborativeScenarios) {
    const events = await executeCollaborativeTask(coordinator, scenario);

    const evaluation = {
      scenario: scenario.name,
      overallSuccess: checkCollaborativeSuccess(events, scenario),
      agentContributions: analyzeAgentContributions(events, agents),
      coordination: evaluateCoordination(events),
      efficiency: compareToBaselinePerformance(events, scenario),
      conflictResolution: assessConflictHandling(events),
    };
    results.push(evaluation);
  }
  return analyzeMultiAgentResults(results);
}
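The analyzeAgentContributions helper can lean on the author field that every event carries, counting how much of the event stream each sub-agent produced:

function analyzeAgentContributions(events: Event[], agents: BaseAgent[]) {
  const contributions = new Map<string, number>();
  for (const agent of agents) {
    contributions.set(agent.name, 0);
  }
  // Each event records which agent authored it
  for (const event of events) {
    if (contributions.has(event.author)) {
      contributions.set(event.author, contributions.get(event.author)! + 1);
    }
  }
  return Object.fromEntries(contributions);
}

Event counts are a crude proxy for contribution; weighting by tool calls or tokens produced gives a finer-grained picture.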
Production Evaluation Patterns
Continuous Monitoring
Real-time Quality Assessment:
class ProductionEvaluationMonitor {
  constructor(
    private qualityThresholds: QualityThresholds,
    private alertCallbacks: ((issue: QualityIssue) => void)[] = []
  ) {}

  async monitorAgentInteraction(
    agentResponse: Event,
    userFeedback?: UserFeedback
  ) {
    const quality = await this.assessResponseQuality(agentResponse);

    if (quality.overall < this.qualityThresholds.overall) {
      await this.triggerQualityAlert({
        type: 'quality_degradation',
        agent: agentResponse.author,
        score: quality.overall,
        threshold: this.qualityThresholds.overall,
        timestamp: new Date(),
      });
    }

    if (userFeedback) {
      await this.incorporateUserFeedback(agentResponse, userFeedback);
    }
  }

  private async triggerQualityAlert(issue: QualityIssue) {
    // Notify every registered alert handler
    for (const callback of this.alertCallbacks) {
      callback(issue);
    }
  }

  private async assessResponseQuality(response: Event) {
    const relevance = await this.calculateRelevance(response);
    const accuracy = await this.validateAccuracy(response);
    const helpfulness = await this.assessHelpfulness(response);
    return {
      relevance,
      accuracy,
      helpfulness,
      // Simple composite: equal-weighted average of the three scores
      overall: (relevance + accuracy + helpfulness) / 3,
    };
  }
}
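Wiring the monitor into a live system might look like the following; the threshold value and alert handler are illustrative:

const monitor = new ProductionEvaluationMonitor(
  { overall: 0.7 },  // illustrative quality threshold
  [issue => console.warn("Quality alert:", issue)]
);

// After collecting an agent's events for a live interaction:
for (const event of events) {
  await monitor.monitorAgentInteraction(event);
}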
A/B Testing
Comparative Agent Evaluation:
async function runAgentABTest(
  agentA: LlmAgent,
  agentB: LlmAgent,
  testScenarios: TestScenario[],
  trafficSplit: number = 0.5
) {
  const resultsA = [];
  const resultsB = [];

  for (const scenario of testScenarios) {
    // Randomly assign each scenario to one of the two agents
    const useAgentA = Math.random() < trafficSplit;
    const agent = useAgentA ? agentA : agentB;
    const result = await evaluateScenario(agent, scenario);

    if (useAgentA) {
      resultsA.push(result);
    } else {
      resultsB.push(result);
    }
  }

  return {
    agentA: calculateAggregateMetrics(resultsA),
    agentB: calculateAggregateMetrics(resultsB),
    significance: calculateStatisticalSignificance(resultsA, resultsB),
    recommendation: determineWinningAgent(resultsA, resultsB),
  };
}
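The significance check can be as simple as a two-proportion z-test on success rates. This sketch assumes each scenario result exposes a boolean success field:

function calculateStatisticalSignificance(
  resultsA: Array<{ success: boolean }>,
  resultsB: Array<{ success: boolean }>
) {
  const n1 = resultsA.length;
  const n2 = resultsB.length;
  const p1 = resultsA.filter(r => r.success).length / n1;
  const p2 = resultsB.filter(r => r.success).length / n2;

  // Pooled proportion and standard error for the two-proportion z-test
  const pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  const zScore = standardError > 0 ? (p1 - p2) / standardError : 0;

  // |z| > 1.96 corresponds to p < 0.05, two-tailed
  return { zScore, significant: Math.abs(zScore) > 1.96 };
}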
Best Practices
Evaluation Design Principles
Domain Alignment:
- Relevant Metrics: Choose metrics that matter for your specific use case
- Realistic Scenarios: Test with scenarios that reflect real usage
- Stakeholder Input: Include business requirements in evaluation criteria
- User Perspective: Prioritize metrics that impact user experience
Comprehensive Coverage:
- Happy Path: Test normal, expected interactions
- Edge Cases: Include unusual and challenging scenarios
- Error Conditions: Test resilience to failures and invalid inputs
- Scale Testing: Evaluate performance under various load conditions
Implementation Guidelines
Automation Balance:
- Quick Feedback: Automated tests for rapid development iteration
- Human Judgment: Manual evaluation for subjective quality aspects
- Hybrid Approach: Combine automated metrics with human validation
- Progressive Enhancement: Start simple, add complexity gradually
Data Management:
- Test Data Quality: Ensure test scenarios are accurate and representative
- Version Control: Track evaluation data and criteria changes
- Reproducibility: Maintain consistent testing environments
- Privacy Compliance: Handle sensitive data appropriately in testing
Iterative Improvement
Start with basic evaluation patterns for your domain, then gradually add sophistication as you learn what metrics best predict real-world performance.
Integration Strategies
Development Workflow:
- CI/CD Integration: Automated evaluation in deployment pipelines
- Feature Development: Evaluation-driven feature development
- Release Gates: Quality thresholds for production deployment (see the gate sketch below)
- Performance Tracking: Continuous monitoring of key metrics
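As a concrete example of a release gate, a CI step can run the evaluation suite and fail the build when aggregate metrics fall below agreed thresholds. The runEvaluationSuite helper and the metric names are hypothetical placeholders for your own evaluation entry point:

async function releaseGate() {
  const metrics = await runEvaluationSuite(); // hypothetical aggregated evaluation run
  const gates = [
    { name: "task completion", value: metrics.completionRate, min: 0.95 },
    { name: "factual accuracy", value: metrics.accuracyRate, min: 0.95 },
  ];

  const failures = gates.filter(g => g.value < g.min);
  if (failures.length > 0) {
    for (const f of failures) {
      console.error(`Gate failed: ${f.name} = ${f.value} (minimum ${f.min})`);
    }
    process.exit(1); // non-zero exit fails the CI pipeline
  }
}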
Team Collaboration:
- Shared Understanding: Clear communication of evaluation criteria
- Regular Review: Periodic assessment of evaluation effectiveness
- Knowledge Sharing: Document lessons learned and best practices
- Stakeholder Engagement: Regular updates on agent performance
The key to successful agent evaluation is choosing the right combination of metrics and approaches for your specific domain while maintaining focus on real-world performance and user satisfaction.