
Testing Agents

Current approaches and future automated testing methods

Agent testing means validating behavior across multiple dimensions, from tool usage to response quality. Currently, @iqai/adk supports manual testing approaches; an automated evaluation framework is coming soon.

Current Testing Approaches

Manual Testing with Examples

The most immediate way to test agents is through the provided examples and custom test scripts:

import { LlmAgent, Runner, InMemorySessionService, type Event } from '@iqai/adk';

async function testAgent() {
  const agent = new LlmAgent({
    name: "test_agent",
    model: "gemini-2.5-flash",
    description: "Agent for testing",
    instruction: "You are a helpful assistant",
  });

  const sessionService = new InMemorySessionService();
  const session = await sessionService.createSession("test_app", "test_user");

  const runner = new Runner({
    appName: "test_app",
    agent,
    sessionService,
  });

  // Test specific scenarios
  await testScenario(runner, session.id, "What is 2 + 2?");
  await testScenario(runner, session.id, "Help me plan a meeting");
}

async function testScenario(runner: Runner, sessionId: string, query: string) {
  console.log(`Testing: ${query}`);

  const events: Event[] = [];
  for await (const event of runner.runAsync({
    userId: "test_user",
    sessionId,
    newMessage: { parts: [{ text: query }] },
  })) {
    events.push(event);
    if (event.content?.parts && !event.partial) {
      console.log(`Response: ${event.content.parts.map(p => p.text ?? '').join('')}`);
    }
  }

  // Manual validation
  validateResponse(events, query);
  return events;
}

function validateResponse(events: Event[], query: string) {
  // Custom validation logic
  const finalEvents = events.filter(e => e.isFinalResponse());
  console.log(`Generated ${finalEvents.length} final responses`);

  // Check for specific criteria
  if (query.includes("plan")) {
    // Verify planning behavior
  }

  if (query.includes("calculate")) {
    // Verify calculation accuracy
  }
}

Session-Based Testing

Use session services to test multi-turn conversations:

// Reuses the runner and its session service from testAgent so session lookups match
async function testConversationFlow(runner: Runner, sessionService: InMemorySessionService) {
  const session = await sessionService.createSession("test_app", "test_user");

  const testTurns = [
    "Hello, I need help with my project",
    "It's a web development project using React",
    "What are the best practices for state management?",
    "Can you provide specific examples?"
  ];

  for (const turn of testTurns) {
    console.log(`\n--- Turn: ${turn} ---`);
    await testScenario(runner, session.id, turn);

    // Analyze session state
    const updatedSession = await sessionService.getSession("test_app", "test_user", session.id);
    console.log(`Session has ${updatedSession?.events?.length || 0} events`);
  }
}

Tool Usage Validation

Test agent tool interactions by monitoring function calls:

function validateToolUsage(events: Event[], expectedTools: string[]) {
  const functionCalls = events
    .flatMap(e => e.getFunctionCalls())
    .map(fc => ({ name: fc.name, args: fc.args }));

  console.log('Function calls made:', functionCalls);

  // Validate expected tool usage
  const usedTools = functionCalls.map(fc => fc.name);

  for (const tool of expectedTools) {
    if (usedTools.includes(tool)) {
      console.log(`✅ Expected tool '${tool}' was used`);
    } else {
      console.log(`❌ Expected tool '${tool}' was NOT used`);
    }
  }
}
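
For example, from inside testAgent above, where runner and session are in scope ("calculator_tool" is a placeholder for whatever tool your agent actually registers):

const events = await testScenario(runner, session.id, "What is 2 + 2?");
validateToolUsage(events, ["calculator_tool"]);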

Testing Strategies

Unit Testing Patterns

Test individual agent capabilities in isolation:

describe('Agent Capabilities', () => {
  test('should handle basic calculations', async () => {
    const response = await runSingleQuery(agent, "What is 15 * 7?");
    expect(response).toContain('105');
  });

  test('should use search tool for information queries', async () => {
    const events = await runAndCollectEvents(agent, "What is the weather in Paris?");
    const toolCalls = events.flatMap(e => e.getFunctionCalls());
    expect(toolCalls.some(fc => fc.name === 'search_tool')).toBe(true);
  });

  test('should maintain context across turns', async () => {
    const session = await createTestSession(agent);
    await runQuery(session, "My name is Alice");
    const response = await runQuery(session, "What is my name?");
    expect(response.toLowerCase()).toContain('alice');
  });
});
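
The helpers used above (runSingleQuery, runAndCollectEvents, createTestSession, runQuery) are not part of @iqai/adk; they are hypothetical names you would define yourself. One way to sketch them on top of the Runner API from the earlier examples:

// Hypothetical test helpers, not part of @iqai/adk.
// Standalone queries get a fresh session; createTestSession/runQuery
// share one session so multi-turn context is preserved.
interface TestSession {
  runner: Runner;
  sessionId: string;
}

async function createTestSession(agent: LlmAgent): Promise<TestSession> {
  const sessionService = new InMemorySessionService();
  const session = await sessionService.createSession("test_app", "test_user");
  const runner = new Runner({ appName: "test_app", agent, sessionService });
  return { runner, sessionId: session.id };
}

async function runQueryEvents(ts: TestSession, query: string): Promise<Event[]> {
  const events: Event[] = [];
  for await (const event of ts.runner.runAsync({
    userId: "test_user",
    sessionId: ts.sessionId,
    newMessage: { parts: [{ text: query }] },
  })) {
    events.push(event);
  }
  return events;
}

async function runQuery(ts: TestSession, query: string): Promise<string> {
  const events = await runQueryEvents(ts, query);
  // Join the text parts of the final response events
  return events
    .filter(e => e.isFinalResponse())
    .flatMap(e => e.content?.parts ?? [])
    .map(p => p.text ?? "")
    .join("");
}

async function runAndCollectEvents(agent: LlmAgent, query: string): Promise<Event[]> {
  return runQueryEvents(await createTestSession(agent), query);
}

async function runSingleQuery(agent: LlmAgent, query: string): Promise<string> {
  return runQuery(await createTestSession(agent), query);
}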

Integration Testing

Test complete workflows and agent interactions:

async function testWorkflowIntegration(agent: LlmAgent) {
  // Test end-to-end workflow
  const workflow = [
    { query: "I need to book a flight", expectedTools: ['search_flights'] },
    { query: "From New York to London on March 15", expectedTools: ['book_flight'] },
    { query: "Send confirmation to my email", expectedTools: ['send_email'] }
  ];

  // Share one session across steps so later turns can reference earlier ones
  const session = await createTestSession(agent);
  for (const step of workflow) {
    const events = await runQueryEvents(session, step.query);
    validateToolUsage(events, step.expectedTools);
  }
}

Performance Testing

Monitor response times and resource usage:

async function testPerformance(agent: LlmAgent) {
  const queries = [
    "Simple question",
    "Complex multi-step query with calculations",
    "Query requiring multiple tool calls"
  ];

  for (const query of queries) {
    const startTime = Date.now();
    await runSingleQuery(agent, query);
    const duration = Date.now() - startTime;

    console.log(`Query: "${query}" took ${duration}ms`);

    // Set performance expectations
    if (duration > 10000) {
      console.warn(`⚠️  Query took longer than expected: ${duration}ms`);
    }
  }
}
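
Single-run timings are noisy because model latency varies from call to call. A minimal sketch that repeats each query (using the hypothetical runSingleQuery helper from above) and reports average and worst-case latency:

async function timeQuery(agent: LlmAgent, query: string, runs = 5) {
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await runSingleQuery(agent, query); // fresh session per run
    durations.push(Date.now() - start);
  }
  const avg = durations.reduce((a, b) => a + b, 0) / durations.length;
  const worst = Math.max(...durations);
  console.log(`"${query}": avg ${avg.toFixed(0)}ms, worst ${worst}ms over ${runs} runs`);
}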

Coming Soon: Automated Evaluation

Future Framework

A comprehensive automated evaluation framework is under development and will provide structured testing capabilities.

Unit Testing with Test Files

Rapid evaluation during active development:

Planned Features:

  • Individual JSON test files with expected behaviors
  • Fast execution for quick feedback loops
  • Integration with development workflows
  • Automated regression detection

Test File Structure (Coming Soon):

{
  "name": "basic_calculation_test",
  "agent_config": {
    "name": "calculator_agent",
    "tools": ["calculator_tool"]
  },
  "test_cases": [
    {
      "user_query": "What is 42 + 17?",
      "expected_tools": ["calculator_tool"],
      "expected_response_contains": ["59"],
      "max_response_time_ms": 5000
    }
  ]
}
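
Until the framework ships, files in this shape can already be driven through the manual harness. A minimal sketch, assuming the JSON structure above and the hypothetical helpers defined earlier (the field names mirror the planned format, not a published API):

interface TestCase {
  user_query: string;
  expected_tools: string[];
  expected_response_contains: string[];
  max_response_time_ms: number;
}

async function runTestFile(agent: LlmAgent, testCases: TestCase[]) {
  for (const tc of testCases) {
    const start = Date.now();
    const events = await runAndCollectEvents(agent, tc.user_query);
    const duration = Date.now() - start;

    validateToolUsage(events, tc.expected_tools);

    const response = events
      .filter(e => e.isFinalResponse())
      .flatMap(e => e.content?.parts ?? [])
      .map(p => p.text ?? "")
      .join("");
    for (const expected of tc.expected_response_contains) {
      console.log(
        response.includes(expected)
          ? `✅ Response contains "${expected}"`
          : `❌ Response missing "${expected}"`
      );
    }

    if (duration > tc.max_response_time_ms) {
      console.warn(`⚠️  Exceeded time budget: ${duration}ms > ${tc.max_response_time_ms}ms`);
    }
  }
}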

Integration Testing with EvalSets

Comprehensive evaluation of complex agent behaviors:

Planned Features:

  • Multi-turn conversation scenarios
  • Complex workflow validation
  • Batch processing of evaluation scenarios
  • Statistical analysis of results

EvalSet Structure (Coming Soon):

{
  "evalset_name": "customer_support_scenarios",
  "sessions": [
    {
      "initial_state": { "user_type": "premium" },
      "conversation_turns": [
        {
          "user_message": "I have an issue with my account",
          "expected_agent_behavior": {
            "should_escalate": false,
            "should_use_tools": ["account_lookup"],
            "response_tone": "helpful"
          }
        }
      ]
    }
  ]
}
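
For reference, the same structure expressed as TypeScript types (an illustrative typing of the planned JSON, not a shipped API):

interface EvalSet {
  evalset_name: string;
  sessions: Array<{
    initial_state: Record<string, unknown>;
    conversation_turns: Array<{
      user_message: string;
      expected_agent_behavior: {
        should_escalate: boolean;
        should_use_tools: string[];
        response_tone: string;
      };
    }>;
  }>;
}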

Automated Evaluation Classes

AgentEvaluator (Coming Soon):

  • Comprehensive agent performance assessment
  • Configurable evaluation criteria
  • Integration with CI/CD pipelines
  • Performance trend analysis

TrajectoryEvaluator (Coming Soon; see the hand-rolled sketch below):

  • Tool usage pattern analysis
  • Decision path optimization
  • Efficiency measurement
  • Strategy consistency validation

ResponseEvaluator (Coming Soon):

  • Semantic similarity scoring
  • Quality assessment metrics
  • Reference comparison
  • Automated scoring algorithms
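
Until these classes ship, the core of trajectory evaluation can be hand-rolled: compare the sequence of tool calls the agent actually made against a reference sequence. A minimal sketch:

// Fraction of reference tool calls matched in order (1.0 = exact trajectory match)
function trajectoryScore(events: Event[], referenceTools: string[]): number {
  const actual = events.flatMap(e => e.getFunctionCalls()).map(fc => fc.name);
  let matched = 0;
  for (let i = 0; i < Math.min(actual.length, referenceTools.length); i++) {
    if (actual[i] === referenceTools[i]) matched++;
  }
  return referenceTools.length === 0 ? 1 : matched / referenceTools.length;
}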

Best Practices

Test Design

Representative Scenarios:

  • Cover typical user journeys
  • Include edge cases and error conditions
  • Test various complexity levels
  • Validate cross-functional scenarios

Clear Success Criteria:

  • Define specific, measurable expectations
  • Set appropriate performance thresholds
  • Balance automation with human judgment
  • Document evaluation rationale

Test Maintenance

Version Control:

  • Track test scenarios and expected outcomes
  • Maintain test data consistency
  • Document evaluation methodology
  • Review and update tests regularly

Continuous Improvement:

  • Analyze test failures systematically
  • Update tests based on production feedback
  • Expand coverage as agents evolve
  • Balance comprehensive testing with execution speed

Integration Guidelines

Development Workflow:

  • Run quick tests during development
  • Run comprehensive tests before deployment
  • Automate regression testing
  • Monitor performance in production

Team Collaboration:

  • Share test scenarios across team members
  • Document testing conventions
  • Review the testing strategy regularly
  • Define clear escalation procedures for test failures

Getting Started

Start with manual testing using the provided examples, then gradually build custom test scripts as your agent's functionality grows. This gives you immediate validation while preparing you for the upcoming automated framework.