Privacy controls, performance tuning, best practices, and troubleshooting for observability

This guide covers production-ready configuration for observability in ADK-TS, including privacy controls, performance tuning, best practices, and troubleshooting.

Production Configuration

Here's a complete production-ready configuration:

import { telemetryService } from "@iqai/adk";

await telemetryService.initialize({
  // Required
  appName: process.env.OTEL_SERVICE_NAME || "my-agent-app",
  otlpEndpoint:
    process.env.OTEL_EXPORTER_OTLP_ENDPOINT ||
    "https://your-backend.com/v1/traces",

  // Environment
  environment: process.env.NODE_ENV || "production",
  appVersion: process.env.APP_VERSION || "1.0.0",

  // Authentication
  otlpHeaders: {
    "api-key": process.env.OTEL_API_KEY || "",
  },

  // Feature flags
  enableTracing: true,
  enableMetrics: true,
  enableAutoInstrumentation: false,

  // Privacy and performance
  captureMessageContent: false,
  samplingRatio: 0.1,
  metricExportIntervalMs: 300000,

  // Custom attributes
  resourceAttributes: {
    "deployment.name": process.env.DEPLOYMENT_NAME || "production",
    team: process.env.TEAM_NAME || "platform",
  },
});

Environment Variables

# Service identification
export OTEL_SERVICE_NAME=my-agent-app
export APP_VERSION=1.0.0

# Privacy
export ADK_CAPTURE_MESSAGE_CONTENT=false

# Performance
export OTEL_SAMPLING_RATIO=0.1
export METRIC_EXPORT_INTERVAL_MS=300000

# OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend.com/v1/traces
export OTEL_API_KEY=your-api-key

# Environment
export NODE_ENV=production
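
The configuration above hardcodes samplingRatio and metricExportIntervalMs, while this block exports OTEL_SAMPLING_RATIO and METRIC_EXPORT_INTERVAL_MS. This guide does not state whether ADK-TS reads those two variables on its own, so the following is a minimal sketch that assumes you pass them through yourself. The helper names (parseRatio, parseIntervalMs, performanceSettings) are hypothetical and not part of @iqai/adk.

```typescript
type Env = Record<string, string | undefined>;

// Parse a sampling ratio, falling back when the value is missing or outside 0.0-1.0.
function parseRatio(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isFinite(value) && value >= 0 && value <= 1 ? value : fallback;
}

// Parse an interval in milliseconds, falling back when it is not a positive integer.
function parseIntervalMs(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Hypothetical helper: builds the two performance settings from an environment map.
function performanceSettings(env: Env) {
  return {
    samplingRatio: parseRatio(env["OTEL_SAMPLING_RATIO"], 0.1),
    metricExportIntervalMs: parseIntervalMs(env["METRIC_EXPORT_INTERVAL_MS"], 300000),
  };
}
```

In application code you would call performanceSettings(process.env) and spread the result into the object passed to telemetryService.initialize.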

Configuration Reference

Setting	Development	Production	Description
`captureMessageContent`	`true`	`false`	Capture LLM prompts/completions and tool arguments
`samplingRatio`	`1.0`	`0.1`–`0.2`	Fraction of traces to sample (0.0–1.0, where 1.0 is 100%)
`metricExportIntervalMs`	`60000`	`300000`–`600000`	Metric export interval in milliseconds
`enableAutoInstrumentation`	`false`	`false`	HTTP/database auto-tracing (enable only if needed)
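
One way to keep these settings consistent is to select them from the environment name in a single place. The sketch below encodes the table values, using the lower end of each production range. The function name tuningFor and the interface are hypothetical. Treating every environment other than "development" as production is an assumption chosen so that unknown environments get the privacy-preserving defaults.

```typescript
interface TelemetryTuning {
  captureMessageContent: boolean;
  samplingRatio: number;
  metricExportIntervalMs: number;
  enableAutoInstrumentation: boolean;
}

// Values come from the configuration reference table; anything that is not
// "development" gets the production column.
function tuningFor(environment: string): TelemetryTuning {
  const isDevelopment = environment === "development";
  return {
    captureMessageContent: isDevelopment,
    samplingRatio: isDevelopment ? 1.0 : 0.1,
    metricExportIntervalMs: isDevelopment ? 60000 : 300000,
    enableAutoInstrumentation: false,
  };
}
```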

Privacy in Production

Always set captureMessageContent: false in production to protect user data and comply with privacy regulations.

Best Practices

Initialize Before Agent Operations

Always initialize telemetry before creating or running agents:

// ✅ Correct order
await telemetryService.initialize({
  /* config */
});
const agent = AgentBuilder.withModel("gemini-2.5-flash").build();

// ❌ Wrong - traces may be missed
const agent = AgentBuilder.withModel("gemini-2.5-flash").build();
await telemetryService.initialize({
  /* config */
});

Disable Content Capture

Protect user privacy by disabling content capture in production:

export ADK_CAPTURE_MESSAGE_CONTENT=false

await telemetryService.initialize({
  captureMessageContent: false,
});

When disabled, tool arguments, tool responses, LLM prompts, and completions are not recorded. Metadata like model name, token counts, and duration are still captured.

Use Environment Variables for Secrets

Never hardcode API keys or endpoints:

// ✅ Good - use environment variables
otlpHeaders: {
  "api-key": process.env.OTEL_API_KEY,
},

// ❌ Bad - hardcoded secret
otlpHeaders: {
  "api-key": "sk-abc123...",
},
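
Reading a secret from the environment still leaves the case where the variable is unset, which may only surface later as rejected exports. A minimal sketch of a fail-fast lookup follows; requireEnv is a hypothetical helper, not an ADK-TS API, and it takes the environment map as a parameter so it can be exercised without touching process.env.

```typescript
type Env = Record<string, string | undefined>;

// Return the variable's value, or throw so a missing secret is caught at startup.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

For example, otlpHeaders could then be built with requireEnv(process.env, "OTEL_API_KEY") instead of falling back to an empty string.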

Implement Graceful Shutdown

Always shut down telemetry before the process exits so that pending data is flushed:

process.on("SIGTERM", async () => {
  await telemetryService.shutdown(5000);
  process.exit(0);
});

process.on("SIGINT", async () => {
  await telemetryService.shutdown(5000);
  process.exit(0);
});

The timeout (5000 ms) bounds how long shutdown waits for pending telemetry to flush, so a slow backend cannot delay process exit indefinitely. Adjust it based on your network conditions.
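
The two handlers above repeat the same logic, and a second signal arriving during the flush would call shutdown again. The following is a minimal sketch of a shared handler that starts shutdown only once. It assumes nothing about telemetryService beyond what is shown above (a shutdown function that takes a timeout and returns a promise); createShutdownHandler is a hypothetical name.

```typescript
type ShutdownFn = (timeoutMs: number) => Promise<void>;

// Build a handler that starts shutdown once and reuses the same promise afterwards.
function createShutdownHandler(
  shutdown: ShutdownFn,
  timeoutMs: number,
  exit: (code: number) => void,
): () => Promise<void> {
  let inFlight: Promise<void> | undefined;
  return () => {
    if (inFlight === undefined) {
      inFlight = shutdown(timeoutMs).then(
        () => exit(0), // shutdown resolved
        () => exit(1), // shutdown rejected: exit with a non-zero code
      );
    }
    return inFlight;
  };
}

// Possible wiring in an application (not executed here):
//   const onSignal = createShutdownHandler(
//     (ms) => telemetryService.shutdown(ms), 5000, (code) => process.exit(code));
//   process.on("SIGTERM", onSignal);
//   process.on("SIGINT", onSignal);
```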

Optimize Sampling for High Traffic

Reduce the sampling ratio to lower telemetry overhead under high traffic:

await telemetryService.initialize({
  samplingRatio: 0.1, // Sample 10% of traces
});

Recommended sampling ratios:

Development: 1.0 (100%)
Staging: 0.5 (50%)
Production: 0.1–0.2 (10-20%)
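
To make the ratio concrete, the sketch below shows how a ratio-based sampler can decide deterministically from the trace ID, so that every span in a trace gets the same decision. It is a simplified illustration of the general technique, not the exact algorithm used by ADK-TS or the OpenTelemetry SDK, and shouldSample is a hypothetical name.

```typescript
// Decide from the last 8 hex characters of a trace ID, read as a 32-bit number.
// IDs whose low 32 bits fall below ratio * 2^32 are kept.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const low32 = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(low32)) return false; // malformed ID: do not sample
  return low32 < ratio * 0x100000000;
}
```

Because the decision depends only on the ID and the ratio, lowering samplingRatio from 1.0 to 0.1 keeps roughly one in ten traces when trace IDs are uniformly random.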

Adjust Metric Export Interval

Export metrics less frequently in production to reduce overhead:

await telemetryService.initialize({
  metricExportIntervalMs: 300000, // Export every 5 minutes
});

Recommended intervals:

Development: 60000 (1 minute)
Production: 300000–600000 (5-10 minutes)

Use HTTPS for OTLP Endpoints

Always use HTTPS in production:

// ✅ Good
otlpEndpoint: "https://your-backend.com/v1/traces",

// ❌ Bad for production
otlpEndpoint: "http://your-backend.com/v1/traces",
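
This rule can be enforced at startup rather than by convention. A minimal sketch, assuming the endpoint and environment name are available as strings; assertSecureEndpoint is a hypothetical helper and is not provided by @iqai/adk.

```typescript
// Throw when a production endpoint does not use HTTPS; other environments are not checked.
function assertSecureEndpoint(endpoint: string, environment: string): void {
  const url = new URL(endpoint); // throws on a malformed URL
  if (environment === "production" && url.protocol !== "https:") {
    throw new Error(`OTLP endpoint must use HTTPS in production: ${endpoint}`);
  }
}
```

Calling it before telemetryService.initialize still allows a plain HTTP endpoint such as a local Jaeger instance during development.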

Follow OpenTelemetry Semantic Conventions

Use standard attribute names for consistency:

import { SEMCONV, ADK_ATTRS } from "@iqai/adk";

span.setAttribute(SEMCONV.GEN_AI_REQUEST_MODEL, "gpt-4");
span.setAttribute(ADK_ATTRS.SESSION_ID, sessionId);

Add Business Context with Custom Attributes

Include custom attributes relevant to your domain:

resourceAttributes: {
  "deployment.name": "production",
  "team": "platform",
  "region": "us-east-1",
  "customer.tier": "enterprise",
},

Record Exceptions in Catch Blocks

Always record exceptions with context:

try {
  await riskyOperation();
} catch (error) {
  telemetryService.recordException(error as Error, {
    "error.context": "data_validation",
    "error.severity": "high",
  });
  throw error;
}
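
The cast error as Error above assumes the thrown value really is an Error, but JavaScript allows throwing strings, plain objects, or anything else. A small sketch of a normalizing helper follows; toError is a hypothetical name, and its result could be passed to recordException in place of the cast.

```typescript
// Return Error instances unchanged; wrap any other thrown value in a new Error.
function toError(value: unknown): Error {
  if (value instanceof Error) return value;
  if (typeof value === "string") return new Error(value);
  try {
    // JSON.stringify returns undefined for some inputs, hence the fallback.
    return new Error(JSON.stringify(value) ?? String(value));
  } catch {
    return new Error(String(value)); // e.g. circular structures or BigInt
  }
}
```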

Create Focused Spans

Create separate spans for distinct operations:

// ✅ Good - focused spans
await telemetryService.withSpan("fetch_data", async () => {
  return await fetchData();
});

await telemetryService.withSpan("process_data", async () => {
  return await processData();
});

// ❌ Bad - one span for everything
await telemetryService.withSpan("do_everything", async () => {
  await fetchData();
  await processData();
  await saveData();
});

Set Up Alerts for Critical Metrics

Create alerts in your observability backend for conditions such as:

Error rate exceeds 5%
95th percentile latency exceeds 2 seconds
Token usage exceeds budget
Sampling rate drops below threshold
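
Alerts are normally defined in the backend's own rule language, but spelling the conditions out in code can make the thresholds unambiguous. The sketch below evaluates the first three conditions over one time window. The WindowStats shape and function names are hypothetical, and the sampling-rate alert is omitted because no threshold is given for it.

```typescript
interface WindowStats {
  requests: number;
  errors: number;
  latenciesMs: number[];
  tokensUsed: number;
  tokenBudget: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1] ?? 0;
}

// Return the names of the alerts whose thresholds are exceeded in this window.
function firingAlerts(stats: WindowStats): string[] {
  const alerts: string[] = [];
  const errorRate = stats.requests === 0 ? 0 : stats.errors / stats.requests;
  if (errorRate > 0.05) alerts.push("error_rate");
  if (percentile(stats.latenciesMs, 95) > 2000) alerts.push("p95_latency");
  if (stats.tokensUsed > stats.tokenBudget) alerts.push("token_budget");
  return alerts;
}
```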

Monitor Data Ingestion Costs

Be aware of data ingestion costs for paid platforms. Use sampling and export intervals to control costs.
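
A rough estimate of export volume helps when choosing these values. The sketch below assumes one trace per request and a constant export interval, which is a simplification, and both function names are hypothetical.

```typescript
// Expected number of traces exported per day at a given sampling ratio.
function sampledTracesPerDay(requestsPerDay: number, samplingRatio: number): number {
  return Math.round(requestsPerDay * samplingRatio);
}

// Number of metric exports per day for a given export interval.
function metricExportsPerDay(metricExportIntervalMs: number): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor(msPerDay / metricExportIntervalMs);
}
```

With one million requests per day, a ratio of 0.1 exports about 100,000 traces, and moving the metric interval from 60000 to 300000 reduces exports from 1440 to 288 per day.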

Configure Data Retention Policies

Set appropriate retention policies in your observability backend:

Data Type	Recommended Retention
Traces	7–30 days
Metrics	30–90 days
Logs	Per compliance requirements

Test Locally Before Deploying

Always test telemetry with Jaeger locally before deploying to production:

# Start Jaeger locally
docker run -d --name jaeger -p 4318:4318 -p 16686:16686 jaegertracing/all-in-one:latest

// Test your agent
await telemetryService.initialize({
  otlpEndpoint: "http://localhost:4318/v1/traces",
});

// View traces at http://localhost:16686

Troubleshooting

No Traces Appearing

Check endpoint URL:

# Verify endpoint is correct and reachable
curl -X POST https://your-backend.com/v1/traces

Verify backend is running:

# For Jaeger
docker ps | grep jaeger

# Check logs
docker logs jaeger

Enable debug logging:

import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

await telemetryService.initialize({
  /* config */
});

Check network connectivity:

# Test connection to OTLP endpoint
nc -zv your-backend.com 4318

High Overhead

If telemetry is causing performance issues:

Reduce sampling ratio:

samplingRatio: 0.05, // Sample only 5% of traces

Disable auto-instrumentation:

enableAutoInstrumentation: false,

Increase export interval:

metricExportIntervalMs: 600000, // Export every 10 minutes

Disable content capture:

captureMessageContent: false,

Content Not Captured

If you need content for debugging but it's not appearing:

Check environment variable:

echo $ADK_CAPTURE_MESSAGE_CONTENT
# Should be 'true' or unset for content capture

Explicitly enable in configuration:

captureMessageContent: true,

Connection Issues

Verify API keys have proper permissions:

# Test API key with curl
curl -H "api-key: $OTEL_API_KEY" https://your-backend.com/v1/traces

Check firewall rules allow outbound connections:

# Check if port is accessible
telnet your-backend.com 4318

Review header format requirements:

Some backends require specific header formats. Check your platform's documentation:

// Datadog example
otlpHeaders: {
  "DD-API-KEY": process.env.DD_API_KEY,
},

// Honeycomb example
otlpHeaders: {
  "x-honeycomb-team": process.env.HONEYCOMB_API_KEY,
  "x-honeycomb-dataset": "my-dataset",
},

Traces Contain Sensitive Data

If you accidentally captured sensitive data:

Immediately disable content capture:

export ADK_CAPTURE_MESSAGE_CONTENT=false

Contact your observability platform to purge affected traces
Review your data retention policies and ensure proper access controls

Metrics Not Appearing

Verify metrics endpoint:

ADK-TS automatically converts trace endpoint to metrics endpoint:

http://localhost:4318/v1/traces → http://localhost:4318/v1/metrics
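
When debugging, it can help to compute the metrics URL yourself and test it directly. The conversion logic inside ADK-TS is not shown in this guide, so the following is only a sketch that assumes a simple path substitution matching the example above; toMetricsEndpoint is a hypothetical name.

```typescript
// Derive the metrics URL by swapping a trailing /v1/traces path for /v1/metrics.
// URLs that do not end in /v1/traces are returned unchanged.
function toMetricsEndpoint(traceEndpoint: string): string {
  const url = new URL(traceEndpoint);
  url.pathname = url.pathname.replace(/\/v1\/traces\/?$/, "/v1/metrics");
  return url.toString();
}
```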

Check backend supports metrics:

Jaeger only supports traces. Use a backend that accepts metrics, such as Datadog, Prometheus, or a Grafana stack with a metrics store (Tempo itself stores only traces).

Verify the export interval has elapsed:

Metrics are exported at intervals. Wait for the configured metricExportIntervalMs before checking.

Production Deployment