Chapter 16: Observability — Logs, Traces, and Metrics
Overview
You’ve built intelligent agents. Now you need to understand what they’re doing in production. Observability — the practice of instrumenting systems to expose their internal state — is what separates prototypes from production systems. Without it, you’re flying blind: guessing at performance bottlenecks, missing critical errors, and unable to optimize costs.
In this chapter, you’ll learn to instrument agents with production-grade observability using claude-php/claude-php-agent’s built-in observability infrastructure. You’ll add structured logging with trace correlation, implement distributed tracing with parent-child spans, collect metrics for dashboards and alerts, and export telemetry over OpenTelemetry to observability platforms such as LangSmith and LangFuse.
In this chapter you’ll:
- Implement structured logging with PSR-3 loggers and automatic context enrichment
- Build distributed tracing systems with spans, trace IDs, and parent-child relationships
- Collect operational metrics for requests, tokens, latency, and errors
- Integrate OpenTelemetry for industry-standard telemetry export
- Connect to external observability platforms (LangSmith, LangFuse, Arize Phoenix)
- Design production monitoring dashboards and alerting systems
- Apply observability best practices for cost, performance, and reliability
Estimated time: ~120 minutes
::: info Framework Version
This chapter is based on claude-php/claude-php-agent v0.5+. All observability features are built into the framework.
:::
::: info Code examples
Complete, runnable examples for this chapter:

- 01-basic-structured-logging.php — PSR-3 logging with agents
- 02-observability-logger.php — Trace-aware structured logging
- 03-distributed-tracing.php — Hierarchical spans and traces
- 04-metrics-collection.php — Request, token, and latency metrics
- 05-metrics-aggregator.php — Advanced metrics with aggregation
- 05-telemetry-service.php — OpenTelemetry-style metrics
- 06-comprehensive-observability.php — Complete observability stack
All files are in code/16-observability-logs-traces-metrics/.
:::
The Three Pillars of Observability
Modern observability is built on three pillars:
| 📝 Logs | 📊 Traces | 📈 Metrics |
|---|---|---|
| What happened? | How did it flow? | How is it doing? |
| Discrete events | Request paths | Aggregated stats |
| Full context | Parent-child | Time series |
| Human-readable | Latency analysis | Alerting |

1. Logs — What Happened
Structured logs record discrete events with context:
- Agent started/completed
- Tool executed successfully/failed
- Error occurred with stack trace
- User action triggered
Key Properties:
- Rich context (user ID, session, trace ID)
- Severity levels (DEBUG, INFO, ERROR)
- Searchable and filterable
- Retained for audit trails
2. Traces — How Did It Flow
Distributed traces show request paths through your system:
- Parent span: Agent execution
- Child span: Tool call
- Grandchild span: API request
Key Properties:
- Unique trace ID across all operations
- Parent-child span relationships
- Timing and duration for each span
- Critical path analysis
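Concretely, each span carries just enough metadata to be stitched back into a tree. A small illustration — the field names below are illustrative, not a framework contract:

```php
// Two spans from the same trace: the tool call is a child of the agent execution
$agentSpan = [
    'trace_id'       => 'abc123',
    'span_id'        => 'span-01',
    'parent_span_id' => null,        // root span
    'name'           => 'agent_execution',
    'duration_ms'    => 150.0,
];

$toolSpan = [
    'trace_id'       => 'abc123',    // same trace ID ties the spans together
    'span_id'        => 'span-02',
    'parent_span_id' => 'span-01',   // points back at the agent execution span
    'name'           => 'tool:calculate',
    'duration_ms'    => 75.0,
];
```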
3. Metrics — How Is It Doing
Metrics are aggregated numerical data over time:
- Request count (counter)
- Active requests (gauge)
- Latency distribution (histogram)
- Token usage (counter)
Key Properties:
- Efficient storage (aggregated)
- Real-time dashboards
- Threshold-based alerting
- Trend analysis
Structured Logging Fundamentals
PSR-3 Logger Integration
The framework supports PSR-3 loggers out of the box:
```php
use ClaudeAgents\Support\LoggerFactory;
use Psr\Log\LogLevel;

// Console logger (development)
$logger = LoggerFactory::createConsole(LogLevel::INFO);

// File logger (production)
$logger = LoggerFactory::createFile('/var/log/agent.log', LogLevel::INFO);

// Memory logger (testing)
$logger = LoggerFactory::createMemory();
```
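Because the logger is plain PSR-3, you can also plug in any existing logger. A minimal sketch with Monolog, assuming monolog/monolog v3 is installed (the channel name and handler are arbitrary choices):

```php
use Monolog\Handler\StreamHandler;
use Monolog\Level;
use Monolog\Logger;

// Any Psr\Log\LoggerInterface implementation works wherever the framework expects a logger
$logger = new Logger('agent');
$logger->pushHandler(new StreamHandler('php://stdout', Level::Info));

$logger->info('Agent created', ['agent_id' => 'math-assistant']);
```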
Logging Agent Operations

Every significant operation should be logged:
```php
use ClaudeAgents\Agent;
use ClaudePhp\ClaudePhp;

$client = new ClaudePhp(apiKey: getenv('ANTHROPIC_API_KEY'));
$logger = LoggerFactory::createConsole(LogLevel::INFO);

$agent = Agent::create($client)
    ->withSystemPrompt('You are a helpful assistant.');

// Log agent creation
$logger->info('Agent created', [
    'agent_id' => 'math-assistant',
    'system_prompt' => 'You are a helpful assistant.',
]);

$startTime = microtime(true);

try {
    $result = $agent->run('What is 25 * 17?');
    $duration = (microtime(true) - $startTime) * 1000;

    $logger->info('Agent execution completed', [
        'agent_id' => 'math-assistant',
        'answer' => $result->getAnswer(),
        'duration_ms' => round($duration, 2),
        'tool_calls' => count($result->getToolCalls()),
    ]);
} catch (\Throwable $e) {
    $duration = (microtime(true) - $startTime) * 1000;

    $logger->error('Agent execution failed', [
        'agent_id' => 'math-assistant',
        'error' => $e->getMessage(),
        'duration_ms' => round($duration, 2),
    ]);
}
```

Key Logging Principles:
✅ Use structured context — Pass arrays, not string interpolation
✅ Include timing — Log duration for performance analysis
✅ Add correlation IDs — Link related operations (user ID, session ID, trace ID)
✅ Log errors with context — Include enough detail to debug
✅ Respect log levels — DEBUG for verbose, INFO for normal, ERROR for failures
❌ Don’t log sensitive data — PII, API keys, passwords
❌ Don’t log excessively — High-volume DEBUG logs hurt performance
❌ Don’t rely on logs alone — Use metrics for aggregation
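The first of these principles is worth seeing side by side — interpolated strings are hard to filter on, while context arrays stay machine-searchable:

```php
// Hard to query: the user ID and timing are buried inside the message string
$logger->info("Agent run finished for user {$userId} in {$duration}ms");

// Easy to query: the message stays constant, the details live in structured context
$logger->info('Agent run finished', [
    'user_id'     => $userId,
    'duration_ms' => $duration,
]);
```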
ObservabilityLogger with Trace Context
The ObservabilityLogger automatically enriches logs with trace context:
```php
use ClaudeAgents\Observability\ObservabilityLogger;
use ClaudeAgents\Observability\Tracer;
use ClaudeAgents\Support\LoggerFactory;

$baseLogger = LoggerFactory::createConsole(LogLevel::INFO);
$tracer = new Tracer();

// Create observability logger with tracer
$logger = new ObservabilityLogger($baseLogger, $tracer);

// Set global context (added to all logs)
$logger->setGlobalContext([
    'service' => 'agent-api',
    'environment' => 'production',
    'version' => '2.0.0',
]);

// Start a trace
$traceId = $tracer->startTrace();
$logger->info('Operation started'); // Includes trace_id automatically

// Every log now includes:
// - trace_id: Current trace ID
// - span_id: Active span ID
// - timestamp: Microsecond precision
// - memory_usage: Current memory
// - service, environment, version: From global context
```

Automatic Context Propagation
When you start a span, all logs automatically include its ID:
```php
$span = $tracer->startSpan('tool_execution', [
    'tool' => 'calculate',
]);

$logger->info('Executing tool'); // Includes span_id

$tracer->endSpan($span);
```

This makes it trivial to correlate logs with traces in your log aggregation system (Elasticsearch, Loki, Splunk).
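Seen from the log backend’s side, each record then arrives with the correlation fields already attached. A rough illustration — the exact shape and field names are illustrative, not the logger’s guaranteed output format:

```php
// Roughly what one enriched record looks like once trace context and global context are added
$record = [
    'message'      => 'Executing tool',
    'level'        => 'info',
    'trace_id'     => '4bf92f3577b34da6a3ce929d0e0e4736',
    'span_id'      => '00f067aa0ba902b7',
    'timestamp'    => '2025-01-01T12:00:00.123456Z',
    'memory_usage' => 12582912,
    'service'      => 'agent-api',
    'environment'  => 'production',
    'version'      => '2.0.0',
];
```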
Distributed Tracing with Spans
Understanding Spans and Traces
A trace represents a single request or operation. A span represents a unit of work within that trace:
```
Trace ID: abc123
└─ agent_execution (Root Span) ................ 150ms
   └─ tool:calculate (Child Span) ............. 75ms
      └─ api_request (Grandchild Span) ........ 50ms
```

Creating Spans
```php
use ClaudeAgents\Observability\Tracer;

$tracer = new Tracer();
$traceId = $tracer->startTrace();

// Root span for the entire operation
$rootSpan = $tracer->startSpan('agent_workflow', [
    'operation' => 'math_calculation',
]);

// Child span for tool execution
$toolSpan = $tracer->startSpan('tool_execution', [
    'tool' => 'calculate',
    'expression' => '25 * 17',
], $rootSpan); // Pass parent span

// Add attributes dynamically
$toolSpan->setAttribute('result', 425);

// Add events (annotations)
$toolSpan->addEvent('calculation_completed', [
    'result' => 425,
]);

// Set status
$toolSpan->setStatus('OK'); // or 'ERROR'

// End span
$tracer->endSpan($toolSpan);

// End root span and trace
$tracer->endSpan($rootSpan);
$tracer->endTrace();
```

Span Attributes
Spans support rich metadata:
| Attribute Type | Purpose | Example |
|---|---|---|
| Input | Request parameters | prompt, expression, query |
| Output | Response data | answer, result, tokens |
| Timing | Performance markers | start_time, end_time, duration |
| Context | Correlation | user_id, session_id, agent_id |
| Status | Success/failure | OK, ERROR |
Analyzing Traces
```php
// Get all completed spans
$spans = $tracer->getSpans();

// Get spans for specific trace
$traceSpans = $tracer->getSpansByTraceId($traceId);

// Build hierarchical tree
$tree = $tracer->buildSpanTree();

// Calculate total duration
$totalDuration = $tracer->getTotalDuration();

// Export to OpenTelemetry format
$otelData = $tracer->toOpenTelemetry();
```

Metrics Collection
The Metrics Class
The Metrics class tracks operational metrics:
```php
use ClaudeAgents\Observability\Metrics;

$metrics = new Metrics();

// Record a successful request
$metrics->recordRequest(
    success: true,
    tokensInput: 100,
    tokensOutput: 50,
    duration: 1500.0, // milliseconds
);

// Record a failed request
$metrics->recordRequest(
    success: false,
    tokensInput: 0,
    tokensOutput: 0,
    duration: 500.0,
    error: 'RateLimitError: Too many requests',
);

// Get summary
$summary = $metrics->getSummary();
/*
[
    'total_requests' => 2,
    'successful_requests' => 1,
    'failed_requests' => 1,
    'success_rate' => 0.5,
    'total_tokens' => ['input' => 100, 'output' => 50, 'total' => 150],
    'total_duration_ms' => 2000.0,
    'average_duration_ms' => 1000.0,
    'error_counts' => ['RateLimitError' => 1],
]
*/
```

Key Metrics to Track
Agent Metrics

- Request count — Total agent invocations
- Success rate — % of successful completions
- Latency — p50, p95, p99 response times
- Token usage — Input, output, and total tokens
- Tool calls — Count and distribution by tool
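The latency percentiles listed above are easy to derive if you keep the raw per-request durations around. A minimal sketch of the nearest-rank method — in production you would normally let the metrics backend compute these from histogram buckets instead:

```php
/**
 * Nearest-rank percentile over an in-memory list of durations. $p is between 0 and 100.
 */
function percentile(array $durations, float $p): float
{
    sort($durations);
    $index = (int) ceil(($p / 100) * count($durations)) - 1;

    return $durations[max(0, $index)];
}

$durations = [820.0, 950.0, 1200.0, 1450.0, 2100.0, 3500.0, 5800.0];

echo 'p50: ' . percentile($durations, 50) . "ms\n"; // 1450ms
echo 'p95: ' . percentile($durations, 95) . "ms\n"; // 5800ms
echo 'p99: ' . percentile($durations, 99) . "ms\n"; // 5800ms
```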
Tool Metrics

- Invocation count — Per-tool usage
- Execution time — Per-tool latency
- Failure rate — Tool errors and retries
- Result size — Output data volume
System Metrics

- Active requests — Current concurrent operations
- Memory usage — Per-request memory footprint
- Error rates — By error type and severity
- Cost — Estimated API spend
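Cost is usually derived from the token counters you already track. A minimal sketch — the per-million-token prices are placeholders that you should replace with the current pricing for your model:

```php
/**
 * Estimate spend from token usage. Prices are illustrative placeholders (USD per 1M tokens).
 */
function estimateCostUsd(
    int $inputTokens,
    int $outputTokens,
    float $inputPricePerM = 3.0,
    float $outputPricePerM = 15.0
): float {
    return ($inputTokens / 1_000_000) * $inputPricePerM
         + ($outputTokens / 1_000_000) * $outputPricePerM;
}

$summary = $metrics->getSummary();

$cost = estimateCostUsd(
    $summary['total_tokens']['input'],
    $summary['total_tokens']['output'],
);

echo 'Estimated spend: $' . number_format($cost, 4) . "\n";
```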
Telemetry Service Integration
OpenTelemetry Support
The TelemetryService provides OpenTelemetry-compatible metrics:
```php
use ClaudeAgents\Services\Telemetry\TelemetryService;
use ClaudeAgents\Services\Settings\SettingsService;

// Configure telemetry
$settings = new SettingsService([
    'telemetry' => [
        'enabled' => true,
        'otlp' => [
            'endpoint' => 'http://localhost:4318/v1/metrics',
        ],
    ],
]);

$telemetry = new TelemetryService($settings);
$telemetry->initialize();
```

Metric Types
Counters — Cumulative Metrics
Counters only increase (request count, error count):
```php
// Increment counter
$telemetry->recordCounter('agent.requests.total', 1, [
    'agent' => 'math-assistant',
]);

$telemetry->recordCounter('tool.executions', 1, [
    'tool' => 'calculate',
    'status' => 'success',
]);
```

Gauges — Current Values
Gauges represent current state (active requests, memory usage):
```php
// Set current value
$telemetry->recordGauge('agent.active_requests', 5.0, [
    'agent' => 'math-assistant',
]);

$telemetry->recordGauge('system.memory_mb', 256.5);
```

Histograms — Distributions
Histograms track value distributions (latency, token counts):
```php
// Record latency
$telemetry->recordHistogram('agent.duration.ms', 1500.0, [
    'agent' => 'math-assistant',
]);

// Record token usage
$telemetry->recordHistogram('agent.tokens.input', 150.0);
```

Recording Agent Metrics
The recordAgentRequest() helper combines all metrics:
```php
$telemetry->recordAgentRequest(
    success: true,
    tokensInput: 100,
    tokensOutput: 50,
    duration: 1500.0,
);

// Equivalent to:
// - recordCounter('agent.requests.total')
// - recordCounter('agent.requests.success')
// - recordHistogram('agent.tokens.input', 100)
// - recordHistogram('agent.tokens.output', 50)
// - recordHistogram('agent.duration.ms', 1500)
```

Flushing Telemetry
Periodically export metrics to your backend:
```php
// Flush to OTLP endpoint
$telemetry->flush();
```
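For long-running workers and CLI scripts, one simple way to avoid losing buffered metrics is to register the flush at shutdown. A minimal sketch in plain PHP, assuming the $telemetry instance from the example above:

```php
// Flush whatever is buffered when the process exits (normal exit or fatal error)
register_shutdown_function(static function () use ($telemetry): void {
    $telemetry->flush();
});
```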
In production, flush on a periodic timer (for example every 60 seconds), on request completion, and at shutdown/teardown.

External Tracing Backends
Supported Platforms
claude-php-agent includes integrations for popular AI observability platforms:
| Platform | Focus | Best For |
|---|---|---|
| LangSmith | LangChain ecosystem | Multi-agent workflows, chains |
| LangFuse | Open-source LLM observability | Self-hosted, cost tracking |
| Arize Phoenix | ML observability | Model evaluation, debugging |
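Because the tracers in the following subsections expose the same start/record/end surface, a common pattern is to pick one from configuration. A sketch, assuming the constructor signatures shown below and an APP_TRACING_BACKEND environment variable of your own choosing:

```php
use ClaudeAgents\Services\Tracing\LangFuseTracer;
use ClaudeAgents\Services\Tracing\LangSmithTracer;
use ClaudeAgents\Services\Tracing\PhoenixTracer;

// Select the tracing backend from the environment; default to a local Phoenix instance
$tracer = match (getenv('APP_TRACING_BACKEND') ?: 'phoenix') {
    'langsmith' => new LangSmithTracer(
        apiKey: getenv('LANGSMITH_API_KEY'),
        projectName: 'php-agent-production',
    ),
    'langfuse' => new LangFuseTracer(
        publicKey: getenv('LANGFUSE_PUBLIC_KEY'),
        secretKey: getenv('LANGFUSE_SECRET_KEY'),
    ),
    default => new PhoenixTracer(
        endpoint: getenv('PHOENIX_ENDPOINT') ?: 'http://localhost:6006',
    ),
};
```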
LangSmith Integration

```php
use ClaudeAgents\Services\Tracing\LangSmithTracer;
use ClaudeAgents\Services\Tracing\TraceContext;

$tracer = new LangSmithTracer(
    apiKey: getenv('LANGSMITH_API_KEY'),
    projectName: 'php-agent-production',
);

// Start trace
$context = new TraceContext(
    traceId: bin2hex(random_bytes(16)),
    traceName: 'agent_calculation',
    inputs: ['query' => 'What is 25 * 17?'],
);

$tracer->startTrace($context);

// ... execute agent ...

// End trace with outputs
$context = new TraceContext(
    traceId: $context->traceId,
    traceName: $context->traceName,
    inputs: $context->inputs,
    outputs: ['answer' => '425', 'duration_ms' => 1500],
);

$tracer->endTrace($context);
```

LangFuse Integration

```php
use ClaudeAgents\Services\Tracing\LangFuseTracer;

$tracer = new LangFuseTracer(
    publicKey: getenv('LANGFUSE_PUBLIC_KEY'),
    secretKey: getenv('LANGFUSE_SECRET_KEY'),
);

// Record spans and metrics
$span = new Span(/* ... */);
$tracer->recordSpan($span);

$metric = new Metric('agent.duration.ms', 1500.0);
$tracer->recordMetric($metric);
```

Arize Phoenix Integration

```php
use ClaudeAgents\Services\Tracing\PhoenixTracer;

$tracer = new PhoenixTracer(
    // getenv() returns false (not null) when the variable is unset, so use ?: rather than ??
    endpoint: getenv('PHOENIX_ENDPOINT') ?: 'http://localhost:6006',
);

// Same API as other tracers
$tracer->startTrace($context);
$tracer->recordSpan($span);
$tracer->endTrace($context);
```

Production Observability System
Complete Observable Agent
Here’s a production-ready agent wrapper with full observability:
```php
class ObservableAgent
{
    public function __construct(
        private Agent $agent,
        private ObservabilityLogger $logger,
        private Tracer $tracer,
        private Metrics $metrics,
        private TelemetryService $telemetry,
        private string $agentName
    ) {
    }

    public function run(string $prompt): mixed
    {
        // Start trace
        $traceId = $this->tracer->startTrace();
        $rootSpan = $this->tracer->startSpan('agent_execution', [
            'agent' => $this->agentName,
            'prompt_length' => strlen($prompt),
        ]);

        $this->logger->info('Agent execution started', [
            'agent' => $this->agentName,
            'prompt' => substr($prompt, 0, 100),
        ]);

        $this->telemetry->recordCounter('agent.executions.started', 1, [
            'agent' => $this->agentName,
        ]);

        $startTime = microtime(true);

        try {
            $result = $this->agent->run($prompt);
            $duration = (microtime(true) - $startTime) * 1000;

            // Get token usage
            $usage = $result->getTokenUsage();
            $inputTokens = $usage['input'];
            $outputTokens = $usage['output'];

            // Record metrics
            $this->metrics->recordRequest(
                success: true,
                tokensInput: $inputTokens,
                tokensOutput: $outputTokens,
                duration: $duration
            );

            // Record telemetry
            $this->telemetry->recordAgentRequest(
                success: true,
                tokensInput: $inputTokens,
                tokensOutput: $outputTokens,
                duration: $duration
            );

            // Update span
            $rootSpan->setAttribute('answer_length', strlen($result->getAnswer()));
            $rootSpan->setAttribute('tool_calls', count($result->getToolCalls()));
            $rootSpan->setStatus('OK');

            $this->logger->info('Agent execution completed', [
                'agent' => $this->agentName,
                'duration_ms' => round($duration, 2),
                'tokens' => ['input' => $inputTokens, 'output' => $outputTokens],
            ]);

            return $result;
        } catch (\Throwable $e) {
            $duration = (microtime(true) - $startTime) * 1000;

            // Record failures
            $this->metrics->recordRequest(
                success: false,
                tokensInput: 0,
                tokensOutput: 0,
                duration: $duration,
                error: get_class($e) . ': ' . $e->getMessage()
            );

            $this->telemetry->recordAgentRequest(
                success: false,
                tokensInput: 0,
                tokensOutput: 0,
                duration: $duration,
                error: get_class($e) . ': ' . $e->getMessage()
            );

            $rootSpan->setStatus('ERROR', $e->getMessage());
            $this->logger->logException($e, 'Agent execution failed');

            throw $e;
        } finally {
            $this->tracer->endSpan($rootSpan);
            $this->tracer->endTrace();
        }
    }
}

// Initialize observability stack
$baseLogger = LoggerFactory::createConsole(LogLevel::INFO);
$tracer = new Tracer();
$logger = new ObservabilityLogger($baseLogger, $tracer);
$metrics = new Metrics();
$telemetry = new TelemetryService($settings);

// Create observable agent
$baseAgent = Agent::create($client)
    ->withTool($calculator)
    ->withSystemPrompt('You are a helpful math assistant.');

$agent = new ObservableAgent(
    agent: $baseAgent,
    logger: $logger,
    tracer: $tracer,
    metrics: $metrics,
    telemetry: $telemetry,
    agentName: 'math-assistant'
);

// Run with full observability
$result = $agent->run('What is 25 * 17?');
```

Monitoring Dashboards

Key Dashboard Panels

Request Volume and Health

- Request rate: 45 req/min current, 72 req/min peak (14:23)
- Success rate: 98.5% current (target: > 99.0%)

Latency Distribution

- Response time: p50 1,200 ms, p95 3,500 ms, p99 5,800 ms

Token Usage and Cost

- Input tokens: avg 150/request, 45K total
- Output tokens: avg 75/request, 22.5K total
- Estimated cost: $12.50/hour

Error Breakdown

- RateLimitError: 12 (60%)
- TimeoutException: 5 (25%)
- ValidationError: 3 (15%)

Alert Rules

Configure alerts for critical thresholds:
| Metric | Threshold | Action |
|---|---|---|
| Success Rate | < 95% | Page on-call engineer |
| p95 Latency | > 5000ms | Investigate performance |
| Error Rate | > 5% | Check error logs |
| Token Usage | > 1M/hour | Review cost optimization |
| Active Requests | > 100 | Check for runaway processes |
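If you don’t run a full alerting stack yet, the same thresholds can be checked in code. A minimal sketch that evaluates the Metrics summary from earlier against the table above — note the summary exposes an average rather than a p95, so the latency check is only an approximation, and the notification call is left as a stub:

```php
$summary = $metrics->getSummary();

$alerts = [];

if ($summary['total_requests'] > 0 && $summary['success_rate'] < 0.95) {
    $alerts[] = sprintf('Success rate dropped to %.1f%%', $summary['success_rate'] * 100);
}

if ($summary['average_duration_ms'] > 5000) {
    $alerts[] = sprintf('Average latency is %.0fms', $summary['average_duration_ms']);
}

foreach ($alerts as $alert) {
    // Replace with your paging/notification integration
    $logger->critical($alert);
}
```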
Observability Best Practices
1. Log Hygiene
✅ Use appropriate log levels:
```php
$logger->debug('Tool parameter validation passed'); // DEBUG
$logger->info('Agent execution started');           // INFO
$logger->warning('Retry attempt 2 of 3');           // WARNING
$logger->error('Tool execution failed');            // ERROR
$logger->critical('Database connection lost');      // CRITICAL
```

✅ Include correlation IDs:
```php
$logger->info('Request started', [
    'request_id' => $requestId,
    'user_id' => $userId,
    'trace_id' => $traceId,
]);
```

❌ Don’t log sensitive data:
```php
// BAD
$logger->info('User authenticated', [
    'password' => $password, // NEVER log credentials
    'api_key' => $apiKey,    // NEVER log secrets
]);

// GOOD
$logger->info('User authenticated', [
    'user_id' => $userId,
    'auth_method' => 'password',
]);
```

2. Span Design
✅ Create spans for significant operations:
- Agent execution
- Tool calls
- External API requests
- Database queries
- File operations
✅ Add meaningful attributes:
```php
$span->setAttribute('tool', 'calculate');
$span->setAttribute('expression', '25 * 17');
$span->setAttribute('result', 425);
$span->setAttribute('cache_hit', true);
```

❌ Don’t create excessive spans:
```php
// BAD: Too granular
$span1 = $tracer->startSpan('validate_input');
$span2 = $tracer->startSpan('parse_input');
$span3 = $tracer->startSpan('sanitize_input');

// GOOD: Appropriate granularity
$span = $tracer->startSpan('process_input');
```

3. Metric Selection
✅ Track actionable metrics:
- Success rate → Alerts for degradation
- Latency percentiles → Performance optimization
- Token usage → Cost management
- Error rates by type → Debugging priorities
✅ Use correct metric types:
```php
// Counter: Things that accumulate
$telemetry->recordCounter('requests.total');

// Gauge: Current state
$telemetry->recordGauge('active_requests', 5.0);

// Histogram: Distributions
$telemetry->recordHistogram('latency.ms', 1500.0);
```

❌ Don’t track vanity metrics:
```php
// BAD: Not actionable
$telemetry->recordCounter('button_clicks');
$telemetry->recordGauge('favorite_color');
```

4. Performance Impact
✅ Sample in high-volume scenarios:
```php
// Sample 1% of traces in production
if (mt_rand(1, 100) === 1) {
    $traceId = $tracer->startTrace();
}
```
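If you drive the rate from configuration instead (like the sampling.rate setting shown later under Production Architecture), the same gate works with a 0.0–1.0 rate:

```php
$rate = 0.1; // e.g. read from your telemetry settings ('sampling' => ['rate' => 0.1])

// Sample roughly $rate of all traces
if (mt_rand() / mt_getrandmax() < $rate) {
    $traceId = $tracer->startTrace();
}
```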
✅ Use async logging:

```php
// Queue logs for async processing
$logger->info('Event occurred', ['data' => $largeData]);
```

❌ Don’t block on observability:
```php
// BAD: Synchronous export blocks request
$telemetry->flush(); // Blocks for network call

// GOOD: Async export in background
dispatch(fn() => $telemetry->flush());
```

Production Architecture

Recommended Stack

```
PHP Application (claude-php-agent)
    │             │              │
  Logs         Traces         Metrics
    ▼             ▼              ▼
 Loki or       Tempo or      Prometheus
 Elastic       Jaeger        or Grafana
    └─────────────┼──────────────┘
                  ▼
     Grafana Dashboard + Alertmanager
```

OpenTelemetry Collector
For production, use the OpenTelemetry Collector as a central hub:
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: "jaeger:14250"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

Configuration

```php
$settings = new SettingsService([
    'telemetry' => [
        'enabled' => true,
        'otlp' => [
            'endpoint' => getenv('OTLP_ENDPOINT'),
            'headers' => [
                'Authorization' => 'Bearer ' . getenv('OTLP_TOKEN'),
            ],
        ],
        'sampling' => [
            'rate' => 0.1, // Sample 10% of traces
        ],
    ],
]);
```

Debugging with Observability

Finding Slow Requests

```php
// Query: Find requests > 5 seconds
// Prometheus: agent_duration_ms{quantile="0.95"} > 5000
// Loki: {agent="math-assistant"} | duration_ms > 5000
```

Identifying Error Patterns

```php
// Query: Error rate by type
// Prometheus: rate(agent_requests_failed[5m]) by (error_type)
// Loki: {agent="math-assistant"} | level="error" | json | count by error_type
```

Tracing Request Flow

```php
// Find trace by ID
$spans = $tracer->getSpansByTraceId($traceId);

// Analyze critical path
$tree = $tracer->buildSpanTree();
foreach ($tree as $node) {
    $span = $node['span'];
    echo "{$span->getName()}: {$span->getDuration()}ms\n"; // Identify slowest span
}
```

Exercise: Build Your Observability Dashboard
Task: Create a Grafana dashboard for your agent system.
Requirements:
1. Request Metrics Panel:
   - Request rate (requests/min)
   - Success rate (%)
   - Error rate (%)
2. Latency Panel:
   - p50, p95, p99 latency
   - Latency histogram
3. Token Usage Panel:
   - Input tokens/request
   - Output tokens/request
   - Estimated cost/hour
4. Error Panel:
   - Error count by type
   - Recent error logs
5. Alerts:
   - Success rate < 95%
   - p95 latency > 5s
   - Error rate > 5%
Summary
In this chapter, you learned:
✅ Structured Logging — PSR-3 integration with context enrichment
✅ Distributed Tracing — Spans, traces, and parent-child relationships
✅ Metrics Collection — Counters, gauges, histograms for dashboards
✅ OpenTelemetry — Industry-standard telemetry export
✅ External Platforms — LangSmith, LangFuse, Arize Phoenix integration
✅ Production Patterns — Observable agents, monitoring, and alerting
✅ Best Practices — Log hygiene, span design, metric selection, performance
Key Takeaways:
- Observability is not optional — Production systems need logs, traces, and metrics
- Instrument early — Add observability from the start, not as an afterthought
- Use structured logging — Context-rich logs beat string interpolation
- Correlate with traces — Link logs to traces via trace_id
- Track what matters — Focus on actionable metrics (success rate, latency, cost)
- Export to backends — Use OpenTelemetry for vendor-neutral telemetry
- Monitor and alert — Set thresholds for critical metrics
Next Steps
With observability in place, you can now measure quality. In the next chapter, you’ll build evaluation harnesses to systematically test agent accuracy, safety, and cost.
→ Chapter 17: Evaluation Harnesses and QA
Build offline evals, golden tests, and regression suites to measure accuracy, cost, and safety on real task sets.
Additional Resources
Documentation
External Platforms

- Jaeger — Distributed tracing
- Grafana Tempo — Distributed tracing backend
- Grafana Loki — Log aggregation
- Prometheus — Metrics and alerting