AI Agent Observability: Monitor, Debug, and Improve Your Agents in Production
March 24, 2026
By AgentMelt Team
Running AI agents in production without observability is like flying blind. Traditional application monitoring (uptime, CPU, memory) covers maybe 20% of what you need. The other 80%—LLM output quality, cost per task, hallucination rates, prompt drift—requires purpose-built tooling. Here is what to track, how to track it, and which tools to use.
Key metrics to monitor
Latency
Measure time-to-completion for the full agent task, not just individual LLM calls. An agent that makes 5 sequential LLM calls, 2 tool calls, and a database lookup has a very different latency profile than a single API call.
What to track:
- P50 latency: Median task completion time. This is your "normal" performance.
- P95 latency: The slowest 5% of tasks. Spikes here indicate edge cases or resource contention.
- P99 latency: Your worst-case performance. If P99 is 10x P50, you have reliability issues.
- Per-step breakdown: Which step takes the longest? LLM inference, tool execution, or data retrieval? You cannot optimize what you cannot isolate.
Benchmarks: For a typical multi-step agent (3-5 LLM calls + tool use), expect P50 of 5-15 seconds. P95 under 30 seconds is good. P95 above 60 seconds means something is wrong—investigate LLM latency spikes, slow tool APIs, or inefficient sequential processing that could be parallelized.
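A minimal sketch of the percentile math and per-step breakdown described above, using plain Python. The record fields (`total_s`, `steps`) are illustrative names, not from any particular framework:

```python
# Sketch: compute P50/P95/P99 task latency and find the dominant step
# from in-memory timing records. Field names are illustrative.
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def latency_report(tasks):
    """tasks: list of {'total_s': float, 'steps': {step_name: seconds}}."""
    totals = [t["total_s"] for t in tasks]
    by_step = defaultdict(list)
    for t in tasks:
        for name, secs in t["steps"].items():
            by_step[name].append(secs)
    return {
        "p50": percentile(totals, 50),
        "p95": percentile(totals, 95),
        "p99": percentile(totals, 99),
        # Which step takes the longest in aggregate? That is the one
        # worth optimizing (or parallelizing) first.
        "slowest_step": max(by_step, key=lambda n: sum(by_step[n])),
    }
```

In production you would pull these numbers from your tracing backend rather than compute them in-process, but the definitions are the same.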
Cost per task
Every agent execution has a dollar cost. Track it at the task level, not just the monthly bill.
What to track:
- LLM token costs: Input tokens + output tokens, priced by model tier. A task that uses 3,000 input tokens and 500 output tokens on GPT-4o costs about $0.01. The same task on Claude Opus costs $0.06.
- Tool/API costs: External API calls (search, database queries, SaaS integrations) often have per-call pricing.
- Infrastructure costs: Compute, storage, vector database queries. Amortize across tasks.
- Total cost per task: Sum of the above. Track the distribution, not just the average. A few expensive outlier tasks can skew the average.
Alert thresholds: Set alerts at 2x your average cost per task. If the average is $0.05 and a task costs $0.10+, investigate. Common causes: retry loops, excessively long context windows, or the agent calling expensive tools unnecessarily.
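The accounting above can be sketched in a few lines. The prices below are illustrative placeholders (USD per 1M tokens), not a current rate card; substitute your provider's published pricing:

```python
# Sketch: per-task cost accounting. Rates are illustrative assumptions
# (USD per 1M tokens) -- replace with your provider's current pricing.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},  # example rates only
}

def llm_cost(model, input_tokens, output_tokens):
    """Dollar cost of one LLM call at the (assumed) rates above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def task_cost(llm_calls, tool_cost=0.0, infra_cost=0.0):
    """Total cost of one task.

    llm_calls: list of (model, input_tokens, output_tokens) tuples.
    tool_cost / infra_cost: per-call API fees and amortized compute.
    """
    return sum(llm_cost(*c) for c in llm_calls) + tool_cost + infra_cost
```

Track the resulting per-task values as a distribution (e.g. in a histogram), since a few expensive outliers can hide behind a healthy average.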
Success and failure rates
Task success rate: What percentage of agent tasks complete successfully without human intervention? Target: 85-95% depending on task complexity.
Failure categories matter more than the overall rate:
- LLM errors: Rate limits, timeouts, malformed responses. These are infrastructure issues.
- Tool failures: External APIs down, authentication expired, unexpected response formats. These are integration issues.
- Logic failures: Agent completed the task but produced an incorrect result. These are quality issues and the hardest to detect automatically.
- Guardrail triggers: Agent attempted a disallowed action and was blocked. These might be working correctly (the guardrail caught a problem) or indicate the agent is misunderstanding the task.
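A simple sketch of bucketing outcomes into the four categories above, so the breakdown is visible and not just the overall rate. The result schema (`ok`, `category`) is an illustrative assumption:

```python
# Sketch: categorize failures so you can see *why* tasks fail, not just
# how often. The 'ok'/'category' fields are illustrative names.
from collections import Counter

CATEGORIES = ("llm_error", "tool_failure", "logic_failure", "guardrail_trigger")

def failure_breakdown(results):
    """results: list of {'ok': bool, 'category': str | None}."""
    counts = Counter(r["category"] for r in results if not r["ok"])
    total = len(results)
    return {
        "success_rate": sum(r["ok"] for r in results) / total,
        "by_category": {c: counts.get(c, 0) for c in CATEGORIES},
    }
```

The hard part is populating `category` correctly: infrastructure and integration failures surface as exceptions, but logic failures usually need an evaluation step to detect.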
Hallucination rate
The single most dangerous failure mode. The agent confidently produces incorrect information, and downstream systems or users act on it.
Detection methods:
- Factual grounding checks: Compare agent outputs against source data. If the agent says "the contract expires on March 15" but the contract says April 15, that is a hallucination. Automate this with a verification step using a separate LLM call.
- Citation verification: If your agent cites sources, verify the citations exist and say what the agent claims they say. Automated citation checking catches 60-80% of factual hallucinations.
- Confidence scoring: Some frameworks expose the model's confidence. Low confidence correlates with higher hallucination risk—flag and review these outputs.
- Human sampling: Review a random 5-10% sample of outputs weekly. This catches hallucinations that automated methods miss.
Benchmark: Well-configured RAG-based agents with good source data achieve hallucination rates of 2-5%. Without grounding, rates climb to 15-25%. If you are above 5%, improve your retrieval pipeline before scaling up.
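The "verification step using a separate LLM call" from the grounding-check bullet can be sketched as follows. `verifier` stands in for whatever LLM client you use (any callable taking a prompt string and returning text); the prompt format is an illustrative assumption:

```python
# Sketch: factual grounding check via a second LLM call. `verifier` is
# any prompt -> text callable; wire it to your LLM client of choice.
def grounding_check(claim, source_text, verifier):
    """Return True if the verifier says the source supports the claim."""
    prompt = (
        "Does the SOURCE support the CLAIM? Answer only YES or NO.\n"
        f"SOURCE:\n{source_text}\n\nCLAIM:\n{claim}"
    )
    return verifier(prompt).strip().upper().startswith("YES")

def hallucination_rate(checks):
    """checks: list of booleans from grounding_check over a sample."""
    return checks.count(False) / len(checks)
```

Run this on a sample of outputs (not necessarily all of them, since each check is itself an LLM call with its own cost) and trend the rate over time.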
Logging and tracing
Standard logging (text to a file) is not enough. You need distributed tracing that tracks an agent task across every step, LLM call, tool invocation, and decision point.
What a good trace includes:
- Unique trace ID linking all steps of a single task
- Parent-child relationships between steps (which LLM call triggered which tool call)
- Input and output for each step (with PII redaction)
- Token counts and costs per LLM call
- Latency per step
- Model version and prompt version used
- Any errors, retries, or fallbacks
Trace storage: Plan for volume. An agent handling 1,000 tasks per day with 5 steps per task generates 5,000 trace spans per day. At 2-5 KB per span, that is 10-25 MB per day, or 300-750 MB per month. Not huge, but it adds up. Set retention policies: full traces for 30 days, aggregated metrics for 12 months.
Drift detection
Agent performance degrades over time. This is not a bug—it is entropy. The world changes, user behavior shifts, source data is updated, and upstream APIs modify their responses.
Types of drift:
- Input drift: The distribution of inputs changes. Your support agent was trained on technical questions, but marketing started directing billing inquiries to it. Monitor input topic distribution over time.
- Output quality drift: The agent's outputs gradually decrease in quality. Often caused by source data going stale, retrieval quality degrading as the knowledge base grows, or model provider updates changing behavior.
- Cost drift: Average cost per task creeps up. Usually caused by inputs getting longer (more context), more retries, or tool usage patterns changing.
- Behavioral drift: The agent starts taking actions in different proportions than expected. If your routing agent suddenly sends 40% of tickets to Tier 2 instead of the usual 20%, something changed.
How to detect drift:
- Track weekly rolling averages for all key metrics. Compare current week to the 4-week average. Flag any metric that moves more than 15%.
- Run your evaluation test suite weekly against a fixed dataset. If accuracy drops below your threshold, investigate before it hits production.
- Monitor input/output distributions using statistical tests (population stability index, KL divergence) or simpler thresholds on category distributions.
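The first detection rule above (current week vs. 4-week average, 15% threshold) is small enough to sketch directly. The history format is an illustrative assumption:

```python
# Sketch: flag any metric whose current week moves more than 15% from
# its trailing 4-week average. History format is illustrative.
def drift_flags(history, threshold=0.15):
    """history: {metric_name: [wk1, wk2, wk3, wk4, current_week]}."""
    flags = {}
    for metric, values in history.items():
        *prior, current = values
        baseline = sum(prior[-4:]) / len(prior[-4:])
        change = abs(current - baseline) / baseline
        flags[metric] = change > threshold
    return flags
```

This catches gradual creep that per-event alerts miss; the statistical tests (PSI, KL divergence) serve the same purpose for input and output distributions rather than scalar metrics.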
Alerting
Set alerts that drive action, not noise.
Critical alerts (page someone):
- Task success rate drops below 70% in a 1-hour window
- Cost per task exceeds 5x the average
- Agent takes a disallowed action (guardrail breach)
- All LLM calls failing (provider outage)
Warning alerts (Slack notification):
- Success rate drops below 85% over 4 hours
- P95 latency exceeds 2x normal
- Daily cost exceeds 130% of budget
- Hallucination rate (from automated checks) exceeds threshold
Weekly digest (email/dashboard):
- Total tasks processed, success rate, cost
- Drift indicators
- Top failure reasons
- Cost trend (week-over-week)
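The critical and warning rules above can be expressed as one evaluation function over a metrics snapshot. The field names and channel labels are illustrative assumptions:

```python
# Sketch: evaluate the alert rules above and route by severity.
# Snapshot field names and output channels are illustrative.
def evaluate_alerts(m):
    """m: dict with success_rate_1h, success_rate_4h, cost_ratio (task
    cost / average), p95_ratio (P95 / normal), daily_cost_ratio
    (spend / budget), guardrail_breach, llm_all_failing."""
    critical, warning = [], []
    # Critical: page someone.
    if m["success_rate_1h"] < 0.70:
        critical.append("success rate < 70% (1h)")
    if m["cost_ratio"] > 5:
        critical.append("cost per task > 5x average")
    if m["guardrail_breach"]:
        critical.append("guardrail breach")
    if m["llm_all_failing"]:
        critical.append("LLM provider outage")
    # Warning: Slack notification.
    if m["success_rate_4h"] < 0.85:
        warning.append("success rate < 85% (4h)")
    if m["p95_ratio"] > 2:
        warning.append("P95 latency > 2x normal")
    if m["daily_cost_ratio"] > 1.30:
        warning.append("daily cost > 130% of budget")
    return {"page": critical, "slack": warning}
```

Most observability platforms let you configure these thresholds declaratively; the point of the sketch is that each rule compares a windowed metric to a fixed ratio, which keeps the alerts explainable.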
Observability tools
| Tool | Strengths | Pricing | Best For |
|---|---|---|---|
| LangSmith | Deep LangChain integration, trace visualization, eval framework, prompt versioning | Free tier (5K traces/month), Plus $39/month, Enterprise custom | Teams using LangChain/LangGraph |
| Helicone | LLM cost tracking, request logging, caching, rate limiting | Free tier (100K requests/month), Pro from $80/month | Cost-focused monitoring, multi-provider setups |
| Arize Phoenix | Open-source tracing, drift detection, embedding visualization | Free (open source), Enterprise for managed hosting | Teams wanting self-hosted observability |
| Weights & Biases (Weave) | Experiment tracking, evaluation, production monitoring | Free tier, Team $50/seat/month | Teams already using W&B for ML experiments |
| Portkey | Multi-provider gateway with built-in observability, caching, fallbacks | Free tier (10K requests/month), Growth from $49/month | Multi-model routing with monitoring |
| Braintrust | Eval-first platform with logging, scoring, and prompt management | Free tier, Pro from $50/month | Teams prioritizing automated evaluation |
| Datadog LLM Observability | Integrates with existing Datadog infrastructure monitoring | Add-on to Datadog pricing | Enterprise teams already on Datadog |
Recommendation: If you are starting from scratch, begin with LangSmith or Helicone—both have generous free tiers and cover the essentials. If you already have a monitoring stack (Datadog, Grafana), check if they offer LLM observability add-ons before adding a new tool.
Cost benchmarks for observability
Observability itself has a cost. Budget for it.
| Agent Volume | Recommended Setup | Estimated Monthly Cost |
|---|---|---|
| Under 5K tasks/month | LangSmith or Helicone free tier + manual weekly reviews | $0 |
| 5K-50K tasks/month | Helicone Pro or LangSmith Plus + automated drift alerts | $40-100/month |
| 50K-500K tasks/month | Dedicated observability platform + custom dashboards | $200-800/month |
| 500K+ tasks/month | Enterprise platform (Datadog LLM, Arize) + dedicated SRE time | $1,000-5,000/month |
As a rule of thumb, budget 3-5% of your total AI agent operating costs for observability. If you are spending $2,000/month on LLM APIs and infrastructure, spend $60-100/month on monitoring those systems.
Getting started: minimum viable observability
If you deploy one agent tomorrow and need observability by the end of the week, here is the minimum setup:
- Add tracing. Wrap your agent with LangSmith or Helicone. Both require adding 2-3 lines of code or an environment variable. This gives you traces, latency, and cost per task immediately.
- Set up cost alerts. Configure a daily budget cap with email alerts. Takes 10 minutes in any observability tool.
- Create a daily dashboard. Total tasks, success rate, average cost, P95 latency. Four numbers on one screen.
- Schedule a weekly review. Spend 30 minutes each week looking at failure logs, cost trends, and a sample of agent outputs. This catches drift that automated alerts miss.
You can build out from there—automated eval pipelines, drift detection, custom metrics—but these four steps cover the critical needs for any agent in production.
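The "four numbers on one screen" from step 3 reduce to a single aggregation over the day's task records. The record fields are illustrative assumptions; in practice these come from your tracing tool's API or export:

```python
# Sketch: the four-number daily dashboard -- total tasks, success rate,
# average cost, P95 latency. Record field names are illustrative.
def daily_dashboard(tasks):
    """tasks: list of {'success': bool, 'cost_usd': float, 'latency_s': float}."""
    n = len(tasks)
    lats = sorted(t["latency_s"] for t in tasks)
    p95 = lats[min(n - 1, round(0.95 * n) - 1)]  # nearest-rank P95
    return {
        "tasks": n,
        "success_rate": sum(t["success"] for t in tasks) / n,
        "avg_cost": sum(t["cost_usd"] for t in tasks) / n,
        "p95_latency_s": p95,
    }
```

If any of the four numbers looks wrong at a glance, the weekly review (step 4) is where you dig into the underlying traces.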
For building evaluation test suites, see How to Evaluate and Test AI Agents. For reducing the costs you are monitoring, read AI Agent Cost Optimization Guide. Explore the full AI Operations Agent niche for more production guides.