AI Agent Observability: Monitor, Debug, and Improve Your Agents in Production
March 24, 2026
By AgentMelt Team
Running AI agents in production without observability is like flying blind. Traditional application monitoring (uptime, CPU, memory) covers maybe 20% of what you need. The other 80%—LLM output quality, cost per task, hallucination rates, prompt drift—requires purpose-built tooling. Here is what to track, how to track it, and which tools to use.
Key metrics to monitor
Latency
Measure time-to-completion for the full agent task, not just individual LLM calls. An agent that makes 5 sequential LLM calls, 2 tool calls, and a database lookup has a very different latency profile than a single API call.
What to track:
- P50 latency: Median task completion time. This is your "normal" performance.
- P95 latency: The slowest 5% of tasks. Spikes here indicate edge cases or resource contention.
- P99 latency: Your worst-case performance. If P99 is 10x P50, you have reliability issues.
- Per-step breakdown: Which step takes the longest? LLM inference, tool execution, or data retrieval? You cannot optimize what you cannot isolate.
Benchmarks: For a typical multi-step agent (3-5 LLM calls + tool use), expect P50 of 5-15 seconds. P95 under 30 seconds is good. P95 above 60 seconds means something is wrong—investigate LLM latency spikes, slow tool APIs, or inefficient sequential processing that could be parallelized.
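A minimal sketch of the percentile math and per-step breakdown described above, using plain Python. The record fields (`total_s`, `steps`) are illustrative names, not from any particular framework:

```python
# Sketch: compute P50/P95/P99 task latency and find the dominant step
# from in-memory timing records. Field names are illustrative.
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def latency_report(tasks):
    """tasks: list of {'total_s': float, 'steps': {step_name: seconds}}."""
    totals = [t["total_s"] for t in tasks]
    by_step = defaultdict(list)
    for t in tasks:
        for name, secs in t["steps"].items():
            by_step[name].append(secs)
    return {
        "p50": percentile(totals, 50),
        "p95": percentile(totals, 95),
        "p99": percentile(totals, 99),
        # Which step takes the longest in aggregate? That is the one
        # worth optimizing (or parallelizing) first.
        "slowest_step": max(by_step, key=lambda n: sum(by_step[n])),
    }
```

In production you would pull these numbers from your tracing backend rather than compute them in-process, but the definitions are the same.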
Cost per task
Every agent execution has a dollar cost. Track it at the task level, not just the monthly bill.
What to track:
- LLM token costs: Input tokens + output tokens, priced by model tier. A task that uses 3,000 input tokens and 500 output tokens on GPT-4o costs about $0.01. The same task on Claude Opus costs $0.06.
- Tool/API costs: External API calls (search, database queries, SaaS integrations) often have per-call pricing.
- Infrastructure costs: Compute, storage, vector database queries. Amortize across tasks.
- Total cost per task: Sum of the above. Track the distribution, not just the average. A few expensive outlier tasks can skew the average.
Alert thresholds: Set alerts at 2x your average cost per task. If the average is $0.05 and a task costs $0.10+, investigate. Common causes: retry loops, excessively long context windows, or the agent calling expensive tools unnecessarily.
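The accounting above can be sketched in a few lines. The prices below are illustrative placeholders (USD per 1M tokens), not a current rate card; substitute your provider's published pricing:

```python
# Sketch: per-task cost accounting. Rates are illustrative assumptions
# (USD per 1M tokens) -- replace with your provider's current pricing.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},  # example rates only
}

def llm_cost(model, input_tokens, output_tokens):
    """Dollar cost of one LLM call at the (assumed) rates above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def task_cost(llm_calls, tool_cost=0.0, infra_cost=0.0):
    """Total cost of one task.

    llm_calls: list of (model, input_tokens, output_tokens) tuples.
    tool_cost / infra_cost: per-call API fees and amortized compute.
    """
    return sum(llm_cost(*c) for c in llm_calls) + tool_cost + infra_cost
```

Track the resulting per-task values as a distribution (e.g. in a histogram), since a few expensive outliers can hide behind a healthy average.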
Success and failure rates
Task success rate: What percentage of agent tasks complete successfully without human intervention? Target: 85-95% depending on task complexity.
Failure categories matter more than the overall rate:
- LLM errors: Rate limits, timeouts, malformed responses. These are infrastructure issues.
- Tool failures: External APIs down, authentication expired, unexpected response formats. These are integration issues.
- Logic failures: Agent completed the task but produced an incorrect result. These are quality issues and the hardest to detect automatically.
- Guardrail triggers: Agent attempted a disallowed action and was blocked. These might be working correctly (the guardrail caught a problem) or indicate the agent is misunderstanding the task.
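A simple sketch of bucketing outcomes into the four categories above, so the breakdown is visible and not just the overall rate. The result schema (`ok`, `category`) is an illustrative assumption:

```python
# Sketch: categorize failures so you can see *why* tasks fail, not just
# how often. The 'ok'/'category' fields are illustrative names.
from collections import Counter

CATEGORIES = ("llm_error", "tool_failure", "logic_failure", "guardrail_trigger")

def failure_breakdown(results):
    """results: list of {'ok': bool, 'category': str | None}."""
    counts = Counter(r["category"] for r in results if not r["ok"])
    total = len(results)
    return {
        "success_rate": sum(r["ok"] for r in results) / total,
        "by_category": {c: counts.get(c, 0) for c in CATEGORIES},
    }
```

The hard part is populating `category` correctly: infrastructure and integration failures surface as exceptions, but logic failures usually need an evaluation step to detect.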
Hallucination rate
The single most dangerous failure mode. The agent confidently produces incorrect information, and downstream systems or users act on it.
Detection methods:
- Factual grounding checks: Compare agent outputs against source data. If the agent says "the contract expires on March 15" but the contract says April 15, that is a hallucination. Automate this with a verification step using a separate LLM call.
- Citation verification: If your agent cites sources, verify the citations exist and say what the agent claims they say. Automated citation checking catches 60-80% of factual hallucinations.
- Confidence scoring: Some frameworks expose the model's confidence. Low confidence correlates with higher hallucination risk—flag and review these outputs.
- Human sampling: Review a random 5-10% sample of outputs weekly. This catches hallucinations that automated methods miss.
Benchmark: Well-configured RAG-based agents with good source data achieve hallucination rates of 2-5%. Without grounding, rates climb to 15-25%. If you are above 5%, improve your retrieval pipeline before scaling up.
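The "verification step using a separate LLM call" from the grounding-check bullet can be sketched as follows. `verifier` stands in for whatever LLM client you use (any callable taking a prompt string and returning text); the prompt format is an illustrative assumption:

```python
# Sketch: factual grounding check via a second LLM call. `verifier` is
# any prompt -> text callable; wire it to your LLM client of choice.
def grounding_check(claim, source_text, verifier):
    """Return True if the verifier says the source supports the claim."""
    prompt = (
        "Does the SOURCE support the CLAIM? Answer only YES or NO.\n"
        f"SOURCE:\n{source_text}\n\nCLAIM:\n{claim}"
    )
    return verifier(prompt).strip().upper().startswith("YES")

def hallucination_rate(checks):
    """checks: list of booleans from grounding_check over a sample."""
    return checks.count(False) / len(checks)
```

Run this on a sample of outputs (not necessarily all of them, since each check is itself an LLM call with its own cost) and trend the rate over time.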
Logging and tracing
Standard logging (text to a file) is not enough. You need distributed tracing that tracks an agent task across every step, LLM call, tool invocation, and decision point.
What a good trace includes:
- Unique trace ID linking all steps of a single task
- Parent-child relationships between steps (which LLM call triggered which tool call)
- Input and output for each step (with PII redaction)
- Token counts and costs per LLM call
- Latency per step
- Model version and prompt version used
- Any errors, retries, or fallbacks
Trace storage: Plan for volume. An agent handling 1,000 tasks per day with 5 steps per task generates 5,000 trace spans per day. At 2-5 KB per span, that is 10-25 MB per day, or 300-750 MB per month. Not huge, but it adds up. Set retention policies: full traces for 30 days, aggregated metrics for 12 months.
Drift detection
Agent performance degrades over time. This is not a bug—it is entropy. The world changes, user behavior shifts, source data is updated, and upstream APIs modify their responses.
Types of drift:
- Input drift: The distribution of inputs changes. Your support agent was trained on technical questions, but marketing started directing billing inquiries to it. Monitor input topic distribution over time.
- Output quality drift: The agent's outputs gradually decrease in quality. Often caused by source data going stale, retrieval quality degrading as the knowledge base grows, or model provider updates changing behavior.
- Cost drift: Average cost per task creeps up. Usually caused by inputs getting longer (more context), more retries, or tool usage patterns changing.
- Behavioral drift: The agent starts taking actions in different proportions than expected. If your routing agent suddenly sends 40% of tickets to Tier 2 instead of the usual 20%, something changed.
How to detect drift:
- Track weekly rolling averages for all key metrics. Compare current week to the 4-week average. Flag any metric that moves more than 15%.
- Run your evaluation test suite weekly against a fixed dataset. If accuracy drops below your threshold, investigate before it hits production.
- Monitor input/output distributions using statistical tests (population stability index, KL divergence) or simpler thresholds on category distributions.
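The first detection rule above (current week vs. 4-week average, 15% threshold) is small enough to sketch directly. The history format is an illustrative assumption:

```python
# Sketch: flag any metric whose current week moves more than 15% from
# its trailing 4-week average. History format is illustrative.
def drift_flags(history, threshold=0.15):
    """history: {metric_name: [wk1, wk2, wk3, wk4, current_week]}."""
    flags = {}
    for metric, values in history.items():
        *prior, current = values
        baseline = sum(prior[-4:]) / len(prior[-4:])
        change = abs(current - baseline) / baseline
        flags[metric] = change > threshold
    return flags
```

This catches gradual creep that per-event alerts miss; the statistical tests (PSI, KL divergence) serve the same purpose for input and output distributions rather than scalar metrics.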
Alerting
Set alerts that drive action, not noise.
Critical alerts (page someone):
- Task success rate drops below 70% in a 1-hour window
- Cost per task exceeds 5x the average
- Agent takes a disallowed action (guardrail breach)
- All LLM calls failing (provider outage)
Warning alerts (Slack notification):
- Success rate drops below 85% over 4 hours
- P95 latency exceeds 2x normal
- Daily cost exceeds 130% of budget
- Hallucination rate (from automated checks) exceeds threshold
Weekly digest (email/dashboard):
- Total tasks processed, success rate, cost
- Drift indicators
- Top failure reasons
- Cost trend (week-over-week)
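The critical and warning rules above can be expressed as one evaluation function over a metrics snapshot. The field names and channel labels are illustrative assumptions:

```python
# Sketch: evaluate the alert rules above and route by severity.
# Snapshot field names and output channels are illustrative.
def evaluate_alerts(m):
    """m: dict with success_rate_1h, success_rate_4h, cost_ratio (task
    cost / average), p95_ratio (P95 / normal), daily_cost_ratio
    (spend / budget), guardrail_breach, llm_all_failing."""
    critical, warning = [], []
    # Critical: page someone.
    if m["success_rate_1h"] < 0.70:
        critical.append("success rate < 70% (1h)")
    if m["cost_ratio"] > 5:
        critical.append("cost per task > 5x average")
    if m["guardrail_breach"]:
        critical.append("guardrail breach")
    if m["llm_all_failing"]:
        critical.append("LLM provider outage")
    # Warning: Slack notification.
    if m["success_rate_4h"] < 0.85:
        warning.append("success rate < 85% (4h)")
    if m["p95_ratio"] > 2:
        warning.append("P95 latency > 2x normal")
    if m["daily_cost_ratio"] > 1.30:
        warning.append("daily cost > 130% of budget")
    return {"page": critical, "slack": warning}
```

Most observability platforms let you configure these thresholds declaratively; the point of the sketch is that each rule compares a windowed metric to a fixed ratio, which keeps the alerts explainable.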
Observability tools
| Tool | Strengths | Pricing | Best For |
|---|---|---|---|
| LangSmith | Deep LangChain integration, trace visualization, eval framework, prompt versioning | Free tier (5K traces/month), Plus $39/month, Enterprise custom | Teams using LangChain/LangGraph |
| Helicone | LLM cost tracking, request logging, caching, rate limiting | Free tier (100K requests/month), Pro from $80/month | Cost-focused monitoring, multi-provider setups |
| Arize Phoenix | Open-source tracing, drift detection, embedding visualization | Free (open source), Enterprise for managed hosting | Teams wanting self-hosted observability |
| Weights & Biases (Weave) | Experiment tracking, evaluation, production monitoring | Free tier, Team $50/seat/month | Teams already using W&B for ML experiments |
| Portkey | Multi-provider gateway with built-in observability, caching, fallbacks | Free tier (10K requests/month), Growth from $49/month | Multi-model routing with monitoring |
| Braintrust | Eval-first platform with logging, scoring, and prompt management | Free tier, Pro from $50/month | Teams prioritizing automated evaluation |
| Datadog LLM Observability | Integrates with existing Datadog infrastructure monitoring | Add-on to Datadog pricing | Enterprise teams already on Datadog |
Recommendation: If you are starting from scratch, begin with LangSmith or Helicone—both have generous free tiers and cover the essentials. If you already have a monitoring stack (Datadog, Grafana), check if they offer LLM observability add-ons before adding a new tool.
Cost benchmarks for observability
Observability itself has a cost. Budget for it.
| Agent Volume | Recommended Setup | Estimated Monthly Cost |
|---|---|---|
| Under 5K tasks/month | LangSmith or Helicone free tier + manual weekly reviews | $0 |
| 5K-50K tasks/month | Helicone Pro or LangSmith Plus + automated drift alerts | $40-100/month |
| 50K-500K tasks/month | Dedicated observability platform + custom dashboards | $200-800/month |
| 500K+ tasks/month | Enterprise platform (Datadog LLM, Arize) + dedicated SRE time | $1,000-5,000/month |
As a rule of thumb, budget 3-5% of your total AI agent operating costs for observability. If you are spending $2,000/month on LLM APIs and infrastructure, spend $60-100/month on monitoring those systems.
Getting started: minimum viable observability
If you deploy one agent tomorrow and need observability by the end of the week, here is the minimum setup:
- Add tracing. Wrap your agent with LangSmith or Helicone. Both require adding 2-3 lines of code or an environment variable. This gives you traces, latency, and cost per task immediately.
- Set up cost alerts. Configure a daily budget cap with email alerts. Takes 10 minutes in any observability tool.
- Create a daily dashboard. Total tasks, success rate, average cost, P95 latency. Four numbers on one screen.
- Schedule a weekly review. Spend 30 minutes each week looking at failure logs, cost trends, and a sample of agent outputs. This catches drift that automated alerts miss.
You can build out from there—automated eval pipelines, drift detection, custom metrics—but these four steps cover the critical needs for any agent in production.
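The "four numbers on one screen" from step 3 reduce to a single aggregation over the day's task records. The record fields are illustrative assumptions; in practice these come from your tracing tool's API or export:

```python
# Sketch: the four-number daily dashboard -- total tasks, success rate,
# average cost, P95 latency. Record field names are illustrative.
def daily_dashboard(tasks):
    """tasks: list of {'success': bool, 'cost_usd': float, 'latency_s': float}."""
    n = len(tasks)
    lats = sorted(t["latency_s"] for t in tasks)
    p95 = lats[min(n - 1, round(0.95 * n) - 1)]  # nearest-rank P95
    return {
        "tasks": n,
        "success_rate": sum(t["success"] for t in tasks) / n,
        "avg_cost": sum(t["cost_usd"] for t in tasks) / n,
        "p95_latency_s": p95,
    }
```

If any of the four numbers looks wrong at a glance, the weekly review (step 4) is where you dig into the underlying traces.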
For building evaluation test suites, see How to Evaluate and Test AI Agents. For reducing the costs you are monitoring, read AI Agent Cost Optimization Guide. Explore the full AI Operations Agent niche for more production guides.