Loading…
Loading…
AI agent costs scale with volume—and volume scales fast once an agent works. A naive production agent often costs 5-10x what a well-tuned version costs for the same workload. This playbook covers the optimization patterns that compound to dramatic savings: most teams that apply all of them cut monthly LLM spend by 60-85% without measurable quality regression. Each pattern is a discrete change; pick the ones that apply to your architecture and ship them one at a time.
Written by Max Zeshut
Founder at Agentmelt
Not every step needs the most capable model. Use a [[model-router]] to send simple sub-tasks (intent classification, entity extraction, summarization of short text) to a fast cheap model (Haiku, GPT-4o-mini, Gemini Flash) and reserve the expensive model (Opus, GPT-5, Claude Sonnet 4.x) for the steps that actually require its capabilities. Properly applied, this alone cuts costs 50-70% on most agents. Implementation: a small classifier or a heuristic decides per request; cascade to the bigger model only when confidence is low.
If your agent has a stable system prompt, tool definitions, or reference documents that don't change per request, [[prompt-caching]] is mandatory. Anthropic, OpenAI, and most major providers now support it—you mark the cacheable prefix and pay a fraction of the input-token cost on cache hits. For a high-volume agent with a 15K-token system prompt, prompt caching commonly drops total cost 60-85%. The implementation is usually 5-20 lines of code; ROI is hours, not weeks.
The cheapest token is the one you don't send. Audit what's in your context window on a typical agent run. Common waste: retrieving 20 chunks when 5 would do, sending the full document when the relevant section would suffice, including conversation history that no longer informs the current step. Add [[reranking]] to improve retrieval precision (lets you send fewer chunks without losing quality), prune older turns from multi-turn conversations, and chunk documents intelligently so retrieval can return exactly what's needed. Most agents have 30-50% of their context-window tokens going to material the model never uses.
When a step requires loading a lot of context that the parent agent doesn't need to remember (exploring a codebase, reviewing a long document, doing deep research), spin up a [[subagent]] for that step. The subagent has its own context window, does its work, and returns a compact summary. The parent agent's context stays lean, prompt caching keeps working, and total tokens drop dramatically. This is the single biggest win for long-running agents like coding assistants and research workflows.
Most major LLM providers offer a batch API that runs offline within 24 hours at roughly 50% the per-token cost. Anything that doesn't need real-time (overnight enrichment, weekly report generation, bulk content drafts, eval runs) should be batched. Engineering investment: small; cost saved: 50% on the batched portion. The trap is over-batching—if a task feels real-time to the user, the cost savings aren't worth the user experience hit.
Distinct from prompt caching, response caching stores the entire model output for identical inputs (or near-identical, using embedding-based similarity). For agents with skewed query distributions—where the top 5% of distinct questions account for 40% of traffic—response caching can eliminate that portion of calls entirely. Standard for support agents (FAQ-like volume), less useful for personalized agents (every query is different). Always invalidate the cache on knowledge-base updates so customers don't get stale answers.
Every optimization above can subtly degrade quality. Before you ship any of them, snapshot your eval set scores. Apply the optimization. Run the eval. Look at the diff. If quality drops more than your acceptance threshold (typically 1-2 percentage points), back out or tune. The teams that cut costs successfully treat each optimization as an experiment with a measurable quality contract—not as a code change that ships when it looks right.
Choose architectures that make optimization easier. Stateless agents (each request is independent) are easier to cache and route than stateful conversational agents. Tool-heavy agents (most work happens outside the LLM) are usually cheaper than long-context-heavy agents. Smaller, focused agents composed via [[subagent]] delegation are cheaper than monolithic agents that try to do everything in one prompt. Architecture decisions made in the first month determine the ceiling on what optimization can achieve later—and they're the hardest to change once traffic is real.
Model routing and prompt caching. Together they typically account for 70-80% of available savings, and both are quick to implement. Retrieval tuning and subagent isolation come next (each is moderately involved). Batch processing, response caching, and architecture changes are higher-effort and pay back over longer time horizons.
See the [[agent-cost-per-task]] glossary entry. Capture per-run: tokens in, tokens out, tool API costs, infrastructure, and human-review minutes converted to dollar terms. Tag every run with the agent version and the customer segment so you can decompose cost by dimension. Most observability platforms (LangSmith, Braintrust, Logfire) handle this automatically once configured.
When the engineering cost of the next optimization exceeds 6-12 months of savings at current volume. For a $5K/month agent, spending two engineer-weeks (~$20K loaded) to save 10% ($500/month) is a 3-year payback—not worth it. For a $50K/month agent, the same effort to save 10% pays back in 4 months. Re-evaluate cost optimization annually as both volume and engineer rates change.