Written by Max Zeshut
Founder at Agentmelt
An inference optimization where the processed representation of static prompt content (system instructions, tool definitions, reference documents) is cached between requests so the model skips re-processing it on subsequent calls. Context caching is distinct from response caching—it caches the input processing, not the output. For agents with large, stable system prompts (common in production deployments), context caching reduces per-request latency by 30–60% and cost by 50–90% on the cached portion. Supported by Anthropic (prompt caching), Google (context caching), and OpenAI (automatic caching).
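Below is a minimal sketch of what opting into prompt caching might look like with Anthropic's Python SDK, since that provider is named above. The model string, file path, and ticket text are illustrative placeholders, and the exact `cache_control` fields should be verified against the current API documentation.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the large, stable system prompt (hypothetical file, ~12,000 tokens).
# Marking it cacheable means the prefix is processed once and stored; later
# requests that send the same prefix reuse the cached representation.
SYSTEM_PROMPT = open("support_agent_system_prompt.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any prompt-caching-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[
        {"role": "user", "content": "Customer reports a duplicate charge on invoice #4821."}
    ],
)

# Usage metadata reports how much of the prompt was written to or read from cache.
print(response.usage)
```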
A support agent has a 12,000-token system prompt with guardrails, tool definitions, and company policies. Without caching, every ticket processes all 12,000 tokens. With context caching, the prompt is processed once and reused for subsequent requests—cutting per-ticket cost from $0.036 to $0.004.
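The per-ticket figures above follow from simple per-token arithmetic. The sketch below assumes an illustrative rate of $3.00 per million uncached input tokens, with cached reads billed at roughly 10% of that rate; check current provider pricing before relying on these numbers.

```python
# Rough per-ticket cost comparison for the cached 12,000-token system prompt.
PROMPT_TOKENS = 12_000
INPUT_PRICE_PER_MTOK = 3.00      # USD per 1M uncached input tokens (assumed)
CACHED_READ_MULTIPLIER = 0.10    # cached reads at ~10% of base price (assumed)

uncached_cost = PROMPT_TOKENS / 1_000_000 * INPUT_PRICE_PER_MTOK
cached_cost = uncached_cost * CACHED_READ_MULTIPLIER

print(f"uncached: ${uncached_cost:.4f} per ticket")  # ~$0.0360
print(f"cached:   ${cached_cost:.4f} per ticket")    # ~$0.0036, i.e. ~$0.004
```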