Written by Max Zeshut
Founder at Agentmelt
Agent latency is the total time from when a user sends a request to when the AI agent delivers its final response or completes its action. It includes LLM inference time, tool execution time, retrieval latency, and any intermediate processing. Agent latency is a critical UX and adoption metric: users tolerate different latencies depending on context (sub-second for chat, 2-5 seconds for complex queries, minutes for background tasks). Optimizing agent latency involves model selection (smaller models for simple tasks), caching (semantic and exact-match), parallel tool execution, streaming responses (which shorten the time to first token rather than the total time), and architecture choices (local vs. cloud inference).
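To make one of these techniques concrete, here is a minimal sketch of parallel tool execution using Python's `asyncio.gather`. The tool functions `search_knowledge_base` and `lookup_crm` are hypothetical, and `asyncio.sleep` stands in for real network calls with placeholder delays. The point it illustrates is that independent calls run concurrently cost roughly as much as the slowest one, not the sum of all of them.

```python
import asyncio
import time


# Hypothetical tools: asyncio.sleep simulates network latency (placeholder values).
async def search_knowledge_base(query: str) -> str:
    await asyncio.sleep(0.05)
    return f"docs for {query!r}"


async def lookup_crm(customer_id: str) -> str:
    await asyncio.sleep(0.08)
    return f"record for {customer_id}"


async def gather_context(query: str, customer_id: str) -> tuple[str, str, float]:
    """Run both independent lookups concurrently and time the combined step."""
    start = time.perf_counter()
    docs, record = await asyncio.gather(
        search_knowledge_base(query),
        lookup_crm(customer_id),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    return docs, record, elapsed_ms


docs, record, elapsed_ms = asyncio.run(gather_context("reset password", "cust-42"))
print(f"context ready in {elapsed_ms:.0f} ms")
```

This only helps when the calls do not depend on each other. If the CRM lookup needed the retrieval result as input, the two steps would have to stay sequential.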
A support agent's latency breaks down as: retrieval from knowledge base (200ms) + LLM inference (800ms) + CRM lookup (300ms) + response streaming start (50ms) = 1,350ms to first token. The team reduces this to about 750ms by running retrieval and the CRM lookup in parallel (300ms, the slower of the two) and switching to a faster model for simple queries (400ms), with the 50ms streaming start unchanged. Semantic caching for the top 100 questions cuts it further: a cache hit takes 50ms in total.
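The arithmetic behind this breakdown can be checked with a short sketch. It uses the stage timings stated in the example and two hypothetical helpers, `sequential` and `parallel`. It assumes that stages run one after another add up, that stages run concurrently cost as much as the slowest one, and that the optimized path still pays the 50ms streaming start.

```python
# Stage timings in milliseconds, taken from the support-agent example.
RETRIEVAL = 200
INFERENCE = 800
CRM_LOOKUP = 300
STREAM_START = 50
FAST_INFERENCE = 400  # faster model for simple queries
CACHE_HIT = 50        # semantic cache hit, total


def sequential(*stages_ms: int) -> int:
    """Stages that run one after another: latencies add up."""
    return sum(stages_ms)


def parallel(*stages_ms: int) -> int:
    """Stages that run concurrently: the slowest one sets the latency."""
    return max(stages_ms)


baseline = sequential(RETRIEVAL, INFERENCE, CRM_LOOKUP, STREAM_START)
lookups = parallel(RETRIEVAL, CRM_LOOKUP)
# Assumption: the optimized path still includes the streaming start.
optimized = sequential(lookups, FAST_INFERENCE, STREAM_START)

print(f"baseline:  {baseline} ms to first token")
print(f"lookups:   {lookups} ms when run in parallel")
print(f"optimized: {optimized} ms to first token")
print(f"cache hit: {CACHE_HIT} ms total")
```

Writing the budget down this way makes it easy to see which stage dominates. Here inference is the largest single term both before and after the optimization, so model choice matters more than further tuning of the lookups.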