Context Engineering for AI Agents: The 2026 Stack That Replaces Prompt Engineering
Written by Max Zeshut
Founder at Agentmelt · Last updated Jun 17, 2026
TL;DR: "Context engineering" is the term that has quietly replaced prompt engineering in 2026. Andrej Karpathy and Tobi Lütke popularized it as a way to describe what actually makes agents reliable. The shift: a clever one-shot prompt doesn't carry a multi-turn agent. What carries it is everything else you put in the model's context window (system instructions, retrieved documents, tool results, prior steps, and persistent memory), assembled deliberately, in the right order, with the right compression. This post is the working stack: what context engineering is, the seven slots in a modern agent's context window, the four failure modes (poisoning, distraction, confusion, clash), and the five techniques production teams use to handle them. For the broader loop these patterns plug into, see agentic loops.
What is context engineering?
Andrej Karpathy popularized the term in mid-2025: "context engineering" is the discipline of filling the context window with just the right information for the next step. Tobi Lütke, Shopify's CEO, used it around the same time — "the art of providing all the context for the task to be plausibly solvable by the LLM." Anthropic's engineering team made it formal in late 2025: context engineering is the natural evolution of prompt engineering once agents run in loops instead of one-shot.
The shift matters because the failure modes are different. With prompt engineering you debug one string. With context engineering you debug seven assembly steps — and a bug in any of them silently degrades the agent over the next twenty turns.
The seven slots in a modern agent's context window
Every production agent's context window is assembled from the same set of pieces. Anthropic, OpenAI, and the LangChain/LangGraph and LlamaIndex teams all converge on roughly this stack (context engineering write-ups: Anthropic, LangChain, Phil Schmid):
- System prompt — the role, the constraints, the tone-of-voice, and the high-level objective. Stable across turns.
- Tool definitions — JSON schemas for the tools the model can call. Stable across turns; large.
- Long-term memory — facts about the user, past decisions, and prior conversations, retrieved selectively (not dumped wholesale).
- Retrieved knowledge — chunks from a vector store, a SQL query result, a web fetch, or another agent. The "RAG" slot.
- Conversation history — prior user turns and assistant responses, often compressed once the turn count gets high.
- Scratchpad / working memory — intermediate thoughts, plans, and state the model has chosen to write down between steps. The "external memory" trick.
- The current step's instruction — the actual prompt for this turn, often shorter than any of the above.
Two of these (system prompt, tool definitions) you author once, although which tool definitions you include can still be chosen per step. The other five you assemble dynamically on every turn, and that assembly is the engineering.
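To make the slot list concrete, here is a minimal sketch of one turn's assembly. The `ContextSlots` class, its field names, and the `## LABEL` section headers are illustrative assumptions, not any framework's API; a real agent would usually send structured messages rather than one joined string.

```python
from dataclasses import dataclass, field


@dataclass
class ContextSlots:
    """The seven slots listed above. All names here are hypothetical."""
    system_prompt: str                       # authored once, stable across turns
    tool_definitions: list[str]              # authored once, stable across turns
    long_term_memory: list[str] = field(default_factory=list)
    retrieved_knowledge: list[str] = field(default_factory=list)
    conversation_history: list[str] = field(default_factory=list)
    scratchpad: list[str] = field(default_factory=list)
    current_instruction: str = ""


def assemble_context(slots: ContextSlots) -> str:
    """Join the slots in a fixed order, skipping any slot that is empty."""
    sections = [
        ("SYSTEM", [slots.system_prompt]),
        ("TOOLS", slots.tool_definitions),
        ("MEMORY", slots.long_term_memory),
        ("RETRIEVED", slots.retrieved_knowledge),
        ("HISTORY", slots.conversation_history),
        ("SCRATCHPAD", slots.scratchpad),
        ("INSTRUCTION", [slots.current_instruction]),
    ]
    parts = []
    for label, items in sections:
        items = [item for item in items if item]
        if items:
            parts.append(f"## {label}\n" + "\n".join(items))
    return "\n\n".join(parts)
```

Keeping assembly in one function gives you a single place to log, test, and tighten the rules for what reaches the model.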
Why this replaced prompt engineering
In a single-shot use case (translate this, summarize that), the prompt is the context. There's nothing else. In an agent loop — observe, reason, act, observe again — the prompt is one slot of many, and it's usually short. What dominates the model's behavior is what you've retrieved, what's in scratchpad, and how much of the prior conversation you've kept.
Drew Breunig catalogued the four ways long-context agents go wrong, and they map directly to context-window assembly mistakes:
- Context poisoning — a hallucination, error, or stale fact lands in scratchpad early and the agent treats it as ground truth for the rest of the run.
- Context distraction — so much accumulated context that the model loses the current goal in the noise. DeepMind's Gemini 2.5 paper observed this above ~100K tokens even on long-context models.
- Context confusion — irrelevant content (extra tools, off-topic docs) pulls the model toward unrelated actions.
- Context clash — two pieces of retrieved or remembered content disagree and the model picks the wrong one.
You don't fix these with a better prompt. You fix them with assembly rules.
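As one example of what an assembly rule can look like, the sketch below handles context clash by keeping only the most recently dated fact per key before anything reaches the window. The `Fact` type and the "newest wins" policy are assumptions made for illustration; recency is only one possible tie-breaker, and some teams may prefer source authority or an explicit conflict flag instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    key: str      # what the fact is about, e.g. "refund_window_days"
    value: str
    as_of: date   # when the source was last updated


def resolve_clashes(facts: list[Fact]) -> dict[str, Fact]:
    """Keep only the most recently dated fact for each key."""
    newest: dict[str, Fact] = {}
    for fact in facts:
        current = newest.get(fact.key)
        if current is None or fact.as_of > current.as_of:
            newest[fact.key] = fact
    return newest
```

Running this over retrieved and remembered facts before assembly means the model never sees the disagreement, so it cannot pick the wrong side of it.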
The five techniques that move agents from "demo" to "reliable"
1. Write context, don't dump it
The naive RAG pattern is: search → top-k chunks → stuff them in. The result is a context window full of paragraphs the agent has to re-read every turn.
The pattern production teams use: an explicit scratchpad (LangGraph's "state," Claude's <thinking> tags, OpenAI's reasoning summaries). The agent writes down the conclusions it drew from each retrieval, not the raw text. Next turn, the scratchpad carries the conclusion in 50 tokens instead of 5,000. Anthropic's Claude 4 context-management writeup describes this as "memory tools" — let the agent decide what to keep.
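A minimal sketch of the idea, assuming a hypothetical `Scratchpad` class and a `summarize` callable that stands in for a model call. Here `first_sentence` is a trivial placeholder so the example runs without an API key; it is not how a real agent would write its conclusions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scratchpad:
    """Holds short conclusions, never the raw retrieved text."""
    notes: list[str] = field(default_factory=list)

    def record(self, source: str, conclusion: str) -> None:
        self.notes.append(f"[{source}] {conclusion}")

    def render(self) -> str:
        return "\n".join(self.notes)


def digest_retrieval(pad: Scratchpad, source: str, raw_text: str,
                     summarize: Callable[[str], str]) -> None:
    """Store the conclusion drawn from a retrieval and discard the raw text."""
    pad.record(source, summarize(raw_text))


def first_sentence(text: str) -> str:
    """Placeholder for a model call that would write the conclusion."""
    return text.split(". ")[0].rstrip(".") + "."
```

On the next turn only `pad.render()` goes into the window, so the cost of a retrieval is paid once rather than on every step.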
2. Select context, don't include everything
For long-term memory and tool definitions, retrieve before you include. Cursor and Replit (per their public engineering writeups) don't ship all available tools to the model — they retrieve the relevant subset for the current step. Same idea for memory: don't dump a user profile; retrieve the three facts that matter for this turn.
A useful rule: anything that's the same on every turn (system prompt, core instructions) goes once at the top. Anything that varies per turn (retrieved docs, scratchpad updates) goes near the bottom, closest to the current question.
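The sketch below illustrates both halves of this section: picking a relevant subset of tools, then ordering stable content before per-turn content. The word-overlap `score` is a deliberately crude stand-in chosen so the example runs with no dependencies; production systems would more likely use embeddings or a reranker. All function names are hypothetical.

```python
def score(query: str, description: str) -> int:
    """Count words shared by the query and a tool description."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def select_tools(query: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k tool names with a non-zero score, best match first."""
    scored = [(score(query, desc), name) for name, desc in tools.items()]
    relevant = [(s, name) for s, name in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in relevant[:k]]


def order_context(stable: list[str], per_turn: list[str], question: str) -> list[str]:
    """Stable content first, per-turn content next, current question last."""
    return [*stable, *per_turn, question]
```

The same selection function can be pointed at memory entries instead of tool descriptions; only the corpus changes.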
3. Compress when the window crosses a threshold
Long-running agents will fill any window you give them. Two compression patterns dominate:
- Summarize the conversation tail. Once history crosses some threshold (say, 50% of the model's window), replace the oldest turns with a model-written summary. Claude Code and Cursor do this transparently; LangGraph exposes it as a node.
- Rewrite the scratchpad. Every N turns, ask the agent to rewrite its own scratchpad to keep only what's load-bearing. This is the "self-reflection on context" loop and it dramatically slows context distraction.
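A rough sketch of the first pattern, summarizing the oldest turns once history crosses a share of the window. The four-characters-per-token estimate, the `keep_recent` default, and the `summarize` callable (which would be a model call in practice) are all assumptions for illustration; use your model's real tokenizer for actual budgeting.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token), not a tokenizer."""
    return max(1, len(text) // 4)


def compress_history(
    history: list[str],
    window_tokens: int,
    summarize: Callable[[list[str]], str],
    threshold: float = 0.5,
    keep_recent: int = 4,
) -> list[str]:
    """Replace the oldest turns with one summary once history exceeds the threshold."""
    total = sum(estimate_tokens(turn) for turn in history)
    if total <= window_tokens * threshold or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"Summary of earlier turns: {summarize(old)}", *recent]
```

Keeping the most recent turns verbatim is a design choice: the latest exchanges usually matter most for the next step, while older ones tolerate lossy compression better.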
4. Isolate context across sub-agents
A multi-agent system that shares one context window between every sub-agent is a mess by turn 10. The pattern that works (Anthropic's multi-agent research system): each sub-agent has its own context, the orchestrator only sees the sub-agent's summary, and the orchestrator's window stays small. This is sub-agent context isolation, and it's the single biggest reason multi-agent systems started shipping in 2025.
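In code, the isolation boundary can be as simple as the sketch below. `SubAgent`, `Orchestrator`, and `fake_research` are hypothetical names invented for this example and do not reflect how Anthropic's system is implemented; the point is only that raw working context stays private and a summary is all that crosses the boundary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SubAgent:
    name: str
    # Hypothetical worker: does the task, logging into the private context,
    # and returns a short summary.
    work: Callable[[str, list[str]], str]
    context: list[str] = field(default_factory=list)  # never shared

    def run(self, task: str) -> str:
        self.context.append(f"task: {task}")
        return self.work(task, self.context)


@dataclass
class Orchestrator:
    context: list[str] = field(default_factory=list)

    def delegate(self, agent: SubAgent, task: str) -> None:
        """Only the sub-agent's summary enters the orchestrator's window."""
        summary = agent.run(task)
        self.context.append(f"{agent.name}: {summary}")


def fake_research(task: str, context: list[str]) -> str:
    """Stand-in for real tool calls that would flood a shared window."""
    for i in range(50):
        context.append(f"raw search result {i} for {task}")
    return f"Found 50 results for '{task}'; top source is result 0."
```

After one delegation the sub-agent holds 51 context entries while the orchestrator holds one, which is the property that keeps the orchestrator's window small.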
5. Test context, don't just test prompts
Evals for context-engineered agents are different. You can't just test the prompt; you have to test what the model sees after retrieval and assembly. The practical pattern: log every assembled context as a trace (LangSmith, Phoenix, Helicone, Anthropic's own console), score the traces, find the bad assemblies, and tighten the rules. The eval target shifts from "does the prompt work?" to "did we put the right things in the window?"
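A minimal version of that loop needs nothing more than plain Python. The sketch below does not use the API of LangSmith, Phoenix, or any other tracing product; `ContextTrace` and the substring-based score are assumptions chosen to keep the example self-contained, and a real eval would likely use a more forgiving match or a model-graded score.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ContextTrace:
    turn: int
    assembled_context: str      # exactly what the model saw on this turn
    required_facts: list[str]   # what a reviewer says should have been present


def score_trace(trace: ContextTrace) -> float:
    """Fraction of required facts that literally appear in the context."""
    if not trace.required_facts:
        return 1.0
    text = trace.assembled_context.lower()
    hits = sum(fact.lower() in text for fact in trace.required_facts)
    return hits / len(trace.required_facts)


def worst_traces(traces: list[ContextTrace], n: int = 10) -> list[ContextTrace]:
    """Return the n lowest-scoring traces for manual review."""
    return sorted(traces, key=score_trace)[:n]


def to_jsonl(traces: list[ContextTrace]) -> str:
    """Serialize traces one JSON object per line for later inspection."""
    return "\n".join(json.dumps(asdict(trace)) for trace in traces)
```

Reviewing the lowest-scoring traces first points you at the assembly rules that most need tightening.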
How this connects to the agentic loop
Context engineering is the what's-in-the-window discipline; the agentic loop is the what-happens-each-turn discipline. They're paired: the loop runs observe → reason → act, and context engineering decides what observe puts into the model's hands for the reason step. Every loop iteration is a fresh context-assembly problem.
The practical implication for builders: when an agent fails, the first question isn't "is the prompt wrong?" — it's "is the context wrong?" Did the right docs get retrieved? Did stale scratchpad poison the run? Did history compression drop the key fact? Most production debugging is context debugging.
What to do this week
If you're building agents and haven't moved to context engineering as a discipline, here's a tight plan:
- Log the assembled context for every turn. Just write it to a trace. You can't fix what you can't see.
- Inventory your seven slots. Which are stable? Which are dynamic? Which one is the biggest? (For most agents, it's retrieval — and most retrieval is too wide.)
- Add a scratchpad. Even a simple one — let the agent write down its plan and refer back to it.
- Compress conversation history at 50% window usage. Don't wait for distraction; preempt it.
- Run a context-eval pass. Pick ten failing traces, look at the assembled context, and write down what's wrong. The fixes will be obvious.
Further reading
- Anthropic Engineering — Effective Context Engineering for AI Agents
- LangChain — Context Engineering for Agents
- Drew Breunig — How Long Contexts Fail
- Phil Schmid — The New Skill in AI is Not Prompting, It's Context Engineering
- Karpathy on context engineering — post
For the operating model these patterns plug into, see the pillar guide on agentic loops. For the memory layer specifically, see AI agent memory. For when prompt engineering still applies, see prompt engineering for AI agents.