Context engineering in 2026 — six patterns that work
Context engineering decides what your model sees at inference. Six patterns with code: ordering, caching, compaction, sub-agent isolation, and more.
By Ethan
Context engineering is more consequential than prompt engineering. The words you choose matter, but what the model has access to — and in what order — determines whether it can complete the task at all. Six patterns cover the overwhelming majority of failure modes in production LLM systems: context rot, position effects, redundant token costs, overflow in long conversations, unbounded orchestrator memory, and retrieval-at-load-time inefficiency. This article covers all six with working code.
Who this is for
Developers building or operating LLM-based systems who are hitting quality or cost ceilings. If you are still experimenting with basic chat prompts, the background section will be useful. If you are running agents in production, skip to the patterns.
What context engineering actually is
Andrej Karpathy defined it in June 2025: “the delicate art and science of filling the context window with just the right information for the next step.” Shopify CEO Tobi Lutke offered the practitioner version: “the art of providing all the context for the task to be plausibly solvable by the LLM.”
Anthropic formalized the term in a September 2025 engineering post: “the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference.”
The distinction from prompt engineering matters. Prompt engineering is about how you phrase a request. Context engineering is about what information the model has access to when it processes that request — the documents, the history, the tools, the retrieved facts, the agent notes. Getting the phrasing right on the wrong context will not save you.
Context rot: why bigger is not better
The attention mechanism computes every token against every other token. As context length grows, the useful signal gets diluted. Anthropic calls the resulting quality degradation “context rot.”
The empirical underpinning comes from Liu et al. (2023, “Lost in the Middle: How Language Models Use Long Contexts”). Testing multi-document QA and key-value retrieval across models with 4K–32K context windows, they found a U-shaped performance curve: accuracy is highest when the relevant information appears at the very beginning or end of the context window, and drops significantly when it sits in the middle. The models still received every document in between; they just made less use of them.
Anthropic’s Claude context documentation states the goal explicitly: “the smallest possible set of high-signal tokens.” Adding more context to compensate for weak retrieval or absent summarization compounds the problem.
Modern long-context models (Claude Opus 4.8 and Fable 5 support up to 1M tokens) do not eliminate context rot; they raise the ceiling at which it sets in. The discipline of managing what goes in still applies.
Position rule: query at the end
Anthropic’s internal testing shows placing the query after long documents improves response quality by up to 30% in multi-document tasks. The U-shaped attention curve explains why: information at the beginning and end gets the most attention. If your instruction sits at the top and a 50-page document follows, the question ends up tens of thousands of tokens away from the point where the model starts generating. Put the documents first and the query last, so the question occupies the final, well-attended position right before the answer.
import anthropic

client = anthropic.Anthropic()

# Documents at the top, query at the bottom
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": """<documents>
<document index="1">
<document_content>
[Your 50-page reference document here]
</document_content>
</document>
<document index="2">
<document_content>
[Second reference document here]
</document_content>
</document>
</documents>
Based on the documents above, which sections describe the liability conditions?""",
        }
    ],
)
print(response.content[0].text)
XML tags (<documents>, <document index="n">, <document_content>) separate content types so the model does not treat the retrieved text as part of your instruction. The query comes last, after the documents.
This directly reverses the naive habit of writing instructions first and context second. Both inputs remain in the context window — the difference is purely positional.
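The same layout can be generated rather than hand-written. The following is a minimal sketch, assuming a hypothetical helper named build_long_context_prompt (it is not part of the Anthropic SDK) and assuming the documents are plain strings that need no escaping.

```python
def build_long_context_prompt(documents: list[str], query: str) -> str:
    """Wrap each document in XML tags and place the query after all of them.

    Hypothetical helper for illustration; not part of any SDK.
    """
    parts = ["<documents>"]
    for index, text in enumerate(documents, start=1):
        parts.append(f'<document index="{index}">')
        parts.append("<document_content>")
        parts.append(text.strip())
        parts.append("</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    # The query goes last so it sits in the final, well-attended position.
    parts.append(query.strip())
    return "\n".join(parts)


prompt = build_long_context_prompt(
    ["First reference document.", "Second reference document."],
    "Which sections describe the liability conditions?",
)
print(prompt)
```

The returned string can be passed as the content of a single user message, as in the request above.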
Prompt caching: context as a cost lever
Repeated context (a large system prompt, a reference document, a set of few-shot examples) costs the same on every request unless you cache it. With cache_control, the Anthropic API stores the key-value tensor representation of a prefix. Cache writes cost 1.25× the normal input token price; cache reads cost 0.10×, a 90% discount. On workloads where the static prefix exceeds a few thousand tokens and you make multiple calls per session, caching pays off quickly, and subsequent requests arrive significantly faster because the KV tensors are already stored.
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM_PROMPT = """You are an expert analyzing legal contracts.
[... your 20,000-token reference document here ...]
"""

# First request: cache write (pays 1.25× base on cached tokens)
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a contract analyst."},
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        },
    ],
    messages=[{"role": "user", "content": "Summarize the termination clauses."}],
)
print("Cache write tokens:", response1.usage.cache_creation_input_tokens)
print("Cache read tokens:", response1.usage.cache_read_input_tokens)  # 0 on first call

# Second request (within 5 minutes): cache read (pays 0.10× base)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a contract analyst."},
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # must be byte-identical
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "List the indemnification provisions."}],
)
print("Cache read tokens:", response2.usage.cache_read_input_tokens)  # > 0 on hit
What breaks caching: any byte change in the prefix — a timestamp embedded in the system prompt, per-request user IDs placed before the breakpoint, a trailing space that varies. Audit your prompt construction code before assuming the prefix is stable. Concurrent first requests both miss the cache; warm with one serial call before parallel traffic.
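One way to run that audit is to fingerprint the prefix locally and log the fingerprint with every request. The sketch below is an assumption-laden illustration: prefix_fingerprint is a hypothetical helper, the hash is not the key the API uses internally, and it only covers system blocks (tool definitions, which also precede the breakpoint, are left out).

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the system blocks up to and including the last cache breakpoint.

    Hypothetical local check; a changing fingerprint between requests means
    the prefix is not stable and cache reads should not be expected.
    """
    last = max(
        (i for i, block in enumerate(system_blocks) if "cache_control" in block),
        default=-1,
    )
    prefix = system_blocks[: last + 1]
    payload = json.dumps(prefix, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


stable = [
    {"type": "text", "text": "Reference text", "cache_control": {"type": "ephemeral"}},
]
drifting = [
    # Same text with a trailing space: enough to change the prefix.
    {"type": "text", "text": "Reference text ", "cache_control": {"type": "ephemeral"}},
]
print(prefix_fingerprint(stable) == prefix_fingerprint(stable))
print(prefix_fingerprint(stable) == prefix_fingerprint(drifting))
```

Content placed after the last breakpoint is deliberately excluded, since it is expected to vary per request.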
Default TTL is 5 minutes; a 1-hour TTL is available, with cache writes billed at 2× the base input price. The 5-minute default resets on each cache hit, so active sessions stay warm at no extra cost.
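The multipliers make the break-even point easy to work out. This is a rough sketch using the 5-minute-TTL prices quoted above (1.25× write, 0.10× read); it assumes every call after the first hits the cache, and it ignores output tokens and the uncached part of the input. The function name relative_input_cost is illustrative only.

```python
def relative_input_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Cost of sending the same prefix `calls` times, in base-priced token units."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(prefix_tokens * calls)
    # One cache write at 1.25x, then (calls - 1) cache reads at 0.10x.
    # Work in hundredths to avoid floating-point noise.
    hundredths = 125 + 10 * (calls - 1)
    return prefix_tokens * hundredths / 100


# A 20,000-token prefix reused across 10 calls in one session.
print(relative_input_cost(20_000, 10, cached=False))  # → 200000.0
print(relative_input_cost(20_000, 10, cached=True))   # → 43000.0
```

Under these assumptions caching already wins on the second call (1.35× versus 2×), and only loses when the prefix is written once and never read.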
For a deeper comparison of caching across Anthropic, OpenAI, and Gemini, see prompt caching in 2026.
Six patterns for managing context
1. Compaction
When a session’s token count approaches the model’s capacity, summarize the history and reinitialize with the summary as the new context. The model loses verbatim recall but retains semantic continuity.
import anthropic

client = anthropic.Anthropic()

def summarize_conversation(messages: list[dict]) -> str:
    # Render the history as a plain transcript rather than a Python list repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    summary_prompt = f"""Summarize the following conversation, preserving all decisions,
findings, and open questions. Be precise: this summary becomes the only record.

{transcript}"""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    return response.content[0].text

def manage_context(messages: list[dict], token_threshold: int = 80000) -> list[dict]:
    # Rough estimate: 1 token ≈ 4 characters
    approx_tokens = sum(len(str(m)) // 4 for m in messages)
    # Only compact when there is history older than the last 2 turns
    if approx_tokens > token_threshold and len(messages) > 4:
        summary = summarize_conversation(messages[:-4])  # keep last 2 turns verbatim
        return [
            {"role": "user", "content": f"[Previous conversation summary]\n{summary}"},
            {"role": "assistant", "content": "Understood. Continuing from where we left off."},
        ] + messages[-4:]
    return messages

# Usage in a multi-turn loop
conversation = []
while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break
    conversation.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=conversation,
    )
    assistant_msg = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_msg})
    print(f"Assistant: {assistant_msg}")
    # Compact after a complete turn so the kept tail starts with a user message
    # and roles keep alternating after the synthetic summary pair
    conversation = manage_context(conversation)
Failure mode: Summarization is lossy. If the conversation contains a specific number, a quoted phrase, or a prior decision that will matter later, the summary may drop it. Either keep recent turns verbatim (as the example does with messages[-4:]) or write structured summaries that explicitly preserve decisions and open questions.
2. Structured external memory
For tasks that span context resets — multi-day projects, long-running agents — use a file as persistent memory. Write findings there, reinitialize the context from the file, and continue.
Anthropic’s engineering post describes this as structured note-taking: the agent keeps a file such as NOTES.md, reads it at the start of each context window, and updates it at the end. At reset, the new context starts from NOTES.md rather than from nothing.
The format matters. A file full of prose is hard to query. A file with dated, typed entries (## Decision: 2026-06-10: chose Postgres over SQLite because...) lets you or the model find a prior decision without reading the whole file.
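A dated, typed entry format is simple to write and to query. The sketch below assumes the one-line heading format shown above ("## Kind: YYYY-MM-DD: text"); append_entry and find_entries are hypothetical helper names, and the temporary directory only keeps the demo self-contained.

```python
import re
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

# Matches headings such as "## Decision: 2026-06-10: chose Postgres over SQLite"
ENTRY_HEADING = re.compile(
    r"^## (?P<kind>[^:]+): (?P<day>\d{4}-\d{2}-\d{2}): (?P<text>.*)$"
)


def append_entry(path: Path, kind: str, text: str, day: Optional[date] = None) -> None:
    """Append one typed, dated entry as a single heading line."""
    day = day or date.today()
    one_line = " ".join(text.split())  # keep each entry on one line
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"## {kind}: {day.isoformat()}: {one_line}\n")


def find_entries(path: Path, kind: str) -> list[tuple[str, str]]:
    """Return (date, text) pairs for every entry of the given kind."""
    if not path.exists():
        return []
    results = []
    for line in path.read_text(encoding="utf-8").splitlines():
        match = ENTRY_HEADING.match(line)
        if match and match["kind"].lower() == kind.lower():
            results.append((match["day"], match["text"]))
    return results


notes = Path(tempfile.mkdtemp()) / "NOTES.md"
append_entry(notes, "Decision", "chose Postgres over SQLite", date(2026, 6, 10))
append_entry(notes, "Open question", "do we need read replicas?", date(2026, 6, 11))
print(find_entries(notes, "decision"))
```

At a context reset, the agent can be given either the whole file or only the entry types relevant to the next step.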
3. Sub-agent isolation
The orchestrator’s context is the most expensive real estate in a multi-agent system. Each sub-agent call that returns a 10,000-token dump adds to a running total the orchestrator must carry. Sub-agent isolation contains that cost: the sub-agent gets a fresh, narrow context; it returns a brief summary; the orchestrator appends only the summary.
import anthropic

client = anthropic.Anthropic()

def run_sub_agent(task: str, context: str) -> str:
    """Sub-agent gets its own clean context. Returns a brief summary."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="You are a research assistant. Return a concise summary of your findings, 200 words max.",
        messages=[
            {
                "role": "user",
                "content": f"Task: {task}\n\nAvailable context:\n{context}",
            }
        ],
    )
    return response.content[0].text

# Orchestrator maintains a lean context
orchestrator_context = []

# Delegate research; sub-agent gets targeted context, not the whole conversation
research_summary = run_sub_agent(
    task="Find all references to rate-limiting in the provided API spec",
    context="[Paste the relevant API spec section here, not the full conversation history]",
)

# Orchestrator appends only the 200-word summary
orchestrator_context.append({
    "role": "user",
    "content": f"Research findings on rate-limiting:\n{research_summary}",
})

# Continue orchestrator work with a context that grows by ~200 tokens, not ~10,000
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are an API integration architect.",
    messages=orchestrator_context + [
        {"role": "user", "content": "Based on the findings, what retry strategy should we implement?"}
    ],
)
print(response.content[0].text)
Failure mode: The sub-agent summary loses detail. If the orchestrator later needs the verbatim rate-limit headers or exact error codes, a 200-word summary will not have them. Decide upfront what the orchestrator will need and ask for it explicitly in the sub-agent’s task.
For a complete TypeScript implementation of an agent with tools and memory, see how to build an AI agent in TypeScript — tools, memory, MCP.
4. Just-in-time retrieval
Loading a 500-page manual into the context at the start of every conversation costs tokens whether or not the conversation ever needs it. JIT retrieval loads information via tools at the moment it becomes relevant.
The pattern: give the model a search tool (get_manual_section(query: str)) instead of the full manual. On most questions, the model calls the tool once and retrieves the relevant section. The full document never enters context. For RAG-based systems, this is the mechanism; the context engineering discipline is deciding what to pre-load versus what to retrieve on demand.
When to pre-load: information the model will almost certainly need (configuration, user profile, current task). When to retrieve on demand: reference material, documentation, historical records.
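The tool side of this pattern could look roughly like the following. The tool definition uses the name / description / input_schema shape that the Anthropic Messages API accepts in its tools parameter; everything else (the MANUAL dictionary, the keyword-overlap scoring, and handle_tool_use) is a hypothetical stand-in for a real index and dispatcher, and no API call is made here.

```python
import re

# Hypothetical stand-in for a real manual or search index.
MANUAL = {
    "installation": "Run the installer, then restart the service.",
    "rate limits": "Clients may send 100 requests per minute.",
    "authentication": "Send the API key in the x-api-key header.",
}

# Tool definition, intended to be passed as tools=[GET_MANUAL_SECTION_TOOL].
GET_MANUAL_SECTION_TOOL = {
    "name": "get_manual_section",
    "description": "Return the manual section that best matches the query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Topic to look up."},
        },
        "required": ["query"],
    },
}


def get_manual_section(query: str) -> str:
    """Return the section whose title shares the most words with the query."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    best_title = max(MANUAL, key=lambda title: len(words & set(title.split())))
    if not words & set(best_title.split()):
        return "No matching section found."
    return f"{best_title}\n{MANUAL[best_title]}"


def handle_tool_use(name: str, tool_input: dict) -> str:
    """Dispatch a tool call requested by the model to local code."""
    if name == "get_manual_section":
        return get_manual_section(tool_input["query"])
    raise ValueError(f"Unknown tool: {name}")


print(handle_tool_use("get_manual_section", {"query": "What are the rate limits?"}))
```

In a live loop, the string returned by handle_tool_use would be sent back to the model as a tool result, so only the matching section enters the context.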
5. Sliding window
For streaming conversations where recency matters more than history, keep the last k message pairs in context and drop the rest.
This mirrors ConversationBufferWindowMemory from LangChain 0.0.x (deprecated in v0.3) and is equivalent to maintaining a capped messages deque in your own code. The tradeoff: the model loses awareness of decisions made before the window. Use compaction (pattern 1) if you need semantic continuity; use a sliding window if you need recency and can tolerate amnesia about older turns.
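A capped deque is enough to implement this yourself. The sketch below is one possible shape, with a hypothetical SlidingWindow class; it stores whole user/assistant pairs so the window can never start on an assistant message.

```python
from collections import deque


class SlidingWindow:
    """Keep only the last `max_pairs` user/assistant exchanges."""

    def __init__(self, max_pairs: int):
        if max_pairs < 1:
            raise ValueError("max_pairs must be at least 1")
        # deque drops the oldest pair automatically once maxlen is reached.
        self._pairs = deque(maxlen=max_pairs)

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self._pairs.append((user_text, assistant_text))

    def as_messages(self, next_user_text: str) -> list[dict]:
        """Build the messages list for the next request."""
        messages = []
        for user_text, assistant_text in self._pairs:
            messages.append({"role": "user", "content": user_text})
            messages.append({"role": "assistant", "content": assistant_text})
        messages.append({"role": "user", "content": next_user_text})
        return messages


window = SlidingWindow(max_pairs=2)
window.add_turn("q1", "a1")
window.add_turn("q2", "a2")
window.add_turn("q3", "a3")  # q1/a1 falls out of the window here
print([m["content"] for m in window.as_messages("q4")])
```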
6. Summarization middleware
Rather than compacting on overflow, run a background summarizer at a configurable token threshold. LangChain’s SummarizationMiddleware accepts a trigger={"tokens": N} parameter; when the conversation exceeds that token count, it automatically summarizes older messages, and keep={"messages": N} sets how many recent messages to preserve verbatim. The middleware handles the replacement without you instrumenting the conversation loop.
The advantage over manual compaction: the threshold is model-aware (it knows when to fire, not you), and the summary is appended as a system note rather than injected as a fake user message. The disadvantage: you give up control over what the summary retains.
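To make the trigger-and-keep behavior concrete without depending on LangChain, here is a library-free sketch. It is not LangChain's implementation: summarize_if_needed, approx_tokens, and the stub summarizer are all hypothetical, and the 4-characters-per-token estimate is a rough assumption. The summary is returned separately so the caller can place it in the system prompt instead of the message list.

```python
from typing import Callable


def approx_tokens(messages: list[dict]) -> int:
    """Rough estimate: 1 token per 4 characters of content."""
    return sum(len(m["content"]) for m in messages) // 4


def summarize_if_needed(
    messages: list[dict],
    running_summary: str,
    trigger_tokens: int,
    keep_messages: int,
    summarize: Callable[[str, list[dict]], str],
) -> tuple[list[dict], str]:
    """Fold older messages into the summary once the token trigger is exceeded."""
    if keep_messages < 1:
        raise ValueError("keep_messages must be at least 1")
    if approx_tokens(messages) <= trigger_tokens or len(messages) <= keep_messages:
        return messages, running_summary
    cut = len(messages) - keep_messages
    # Move the cut forward to a user message so the kept tail starts a turn.
    while cut < len(messages) and messages[cut]["role"] != "user":
        cut += 1
    if cut == len(messages):
        return messages, running_summary
    return messages[cut:], summarize(running_summary, messages[:cut])


def stub_summarize(previous: str, older: list[dict]) -> str:
    """Stand-in for a model call that merges older messages into the summary."""
    return f"{previous}[{len(older)} messages summarized]"


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 40}
    for i in range(5)
]
kept, summary = summarize_if_needed(history, "", 30, 3, stub_summarize)
print(len(kept), summary)
```

Calling this before every model request gives the same hands-off behavior as middleware, at the price of owning the summarizer prompt yourself.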
RAG is one mechanism, not the whole discipline
Retrieval-augmented generation (RAG) is a technique for populating context at runtime — fetch relevant documents from a vector store and include them in the prompt. It is a solid implementation of JIT retrieval (pattern 4 above), but it addresses one dimension of context engineering.
Context engineering also covers: system prompt design, tool-output trimming, conversation pruning, multi-agent state passing, caching strategy, and context ordering. A RAG pipeline that dumps 20 irrelevant chunks into the middle of the context window violates three of the patterns above simultaneously.
The conflation is common because RAG is where most teams hit their first context ceiling. The ceiling is a context engineering problem; RAG is one of several solutions.
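A retrieval step can apply the position and signal rules before anything reaches the prompt. The sketch below assumes the retriever returns (score, text) pairs with higher scores meaning more relevant; select_and_order, order_for_edges, and the thresholds are illustrative names and values, not part of any library.

```python
def order_for_edges(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_best_first):
        # Alternate: rank 0 goes to the front, rank 1 to the back, and so on.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def select_and_order(
    scored_chunks: list[tuple[float, str]], top_k: int, min_score: float
) -> list[str]:
    """Drop weak chunks, cap the count, then arrange the rest for the context."""
    kept = sorted(
        (pair for pair in scored_chunks if pair[0] >= min_score),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    return order_for_edges([text for _, text in kept])


scored = [(0.9, "A"), (0.2, "X"), (0.8, "B"), (0.7, "C"), (0.6, "D"), (0.5, "E")]
print(select_and_order(scored, top_k=5, min_score=0.4))
# → ['A', 'C', 'E', 'D', 'B']
```

The best chunk lands first, the second-best last, and the low-scoring chunk X never enters the context at all.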
What to track in production
The cache and context metrics that tell you whether these patterns are working:
- Cache hit rate: usage.cache_read_input_tokens / (usage.cache_read_input_tokens + usage.cache_creation_input_tokens) per request. Below 60% on prompts you expect to be stable means something is breaking cache keys.
- Context utilization: how many tokens of context you are actually sending versus the model’s limit. High utilization is not a problem unless quality is degrading; it is a signal to consider compaction.
- Middle-of-context placement: if your retrieval system is placing the most relevant chunk in position 50 of 100, it is in the lowest-attention zone. Sort retrieved chunks with most-relevant first or last.
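The first two metrics are plain ratios over fields the API already returns. A minimal sketch, assuming the inputs are copied from response.usage (cache_read_input_tokens, cache_creation_input_tokens, input_tokens) and that context_limit is whatever window your model offers; the function names are hypothetical.

```python
from typing import Optional


def cache_hit_rate(cache_read_tokens: int, cache_creation_tokens: int) -> Optional[float]:
    """Share of cacheable tokens that were served from cache on one request."""
    total = cache_read_tokens + cache_creation_tokens
    if total == 0:
        return None  # nothing was cacheable on this request
    return cache_read_tokens / total


def context_utilization(input_tokens: int, context_limit: int) -> float:
    """Fraction of the model's context window used by this request."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    return input_tokens / context_limit


rate = cache_hit_rate(cache_read_tokens=18_000, cache_creation_tokens=2_000)
print(rate)
print(rate is not None and rate < 0.60)  # True would flag a cache-key problem
print(context_utilization(input_tokens=150_000, context_limit=200_000))
```

Returning None for requests with no cacheable tokens keeps them from dragging an averaged hit rate toward zero.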
Observability tools like LangSmith surface these metrics per session without manual instrumentation. For cost attribution across context patterns, see the real cost of running an AI agent team in 2026. For reducing costs at the model-selection level, see LLM cost routing: when Haiku beats Opus.
Verdict
Context engineering is the discipline that separates working LLM prototypes from reliable production systems. The six patterns are not independent features to add one at a time — they compose. A production agent typically needs sub-agent isolation (3) to keep the orchestrator lean, JIT retrieval (4) to avoid preloading everything, compaction (1) for long sessions, and caching (see caching section) to control costs. The position rule and XML structuring are low-effort improvements you can apply to existing prompts today.
The Anthropic API documentation covers the current context window limits, cache_control syntax, and the MRCR/GraphWalks benchmarks Anthropic uses internally to measure long-context recall quality.
Primary sources
- Karpathy on context engineering: x.com/karpathy/status/1937902205765607626
- Simon Willison on the term’s origin: simonwillison.net/2025/Jun/27/context-engineering/
- Anthropic engineering post: anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Anthropic context windows docs: docs.anthropic.com/en/docs/build-with-claude/context-windows
- Anthropic prompt caching docs: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Anthropic prompting best practices: docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices
- Liu et al. (2023): semanticscholar.org/paper/Lost-in-the-Middle
- LangChain context engineering docs: docs.langchain.com/oss/python/langchain/context-engineering
- Survey of context engineering (arXiv 2025): arxiv.org/pdf/2507.13334