01
START HEREThe 30-second version, then the actual mechanics
People building multi-step Claude agents keep hitting the same wall: the cost doesn't scale linearly with the work — it scales with the square of the conversation. A 10-step agent doesn't cost 10× a single step; it can cost far more. This guide is the why and the three fixes, all real Claude/Claude Code features you can paste today.
The root cause is one fact most people skip past: the Claude API is stateless. It has no memory of your conversation between calls. So every time your agent takes a step, your code re-sends the entire transcript so far — the system prompt, every prior message, and every prior tool result — just so the model has the context to answer. Step 2 re-pays for step 1. Step 10 re-pays for steps 1 through 9. That re-payment is where the tokens go.
The community has been loud about the bill — people report eye-watering single-run costs (more on that, honestly, below). The good news: there are three documented levers that attack the re-payment directly. Skim the table, then steal the snippets.
- The cause: stateless API → every step re-sends the full prior context at full price
- Fix 1 — shrink what you carry: context editing (prunes stale tool results) + compaction (summarizes when near the limit)
- Fix 2 — stop re-paying for what's stable: prompt caching — cache reads cost ~0.1× of full input price
- Fix 3 — fan out correctly: parallel calls can't read a cache that's still being written — send one, await its first token, then fire the rest
Everything below is a real Claude/Claude Code feature, verified against Anthropic's own API docs. No invented parameters. Where a number is illustrative (the worked example), it's labeled as such with the arithmetic shown — your numbers will vary.
02
THE ROOT CAUSEWhy multi-step agents blow tokens: the stateless tax
Here's the part that surprises people. When you call Claude, the model doesn't remember the last call. The API is stateless — each request is self-contained. For a chatbot that's fine; for a multi-step agent it's the whole problem.
To give the model the context to take step N, your code (or the agent harness) sends the complete conversation up to that point: the system prompt, every user and assistant message, and crucially every tool result from every prior step — file reads, search dumps, API responses, the lot. Those tool results are usually the heaviest part. So the input you pay for grows with every single step, and you pay for the accumulated history again on each one.
That's the 'stateless tax.' It's not a bug and it's not avoidable in the raw API — it's how stateless inference works. What you can control is how big that re-sent context is, and how much of it you pay full price for. The three patterns below are exactly those two levers: make the carried context smaller, and make the unchanged part of it cheap.
| What gets re-sent every step | Why it's there | What it costs you |
|---|
| System prompt + tool definitions | The model needs its instructions and tool schemas on every call | Re-paid full price every step unless cached |
| All prior messages | Stateless API — no server-side memory of the conversation | Grows linearly; re-paid every step |
| All prior tool results | The model reasons over what earlier steps found | Usually the biggest chunk; grows fastest; re-paid every step |
| The new instruction for this step | The only genuinely new tokens | Small — this is the part you'd want to be paying for |
Mental model: a naive 10-step agent re-sends roughly the same large context 10 times. The patterns below either shrink that context (Pattern 1) or make most of it read from cache at ~0.1× (Pattern 2).
03
PATTERN 1 · PRUNE & SUMMARIZEPattern 1 — Shrink the carried context between steps
If every step re-sends the whole transcript, the first lever is obvious: make the transcript smaller. Claude gives you two distinct, real mechanisms for this — and they do different things, so use the right one.
Context editing prunes. As turns accumulate, it clears stale content — old tool results and completed thinking blocks — out of the transcript based on configurable thresholds. It does not summarize; it removes. That depth-3 file dump from step 2 that nothing references anymore? Gone, so you stop re-sending it. (Docs: build-with-claude/context-editing.)
Compaction summarizes. When the conversation approaches the context-window limit, the server condenses earlier history into a compaction block, server-side. The critical, easy-to-miss rule: you must append the response's full content (response.content) back onto your messages each turn — not just the text. The compaction block lives in that content, and the API uses it to replace the compacted history on the next request. Extract only the text and you silently lose the compaction state. (Beta header compact-2026-01-12.)
Claude Code does compaction automatically — that 'compacting conversation' you see is this, native. If you're building on the raw API, here's the conceptual shape.
- Enable compaction with the beta header and a context-management edit on the request
- On every turn, append the WHOLE response.content back to messages (compaction block included) — not just block.text
- Optionally add context editing to prune stale tool results as turns pile up — pruning + summarizing compose well
- Verify it's working: a long run whose carried context stops growing (or shrinks) is the tell
Conceptual code shape (Python SDK):
messages.append({"role": "user", "content": user_msg})
resp = client.beta.messages.create(
betas=["compact-2026-01-12"],
model="claude-opus-4-8",
max_tokens=16000,
messages=messages,
context_management={"edits": [{"type": "compact_20260112"}]},
)
# APPEND FULL CONTENT — the compaction block must be preserved:
messages.append({"role": "assistant", "content": resp.content})
Appending only the text string is the #1 way this silently breaks.
04
PATTERN 2 · PROMPT CACHINGPattern 2 — Cache the expensive, stable context
Pattern 1 shrinks the context. Pattern 2 makes the part you do keep re-sending cheap. This is the single highest-leverage move for multi-step agents.
Prompt caching lets Claude reuse a previously-processed prefix instead of reprocessing it. You mark a content block with cache_control: {type: "ephemeral"} and the API caches everything up to that point. On the next request with the same prefix, those tokens are served from cache. Cache reads cost roughly 0.1× of the full input price — about a 90% discount on the cached portion. Writes cost ~1.25× (a one-time premium, 5-minute TTL), so by the second request you're already ahead.
The one rule that governs everything: caching is a prefix match. The cache key is the exact bytes from the start of the prompt up to your breakpoint. Any single byte change anywhere in that prefix invalidates the cache for everything after it. Render order is tools → system → messages, so put your stable content first (frozen system prompt, deterministic tool list, prior tool results) and your volatile content (this step's new question, timestamps, per-request IDs) last — after the last breakpoint.
For a multi-step agent the payoff is direct: the giant accumulated history is mostly stable from one step to the next, so it reads from cache at ~0.1× instead of being re-paid at full price every step.
- Put a cache_control breakpoint on the last STABLE block (end of system prompt, or end of the accumulated history you'll reuse)
- Keep volatile content AFTER it — this step's new instruction, anything with a timestamp or per-request ID
- VERIFY it's working: read usage.cache_read_input_tokens on the response. If it's > 0 across repeated requests, the cache is hitting
- If cache_read_input_tokens stays 0 across identical-prefix requests, a silent invalidator is changing your prefix — hunt it (see note)
Copy-paste breakpoint (Python SDK):
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=16000,
system=[{
"type": "text",
"text": LARGE_STABLE_CONTEXT, # frozen prompt / prior tool results
"cache_control": {"type": "ephemeral"},
}],
messages=[{"role": "user", "content": THIS_STEPS_QUESTION}], # volatile, last
)
print(resp.usage.cache_read_input_tokens) # > 0 means it hit
SILENT CACHE INVALIDATORS (each changes the prefix and quietly kills the cache):
• datetime.now() / a UUID interpolated into the system prompt → prefix differs every request
• json.dumps(...) without sort_keys=True → key order varies → bytes differ
• A changing tool set (tools render at position 0, so adding/removing/reordering a tool invalidates the WHOLE cache)
Minimum cacheable prefix is ~1024+ tokens (model-dependent) — shorter prefixes silently won't cache.
05
PATTERN 3 · FAN OUT CORRECTLYPattern 3 — Parallel vs sequential: the cache-timing trap
Once you're caching, parallelism gets a non-obvious gotcha that quietly costs you the entire discount.
Fanning work out is great when the calls are read-only and independent — many agents (Claude Code included) run parallel-safe tools like search and file-reads concurrently. The trap is caching: a cache entry only becomes readable after the first response that wrote it begins streaming. So if you fire N concurrent requests that all share the same big prefix, none of them can read a cache the others are still in the middle of writing — they all race, they all miss, and every one of them pays the full write price. You wanted N cache reads and got N cache writes.
The fix is a tiny bit of sequencing at the front. Send one request first. Await its first streamed token — not the whole response, just the first token, which is the signal that the cache has begun being written. Then fire the remaining N−1. Those now read the cache the first one wrote, at ~0.1×. You keep almost all the parallelism and pay for the shared prefix once.
- Send request #1 alone with your cache_control breakpoint on the shared prefix
- Await ONLY its first streamed token (the cache-write has begun — you don't need the full response)
- Fire the remaining N−1 requests concurrently — they read the warm cache at ~0.1×
- Confirm: requests 2..N should show usage.cache_read_input_tokens > 0; request #1 shows cache_creation_input_tokens (the one write)
Naive fan-out: 8 identical-prefix calls fired at once → 8 cache WRITES (~1.25× each), zero reads. Sequenced fan-out: 1 write + 7 reads (~0.1× each) on the shared prefix. Same parallelism, a fraction of the prefix cost. This timing behavior is documented, not a heuristic.
06
WORKED EXAMPLE (ILLUSTRATIVE)Before / after: the token math, shown honestly
Here's a concrete, illustrative example with the arithmetic exposed so you can reproduce the logic on your own workload. This is NOT a guaranteed result — it's a model to reason with. Real numbers depend on how much of your context is stable, your step count, and how heavy your tool results are. Your numbers will vary.
Setup: a 10-step agent carrying a ~20K-token context (system + tools + accumulated history/tool results). We'll count input tokens processed at full price, in 'full-price-token' units (so a cache read at 0.1× counts as 0.1 units per token). We ignore the small new-instruction tokens per step — they're the same in every scenario and tiny next to the 20K.
Naive (no patterns): every step re-sends ~20K at full price. 10 steps × 20K = ~200K full-price tokens. This is the stateless tax in full.
With prompt caching: step 1 writes the ~20K to cache at ~1.25× = ~25K units. Steps 2–10 read that ~20K from cache at ~0.1× = 9 × 2K = ~18K units. Total ≈ 25K + 18K = ~43K units — roughly a 4–5× reduction versus naive, just from caching the stable prefix. (As later steps add new stable history, you re-cache incrementally; the shape holds.)
Add context editing: if pruning stale tool results shrinks the carried context from ~20K toward, say, ~12K over the run, every per-step number above scales down with it — both the writes and the reads get cheaper because there's simply less to carry. Caching and pruning compound: one makes the context cheap, the other makes it small.
The takeaway isn't the specific figure — it's the direction and magnitude: the stateless tax is real and roughly quadratic, and the cached + pruned path turns most of it into a ~0.1× line item.
| Scenario (10 steps, ~20K context) | Roughly how it's counted | Full-price-token units (illustrative) |
|---|
| Naive — re-send everything full price | 10 × 20K | ~200K |
| + Prompt caching the stable prefix | 1 write (~1.25×) + 9 reads (~0.1×) | ~43K |
| + Context editing (context shrinks ~20K→~12K) | Same shape, smaller base each step | Lower again — scales with the smaller context |
Reproduce it yourself before trusting any number: call client.messages.count_tokens(...) to size your real context, then read usage.cache_read_input_tokens / cache_creation_input_tokens / input_tokens off real responses to see the actual split. The arithmetic above is a reasoning tool, not a benchmark.
07
FACT-CHECK YOURSELFThe honest cost footnotes (read before you quote a number)
Two things get conflated constantly in the community, and getting them wrong makes your cost reasoning worthless. Worth 60 seconds to get straight.
The scary single-run figures are other people's reports, not a law. You'll see community posts about a single prompt costing a fortune, or one run burning hundreds of thousands of tokens. Treat those as community estimates and anecdotes — they depend entirely on that person's context size, step count, and effort settings. They are not a guaranteed outcome, and they're not ours. We cite them only to show the direction the stateless tax pushes costs; the patterns here are how you push back.
The '$200' is a subscription plan, not a per-token price — don't mix them. Claude Max is a consumer subscription (a flat monthly plan, $200/mo tier). API pricing is completely separate and is per token: for Claude Opus that's on the order of $5 per 1M input tokens and $25 per 1M output tokens. If you're building on the API, your bill is driven by tokens — which is exactly why the three patterns above matter — not by the consumer plan price. Conflating 'the $200 plan' with 'API cost' will make every estimate you do nonsense.
So: attribute the pain to the community, frame the big figures as reported-not-guaranteed, reason in per-token API pricing, and verify with usage on real responses. Then the numbers are yours and they're honest.
- Big single-run costs you've seen = community-reported estimates, not guaranteed and not ours — they show direction, not a promise
- $200/mo = Claude Max consumer SUBSCRIPTION (a plan price), distinct from API per-token billing
- API pricing is per token — Opus is ~$5 / $25 per 1M input / output — so token control IS cost control on the API
- Always verify your own split with count_tokens() up front and usage.cache_read_input_tokens / input_tokens on responses
Every pattern in this guide is a real, documented Claude feature — context editing, compaction, prompt caching, and the cache-timing behavior for parallel calls. No invented parameters, no model claims. Where a figure is illustrative it says so.
Get the next drop
New AI build guides + the occasional bonus template. No spam, unsubscribe anytime.
By submitting you agree to our Privacy Policy & Terms. Unsubscribe anytime.
You're in — check your inbox to confirm.
Frequently asked questions
Why does a 10-step agent cost so much more than 10 single calls?
Because the Claude API is stateless — it keeps no memory of your conversation between calls. To take step N, your code re-sends the entire transcript so far: the system prompt, every prior message, and every prior tool result. So step 2 re-pays for step 1, step 10 re-pays for steps 1–9, and the heaviest part (tool results) grows fastest. The cost scales closer to the square of the conversation than linearly. The fixes are to shrink that carried context (context editing + compaction) and to stop paying full price for the stable part of it (prompt caching).
What's the difference between context editing and compaction?
Context editing PRUNES — it clears stale tool results and completed thinking blocks out of the transcript based on configurable thresholds, so you stop re-sending content nothing references anymore. Compaction SUMMARIZES — when you near the context-window limit, the server condenses earlier history into a compaction block. They compose well: pruning makes the context smaller, summarizing handles what's left near the limit. With compaction you MUST append the full response.content (not just the text) back each turn, or you silently lose the compaction block.
How much does prompt caching actually save?
Cache READS cost roughly 0.1× of the full input price — about a 90% discount on the cached portion. The trade-off is the WRITE: ~1.25× for the standard 5-minute TTL. So the first request pays a small premium to write the cache, and from the second identical-prefix request onward you're ahead. For a multi-step agent, where most of the re-sent context is stable from step to step, that means most of your input reads at ~0.1× instead of full price. Verify it with usage.cache_read_input_tokens on the response.
I added cache_control but cache_read_input_tokens is still zero. Why?
A silent invalidator is changing your prefix. Caching is a prefix match — any byte change anywhere before your breakpoint invalidates everything after it. The usual culprits: datetime.now() or a UUID interpolated into the system prompt (prefix differs every request); json.dumps() without sort_keys=True (key order varies, so the bytes differ); or a changing tool set (tools render at position 0, so adding, removing, or reordering one tool invalidates the whole cache). Also check your prefix is long enough — the minimum cacheable prefix is ~1024+ tokens depending on the model, and shorter ones silently won't cache.
Can I just fire all my parallel agent calls at once to save time?
You can for the latency, but if they share a cached prefix you'll lose the cache discount. A cache entry is only readable AFTER the first response that wrote it begins streaming. Fire N identical-prefix calls simultaneously and none of them can read a cache the others are still writing — they all miss and all pay the full write price. The fix: send one request first, await just its first streamed token (the cache-write has begun), then fire the remaining N−1. Those read the warm cache at ~0.1×. You keep the parallelism and pay for the shared prefix once.
Is the '$200 per run' thing real?
Be careful — two different things get conflated. The big single-run cost figures floating around are community-reported estimates: they depend on that person's context size, step count, and settings, and they're not guaranteed or universal. Separately, $200/mo is the price of the Claude Max CONSUMER SUBSCRIPTION — a flat plan — which is NOT the same as API pricing. API billing is per token (Claude Opus is on the order of $5 per 1M input and $25 per 1M output tokens). If you build on the API, your cost is driven by tokens, which is exactly why controlling them matters. Don't mix the subscription price with per-token API cost.
Does Claude Code do any of this for me automatically?
Yes — Claude Code does compaction natively (the 'compacting conversation' step you see is exactly this), and it runs parallel-safe read-only tools concurrently. The patterns in this guide are most directly actionable when you're building your OWN agent on the raw API, where you control the message list, the cache_control breakpoints, and the fan-out sequencing. Even inside Claude Code, understanding the stateless tax helps you keep contexts lean so auto-compaction triggers less often.
What's the single highest-leverage change I can make today?
Add a cache_control: {type: 'ephemeral'} breakpoint on the last STABLE block of your prompt — the end of your frozen system prompt or the accumulated history you reuse each step — and keep this step's new, volatile content after it. Then read usage.cache_read_input_tokens to confirm it's hitting. That one change turns the largest, most-repeated part of your context from a full-price line item into a ~0.1× one, which is where most of the stateless tax lives.