Plan Mode is the Deal-Breaker: Why Direct-Mode Coding with Agents Wastes Tokens

Listen to this article

Summary
  • Plan mode in Claude Code (toggled via Shift+Tab on macOS or Alt+M on Windows) prevents file edits and instead produces a markdown plan covering the approach, files to change, tradeoffs, and open questions before any code is written.
  • Direct mode wastes tokens through misunderstood intent, context window degradation past 200K tokens (where Anthropic charged premium pricing and MRCR v2 accuracy drops from 93% at 256K to 76% at 1M), and lack of durable reasoning artifacts.
  • The Research, Plan, Implement pattern from Boris Cherny separates work into three phases with fresh context windows, producing spec.md, design.md, and tasks.md files that survive sessions and enable parallel agents.
  • Recommended practices: make plan mode the default for changes touching more than two files, commit approved plans to the repo as versioned artifacts, and measure cost per feature shipped rather than tokens per session.

A few weeks ago I sat with one of our senior engineers while he debugged a project that had gone sideways. The team was shipping a feature with Claude Code. The feature had been “almost done” for three days, and the agent kept producing code that compiled, passed lint, and did not do what the PRD asked for. They had burned through more tokens in that feature than the previous five features combined.

I asked him to show me his session. He had been working in direct mode the entire time — typing a question, reading the answer, watching the agent edit files, realizing the agent had drifted from the actual intent, and course correcting, over and over.

We turned plan mode on, gave the agent the original PRD, and asked it to produce a plan instead of code. The plan came back in eight minutes. It had two real misunderstandings and one legitimate ambiguity in the PRD that no one had caught. We fixed all three in the plan, in markdown, before a single line of code was written. The feature shipped the next morning.

The lesson is not that plan mode is magic. It is that direct mode taxes you in ways most engineering teams have not yet noticed in their token bill. Plan mode is the cheapest correction you can make.

What Plan Mode from Claude actually is

The Plan Mode is a built-in operating mode in Claude Code. You toggle into it by pressing Shift+Tab on macOS until you see “plan mode” in the status bar (Shift+Tab cycles through default, acceptEdits, auto, and plan). On Windows, the equivalent is Alt+M.

In plan mode, the agent does not edit files. It reads them, searches them, and calls tools that gather information. What it produces at the end is a plan: usually a markdown document that lays out the problem, the proposed approach, the files it intends to change, the tradeoffs it sees, and any open questions.

You read the plan, edit it, push back on it, or send it back for revision. When you are satisfied, you take the agent out of plan mode and let it implement. That is the whole mechanic — two keystrokes and a habit. The reason it matters is not the mechanic itself, but what the mechanic forces.

Read more: The Three Eras of Software: From Autocomplete to Agentic Development

Why direct mode burns tokens on Claude

Three things go wrong in direct mode that plan mode catches before they cost real money.

The first is misunderstood intent. You write a prompt, the agent infers what you mean, makes a decision, and starts editing. By the time you realize the inference was wrong, the agent has touched fifteen files and you are reading a diff.

You revert, you reprompt, the agent re-reads the files, and you pay the token cost again. If the misunderstanding is structural, you pay it a third time. Plan mode catches this early: the agent writes “I intend to add a new column to the users table and backfill it with…” and you read that sentence and redirect before a single file has been touched. You spent 200 tokens correcting the plan instead of 50,000 reverting an implementation.

The second is context window degradation, which is less obvious and larger in impact. Long-context LLMs do not maintain uniform attention across the whole window. Stanford’s Lost in the Middle paper demonstrated that retrieval accuracy drops 20 to 30 percentage points when the relevant document sits in the middle of a 20-document context, compared to start or end, and the effect compounds in longer contexts.

Anthropic’s own documentation talks about “context rot,” and through early 2026 they charged a 2x input premium and a 1.5x output premium for any request above 200K tokens — their quietest possible admission that quality degrades past that boundary. Their published MRCR v2 benchmark supports it: 93% accuracy at 256K tokens, 76% at 1M tokens.

Direct mode pushes you toward that ceiling because each redo dumps more context into the window, and past 200K you are paying premium prices for degraded performance without being able to see it happening unless you are watching the token counter.

The third is that the agent never writes down its decisions. In direct mode, the only artifact of the agent’s reasoning is the code. The reasoning lives inside the conversation, and the conversation is ephemeral. The next session, the next engineer, the next subagent has to redo the reasoning from scratch, paying for the same thinking twice. Plan mode produces a file — durable, portable, paid for once.

The Research, Plan, Implement pattern

The strongest practitioner formulation of this comes from Boris Cherny, who created Claude Code. He describes a three-phase pattern that has become my default at Cheesecake Labs.

Phase one is research. A dirty window where the agent reads files, runs greps, calls MCPs, searches the web, and looks at the production database if it is connected. No implementation — the session’s only job is to come out the other side with a complete picture of the problem.

Phase two is plan. Still in the same window, the agent writes a plan, a design, and a task breakdown — three markdown files in the repo: spec.md, design.md, tasks.md. The human reviews each, and important decisions get written down rather than held in the agent’s head.

Phase three is implement. New window, clean context. The agent receives the markdown files as input and does nothing else — no research, no design, just executing the tasks one by one, updating state in tasks.md, and shipping.

The reason this works is the same reason a senior engineer hands a junior a written PRD instead of explaining it verbally five times. Writing the plan compresses the thinking into a portable artifact that does not blow up the next context window, survives the session, and can be picked up by another agent, another engineer, or yourself a week later with no loss.

It also decouples the cost of thinking from the cost of doing. Research is expensive in tokens because the agent has to read a lot; implementation is expensive because the agent has to write a lot. Running both in the same window means paying both costs against the same context budget, so that by the time the agent is implementing, the research has been sitting in the window for an hour. Separate windows keep each phase within the model’s reliable range, maintaining quality without changing the absolute cost.

Read more: Agent Skills: Stop Stuffing Workflows Into Your Rules File

A worked example. What direct mode costs in dollars

I want to put numbers on this so the cost is not abstract.

Take Sonnet 4.6 at current May 2026 pricing: $3 per million input tokens, $15 per million output tokens. Consider a medium feature — a two-page PRD, a mid-sized codebase, implementation touching roughly fifteen files.

Direct-mode workflow, in practice: the agent reads ten files to orient (around 15K input tokens), generates a first implementation with two errors (10K input, 8K output), gets redirected twice (12K then 14K input, 6K then 5K output), debugs a test failure (18K input, 4K output), and closes with minor cleanup (10K input, 2K output). Total: around 79K input, 25K output, or roughly $0.62 in API cost — plus half a day of engineer time spent correcting course.

Plan-mode workflow on the same feature: plan mode session with agent reading twelve files and producing plan, design, and tasks (20K input, 8K output); human review with two corrections written directly into markdown at zero token cost; new window implementing against the markdown files (30K input, 12K output); tests pass on first run, minor cleanup (5K input, 2K output). Total: around 55K input, 22K output, or about $0.49.

The savings on a single feature are not the point. The point is what those numbers look like at scale — across a team of fifteen engineers shipping forty features a month, weighted by the share of features large enough that direct mode would have required three or four full reverts rather than two, and compounded by the rework that lives in the next session when a fresh agent has no record of the decisions made in the last one.

Across our client engagements at Cheesecake Labs, plan mode consistently runs 20 to 35% cheaper in raw tokens than direct mode. That is the floor. On genuinely complex features where direct mode would have required three or four full reverts, we have seen the cost drop from $40 in tokens to $8 by switching the workflow. We no longer measure tokens per session. We measure cost per feature shipped — the number that actually ties to business outcomes.

Read more: Your AI Strategy Has a Data Problem

The handoff. Why fresh windows are not optional

The single move that surprised me most when I switched workflows was how much cheaper everything got when I stopped trying to keep one context window alive across a whole feature.

Long sessions are seductive because the agent remembers what it just did and you do not have to re-explain, creating an illusion of continuity that feels like productivity. That illusion is the problem. By the time the window is at 300K tokens, the agent’s attention to the early parts of the conversation is dropping.

By 600K, it is making decisions based on the prompt from ninety minutes ago, refracted through everything that happened since. Past 1M, you are at the MRCR v2 number of 76%: one in four multi-needle retrievals fails, and the agent silently makes the wrong choice without you being able to see it.

The fix is to write down the durable parts and start fresh — plan in window A, save the plan to spec.md and tasks.md, open window B, hand it the markdown files, and implement. This pattern extends further: one agent does research and writes findings to research.md, a second reads the research and produces the plan, a third reads the plan and implements. Each agent’s job is small, each context is clean, each artifact is durable.

This is the pattern that lets a single engineer change ninety files without ever blowing past 50K tokens in any single window. It also enables parallelism: once plans live in files, multiple agents can pick up multiple tasks simultaneously in separate worktrees, which is exactly how Boris Cherny describes running five agents in parallel on a single feature.

What I’d actually do with Plan Mode

Three moves to put in front of your team this week. First, make plan mode the default for anything non-trivial. A trivial change is renaming a variable or adding a unit test for an existing function. If the change touches more than two files or modifies a contract, the engineer goes into plan mode first. Make it a team rule, enforce it for thirty days, and watch the rework rate drop in the first week.

Second, save plans to the repo, not the chat. Every approved plan gets committed as docs/specs/<feature>.md or similar. The plan becomes a versioned artifact that future agents, future engineers, and code reviewers can read during PR review, the single highest-leverage move you can make to compound your agent investment.

Third, measure cost per feature, not tokens per session. Tokens per session is a vanity metric. Cost per feature shipped is the number that ties to business outcomes. If your team cannot answer “what did this feature cost in API tokens, end to end, including all rework,” you are flying blind — and you cannot tell the CFO why next quarter’s API budget should move in either direction.

Closing thought

Plan mode is two keystrokes. What makes it consequential is not the feature itself, but what it forces: writing intent down before spending tokens on code, decoupling thinking from doing, and producing durable artifacts that survive the session.

Direct mode is era two with a better autocomplete. Plan mode is the lowest-cost entry point to era three, and if your team has not made the switch yet, that is the next thirty days of work.

FAQ

What is Plan Mode in Claude Code and how do you activate it?

Plan Mode is a built-in operating mode in Claude Code where the agent does not edit files — it reads them, searches them, and calls tools to gather information, producing a markdown plan that lays out the problem, proposed approach, files to change, tradeoffs, and open questions. You toggle into it by pressing Shift+Tab on macOS until 'plan mode' appears in the status bar (cycling through default, acceptEdits, auto, and plan). On Windows, the equivalent is Alt+M.

Why does direct mode burn more tokens than plan mode?

Three things go wrong in direct mode: (1) misunderstood intent — the agent edits many files before you realize the inference was wrong, forcing reverts and re-reads; (2) context window degradation — long contexts lose attention quality, with Anthropic's MRCR v2 benchmark showing 93% accuracy at 256K tokens dropping to 76% at 1M tokens, plus premium pricing past 200K tokens; and (3) the agent never writes down its decisions, so reasoning lives only in the ephemeral conversation and must be redone in the next session.

What is the Research, Plan, Implement pattern?

It is a three-phase pattern described by Boris Cherny, creator of Claude Code. Phase one (research): the agent reads files, runs greps, calls MCPs, and gathers a complete picture of the problem. Phase two (plan): still in the same window, the agent writes spec.md, design.md, and tasks.md, which the human reviews. Phase three (implement): a new window with clean context receives the markdown files and executes tasks one by one, updating tasks.md.

What are the measured cost differences between direct mode and plan mode?

Using Sonnet 4.6 pricing ($3 per million input tokens, $15 per million output tokens) on a medium feature, direct mode totals around 79K input and 25K output tokens (~$0.62), while plan mode totals around 55K input and 22K output tokens (~$0.49). Across client engagements, plan mode consistently runs 20 to 35% cheaper in raw tokens. On genuinely complex features that would require three or four reverts in direct mode, costs have dropped from $40 to $8.

What three actions are recommended to adopt plan mode on a team?

First, make plan mode the default for anything non-trivial — if a change touches more than two files or modifies a contract, use plan mode first. Second, save approved plans to the repo (e.g., docs/specs/.md) as versioned artifacts, not in the chat. Third, measure cost per feature shipped end to end, including rework, rather than tokens per session.

About the author.

Douglas da Silva
Douglas da Silva

Douglas started as a Senior FullStack Developer at Cheesecake Labs and currently he's Partner and CTO at the company.