Spec-Driven Development: How to Capture Intent Before You Burn Tokens

Summary

Spec-Driven Development (SDD) is the practice of writing a structured specification as versioned markdown in the repo before an agent writes code, organized into four phases: Specify, Design, Tasks, and Execute.
SDD addresses three agent failure modes directly — the one-shot hero, premature victory, and cross-session amnesia — by making the spec a durable, reviewable, and composable artifact rather than a conversation.
SDD alone does not solve fake done, self-judgment, or accumulated slop; these require additional mechanisms like completion gates, a separate judge agent (LLM-as-Judge), and architectural conventions encoded in the harness.
Toolkits like AWS Kiro and GitHub Spec Kit implement the same four-phase shape under different names, and the more important decision is to pick one and stay consistent.

In February 2025, Andrej Karpathy tweeted a phrase that stuck: “There’s a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He was being half-serious. The tweet has 4.5 million views, and Karpathy later called it “a shower of thoughts throwaway.” The phrase did not stay a throwaway — it became the shorthand for an entire failure mode.

The failure looks like this. A senior engineer prompts an agent: “Build me a checkout flow with Stripe, auth, and email confirmation.” The agent runs for 35 minutes, modifies forty-seven files, and produces 3,200 lines of code.

The engineer opens the diff and finds that some of it works, some does not compile, the Stripe integration uses a deprecated API, the email confirmation logic is wired to a service that does not exist, and the auth flow looks correct but silently bypasses the rate limiter. None of this is caught in code review because there is simply too much of it to read.

The reaction in most teams is to slow the agent down: smaller prompts, closer supervision, a return to era-two pair programming with extra steps. That is not the right move. The right move is to write the spec down before the agent runs, not in the conversation but in the repo, as a markdown file, with enough detail that the agent, the next engineer, the QA gate, and the reviewer are all working from the same source of truth.

That is Spec-Driven Development (SDD), the most useful discipline I have adopted at Cheesecake Labs, and one I have watched cut feature cycle times by half on the right kinds of work.

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper…
— Andrej Karpathy (@karpathy) February 2, 2025

The six ways agents fail at scale

Before I talk about what Spec-Driven Development fixes, I want to be specific about the failure modes it is trying to address. Most teams talk about “agents not being reliable” as if it were one problem. It is six.

The first is the one-shot hero: the agent tries to build the entire feature in a single window, runs out of context budget, starts losing coherence past 200K tokens, and ships something that looks finished but breaks at integration. This is what vibe coding looks like up close.

The second is the premature victory. The agent declares the task done with major pieces still missing — it shipped the happy path, skipped the error handling, and called itself complete. You discover the gaps in QA, or worse, in production.

The third is cross-session amnesia. A new window has zero memory of what came before, so decisions made yesterday have to be rediscovered today, constraints that took thirty minutes to articulate get forgotten on the next prompt, and you pay for the same thinking twice.

The fourth is fake done. The agent ran a curl command, got a 200 response, and considers the integration working. It does not test the unhappy paths, does not verify the contract, and moves on after seeing a successful HTTP code. The bug ships.

The fifth is self-judgment: the same agent that implemented the feature also decides whether it is complete. The executor approves itself — structurally the same conflict of interest as a developer reviewing their own pull request, except the agent has no professional reputation to protect.

The sixth is accumulated slop. Each individual feature compiles, passes tests, and ships, but the architecture drifts and the conventions degrade. After 100 features built this way, the codebase is technically functional and structurally garbage. The agent did not violate any rule explicitly — it just never had one to follow.

These are not theoretical. We see all six at clients, often in the same engagement, and they are not the agent being bad at its job. They are predictable failure modes of unsupervised execution. SDD addresses the first three directly and sets up the conditions for the system to handle the other three.

Read more: The Three Eras of Software: From Autocomplete to Agentic Development

What Spec-Driven Development actually is

Spec-Driven Development (SDD) is the practice of writing a structured specification in the repo, as versioned markdown, before an agent writes a single line of code. It is organized into four phases, with names that vary by toolkit but a shape that stays the same.

Phase one is Specify. You sit with an agent in plan mode and write the spec: not “build a checkout flow,” but a real spec that states the user problem, who is in scope, the acceptance criteria, what is explicitly out of scope, and which edge cases matter. The output is a markdown file, typically spec.md, that the engineer and the agent agree on before anything else happens.
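One way to make "a real spec" checkable is a small script that flags a spec.md missing any of the sections listed above. This is a minimal sketch under assumptions, not part of any toolkit: the section names and the `missing_sections` helper are hypothetical, and it assumes each section appears as a markdown heading.

```python
import re
from typing import List

# Hypothetical section names, taken from the list in the paragraph above.
REQUIRED_SECTIONS = [
    "User problem",
    "Scope",
    "Acceptance criteria",
    "Out of scope",
    "Edge cases",
]


def missing_sections(spec_text: str) -> List[str]:
    """Return the required sections that have no markdown heading in the spec."""
    # Collect heading titles such as "## Acceptance criteria", lowercased.
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", spec_text, flags=re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


draft = """# Checkout flow

## User problem
Shoppers cannot pay without leaving the app.

## Acceptance criteria
- A paid order triggers a confirmation email.
"""

print(missing_sections(draft))  # ['Scope', 'Out of scope', 'Edge cases']
```

A check like this only proves the sections exist; whether their content is good is still a human review.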

Phase two is Design. Given the spec, the agent proposes an architecture, a sequence of operations, the data model, the components, the API contracts, and the key decisions. The output is design.md. The human reviews it, pushes back on what is wrong, and confirms what is right.

Phase three is Tasks. The design gets decomposed into atomic, ordered tasks — each one small enough to fit in a single context window with room to spare, each one with clear acceptance criteria, each one independently testable. The output is tasks.md.

Phase four is Execute. Subagents take tasks off the list, implement them, and update state. Each runs in a new window, reads the spec, the design, and the relevant task, and produces code for that task only. The state of the work lives in tasks.md, not in any agent’s memory.
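To illustrate state living in tasks.md rather than in an agent's memory, here is a minimal sketch. It assumes a checkbox format (`- [ ]` for open, `- [x]` for done) that the article does not prescribe, and the helper names `next_open_task` and `mark_done` are hypothetical.

```python
import re
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Optional

# Assumed format: "- [ ] title" when open, "- [x] title" when done.
TASK_LINE = re.compile(r"^- \[( |x)\] (.+)$")


def next_open_task(tasks_file: Path) -> Optional[str]:
    """Return the title of the first unchecked task, or None when all are done."""
    for line in tasks_file.read_text().splitlines():
        match = TASK_LINE.match(line)
        if match and match.group(1) == " ":
            return match.group(2)
    return None


def mark_done(tasks_file: Path, title: str) -> None:
    """Check off the task with this title by rewriting the file."""
    lines = tasks_file.read_text().splitlines()
    updated = [
        f"- [x] {title}" if line == f"- [ ] {title}" else line for line in lines
    ]
    tasks_file.write_text("\n".join(updated) + "\n")


with TemporaryDirectory() as tmp:
    tasks = Path(tmp) / "tasks.md"
    tasks.write_text(
        "- [x] Create order table\n"
        "- [ ] Add Stripe webhook handler\n"
        "- [ ] Send confirmation email\n"
    )

    first = next_open_task(tasks)   # what a fresh subagent would pick up
    mark_done(tasks, first)
    second = next_open_task(tasks)  # what the next fresh subagent would see
    print(first, "|", second)
```

Nothing is carried between the two lookups except the file itself, which is the point: a new window can resume from disk.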

The same four phases show up under different names across the tooling. AWS Kiro uses Requirements, Design, Tasks (and writes its requirements in EARS notation, “WHEN [condition] THE SYSTEM SHALL [behavior]”). GitHub Spec Kit uses Spec, Plan, Tasks, Implement. What we use at Cheesecake Labs follows Specify, Design, Tasks, Execute and auto-sizes the depth of each phase to the complexity of the work. The differences are real but not strategic. Pick one. Use it.
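The EARS pattern quoted above is regular enough to check mechanically. The sketch below matches only that one WHEN form; EARS defines other forms (for example WHILE and IF/THEN) that it does not cover. The `parse_requirement` helper is hypothetical and is not Kiro's implementation.

```python
import re
from typing import Optional, Tuple

# Matches only "WHEN <condition> THE SYSTEM SHALL <behavior>".
EARS_WHEN = re.compile(
    r"^WHEN\s+(?P<condition>.+?)\s+THE SYSTEM SHALL\s+(?P<behavior>.+)$"
)


def parse_requirement(line: str) -> Optional[Tuple[str, str]]:
    """Split a requirement into (condition, behavior), or return None if it does not match."""
    match = EARS_WHEN.match(line.strip())
    if match is None:
        return None
    return match.group("condition"), match.group("behavior")


good = parse_requirement(
    "WHEN a payment succeeds THE SYSTEM SHALL send a confirmation email"
)
vague = parse_requirement("Build a checkout flow")
print(good)
print(vague)
```

The vague prompt returns `None`, while the structured requirement yields a condition and a behavior that a test can be written against.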

Why writing the spec to a file changes everything

Most engineers I talk to nod at the idea of writing a spec and then carry on doing the work in chat. The “spec” lives in the conversation, the agent’s understanding of the feature lives in the agent’s head, and nothing is durable. The single insight of SDD that I want every engineer to internalize is this: the spec is the artifact, not the conversation.

Three things flip the moment the spec is a file in the repo.

The spec survives the session

The next agent that picks up a task reads the spec. The next engineer that joins the project reads the spec. The QA agent that validates the implementation reads the spec. The pull request reviewer reads the spec. Everyone gets the same source of truth, and you stop paying for the same conversation three times.

The spec is reviewable

A pull request describing five files of code is hard to review; a pull request that says “implements task 4 of docs/specs/checkout.md” is easy to review against the original intent. The reviewer can read what was agreed on, then read the diff, and ask whether one matches the other. The conversation moves up a level, from line-by-line to intent-to-implementation.
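If pull requests follow a fixed phrasing, the spec reference can be extracted automatically and handed to a reviewer or a gate. This sketch assumes the wording from the example above is the team's convention; the pattern and the `spec_reference` helper are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed convention: "implements task <number> of <path ending in .md>".
SPEC_REFERENCE = re.compile(r"implements task (\d+) of (\S+\.md)", re.IGNORECASE)


def spec_reference(pr_description: str) -> Optional[Tuple[int, str]]:
    """Return (task number, spec path) from a PR description, or None if absent."""
    match = SPEC_REFERENCE.search(pr_description)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(spec_reference("Implements task 4 of docs/specs/checkout.md"))
print(spec_reference("Fix stuff"))
```

A PR whose description returns `None` has no stated intent to review against, which is itself a useful signal.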

The spec is composable

You can have one agent specify, another design, a third break down tasks, a fourth implement, and a fifth judge — with each agent’s input and output being a markdown file or code. The whole pipeline is just files, which is how a single engineer can change ninety files across a feature without ever blowing past 50K tokens in any one window.

Sean Grove (OpenAI) made the most provocative version of this argument in his AI Engineer World’s Fair talk last year. His claim: “the code you write represents only 10 to 20% of the value you bring as a programmer; the other 80 to 90% lies in structured communication.” His framing is that the spec is the actual product and code is the compiled output. SDD is the practical implementation of that idea.

Or as the GitHub Spec Kit team puts it in their open-source toolkit, the core mental model is “specifications don’t serve code, code serves specifications.” The hierarchy is the opposite of what most engineers internalized over the last twenty years.

Read more: Agent Skills: Stop Stuffing Workflows Into Your Rules File

The toolkits for SDD

There is no strong reason to prefer one SDD toolkit over another. They all work, and the more important decision is to pick one and stay consistent.

AWS Kiro is Amazon’s spec-driven IDE. It launched in preview on July 15, 2025, hit general availability on November 17, 2025, and the launch team reported 250,000+ developers in the first three months. It is opinionated about the workflow — three files, EARS notation for requirements, structured design with sequence diagrams, trackable tasks — and if you want the spec workflow baked into the IDE without having to think about the harness, Kiro is the simplest path.

GitHub Spec Kit is the open-source alternative: MIT-licensed and compatible with 30+ AI coding agents, including Claude Code, Copilot, and Gemini CLI. The workflow follows Spec, Plan, Tasks, Implement, and the repo includes templates, slash commands, and a methodology document called spec-driven.md that is worth reading even if you do not adopt the toolkit. The advantage is portability: your specs do not live inside a specific IDE.

Where SDD alone falls short

SDD addresses the one-shot hero, premature victory, and cross-session amnesia directly. The spec prevents the agent from trying to build the whole feature in one window. The task breakdown prevents premature victory because each task has acceptance criteria. The markdown files solve cross-session amnesia by persisting decisions outside any agent’s memory.

What SDD does not address on its own are fake done, self-judgment, and accumulated slop — those are not spec problems, they are execution and review problems.

Fake done is fixed by completion gates the agent cannot bypass: lint, typecheck, and tests on every commit; diff coverage that verifies the diff actually addresses the spec; a faithfulness check that compares the implementation back to the design. None of these are part of the spec; they are part of the harness.

Self-judgment is fixed by introducing a separate judge agent. The executor ships the implementation, and a second model — with no context other than the spec and the diff — evaluates whether the implementation meets the acceptance criteria. The executor cannot grade itself. We call this LLM-as-Judge (the original pattern from Zheng et al., 2023), and at Cheesecake Labs we run it on every non-trivial PR.
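The "no context other than the spec and the diff" constraint can be enforced by construction: build the judge's input from exactly those two strings. The sketch below is an assumption about how that might look. The prompt wording, the ACCEPT/REJECT reply format, and both helper names are hypothetical, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    accepted: bool
    reasons: str


def build_judge_prompt(spec: str, diff: str) -> str:
    """Assemble the judge's entire context: the spec and the diff, nothing else."""
    return (
        "You are reviewing a change against its spec.\n"
        "Reply with ACCEPT or REJECT on the first line, then your reasons.\n\n"
        f"SPEC:\n{spec}\n\nDIFF:\n{diff}\n"
    )


def parse_verdict(reply: str) -> Verdict:
    """Turn the judge's reply into a structured verdict."""
    first_line, _, rest = reply.strip().partition("\n")
    # Anything other than an explicit ACCEPT counts as a rejection.
    accepted = first_line.strip().upper() == "ACCEPT"
    return Verdict(accepted=accepted, reasons=rest.strip())


prompt = build_judge_prompt("Declined cards show an error.", "+ handle_decline()")
verdict = parse_verdict("REJECT\nNo test covers the declined-card path.")
print(verdict)
```

Defaulting to rejection on an unparseable reply keeps a malformed answer from counting as approval.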

Accumulated slop is fixed by architectural conventions encoded in the harness: CLAUDE.md files at the project root that capture conventions, skills that encode the right way to do common tasks, and code review agents that enforce them. Without those, even a perfect spec gets executed in ways that drift the architecture.

Kief Morris at Thoughtworks has the cleanest framing for this. In his March 2026 piece “Humans and Agents in Software Engineering Loops”, he distinguishes three roles. “In the loop” is the engineer who reviews every agent output line by line — the bottleneck. “On the loop” is the engineer who designs and maintains the mechanisms that guide and validate agent behavior, building the harness including the spec workflow and the gates.

“Out of the loop” is the case where the harness is mature enough that the agent can run largely autonomously, with the human auditing aggregate outputs. SDD is what gets you from “in the loop” to “on the loop.” The harness is what gets you to “out of the loop” safely. SDD without a harness leaves you stuck with a stack of nicely written specs that still get shipped wrong.

How do I use Spec-Driven Development?

Four moves to put on the table this quarter.

First, pick an SDD toolkit and standardize on it. Whether that is Kiro, Spec Kit, or a set of internal skills, the wrong move is leaving each engineer to invent their own spec workflow. Commit to one shape and use it for everything over a defined threshold — we use “anything that touches more than two files.” The variance you eliminate by standardizing is worth more than the variance you preserve by staying flexible.

Second, commit specs to the repo. Every approved spec lives in docs/specs/<feature>/ with spec.md, design.md, and tasks.md. Pull requests reference the spec they implement, and the spec is reviewed in code review like code is. This single move makes the entire team’s planning legible to itself and to future hires.
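The directory layout above is simple enough to scaffold with a few lines, which helps keep every feature's specs in the same place. This is a minimal sketch: the `scaffold_spec` helper and the placeholder heading it writes are hypothetical, and it runs against a temporary directory rather than a real repo.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

SPEC_FILES = ("spec.md", "design.md", "tasks.md")


def scaffold_spec(repo_root: Path, feature: str) -> Path:
    """Create docs/specs/<feature>/ with a placeholder for each of the three files."""
    spec_dir = repo_root / "docs" / "specs" / feature
    spec_dir.mkdir(parents=True, exist_ok=True)
    for name in SPEC_FILES:
        path = spec_dir / name
        if not path.exists():  # never overwrite an existing spec
            path.write_text(f"# {feature}: {name}\n")
    return spec_dir


with TemporaryDirectory() as tmp:
    created = scaffold_spec(Path(tmp), "checkout")
    names = sorted(p.name for p in created.iterdir())
    print(names)  # ['design.md', 'spec.md', 'tasks.md']
```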

Third, add the completion gates in order of cost. Lint, typecheck, and tests on every PR are table stakes. Then diff coverage that verifies the implementation touched the right files, a faithfulness check comparing the implementation to the design, and a judge agent that evaluates against the acceptance criteria in the spec. Cheap filters run first; the judge agent runs only on PRs that pass them.
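The "cheap filters first" ordering amounts to a short-circuiting pipeline. In the sketch below the gates are stand-in lambdas and the `run_gates` helper is hypothetical; real checks would invoke the project's own lint, typecheck, and test tools.

```python
from typing import Callable, List, Tuple

# Each gate is a (name, check) pair; check returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in the given order and stop at the first failure."""
    ran: List[str] = []
    for name, check in gates:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran


# Ordered from cheapest to most expensive, as described above.
gates: List[Gate] = [
    ("lint", lambda: True),
    ("typecheck", lambda: True),
    ("tests", lambda: False),        # simulate a failing test suite
    ("diff coverage", lambda: True),
    ("faithfulness", lambda: True),
    ("judge agent", lambda: True),   # most expensive, so it runs last
]

passed, ran = run_gates(gates)
print(passed, ran)  # False ['lint', 'typecheck', 'tests']
```

Because the failing test suite stops the run, the judge agent is never invoked and costs nothing on this PR.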

Last, separate the executor from the judge. No agent grades its own work: the agent that implemented the feature does not get to declare it done. A separate agent, ideally running a different model (we run Opus 4.7 as judge over implementations done with Sonnet 4.6), reads the spec and the diff and produces an explicit accept or reject with reasons. The executor sees the rejection and retries with feedback. This is the single highest-leverage move you can make on output quality.
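The reject-and-retry cycle can be sketched as a small loop. Both agents are plain callables here so the control flow can be read without any model API. The `implement_with_judge` function, the attempt limit, and the toy stand-ins are all assumptions for illustration.

```python
from typing import Callable, Optional, Tuple

Executor = Callable[[str, Optional[str]], str]   # (spec, feedback) -> diff
Judge = Callable[[str, str], Tuple[bool, str]]   # (spec, diff) -> (accepted, reasons)


def implement_with_judge(
    spec: str, executor: Executor, judge: Judge, max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Retry the executor with the judge's feedback until accepted or out of attempts."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diff = executor(spec, feedback)
        accepted, reasons = judge(spec, diff)  # the judge sees only spec and diff
        if accepted:
            return diff, attempt
        feedback = reasons
    return None, max_attempts  # hand back to a human after repeated rejections


# Toy stand-ins: the executor skips error handling until it is told to add it.
def toy_executor(spec: str, feedback: Optional[str]) -> str:
    return "happy path + error handling" if feedback else "happy path"


def toy_judge(spec: str, diff: str) -> Tuple[bool, str]:
    if "error handling" in diff:
        return True, ""
    return False, "Error handling for declined cards is missing."


result = implement_with_judge("checkout spec", toy_executor, toy_judge)
print(result)  # ('happy path + error handling', 2)
```

Capping the attempts matters: an executor that cannot satisfy the judge should escalate to a person rather than loop indefinitely.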

The order matters: get the spec workflow in place first, because the gates and the judge both depend on having a spec to validate against. Without the spec, the gates have nothing to check.

Closing thought

The agents available in 2026 are good. The bottleneck is the surrounding system: how intent gets captured, how completion gets verified, and how architecture stays coherent over hundreds of features. Specs are the cheapest leverage in that system.

They cost minutes to write, they survive sessions, and they make every other gate possible. If your team has not made the move from “spec in the chat” to “spec in the repo,” that is the next thirty days of work. Then comes the harness.

FAQ

What is Spec-Driven Development (SDD)?

Spec-Driven Development is the practice of writing a structured specification in the repo, as versioned markdown, before an agent writes a single line of code. It is organized into four phases: Specify, Design, Tasks, and Execute, with each phase producing a markdown file (such as spec.md, design.md, and tasks.md) that serves as the shared source of truth for engineers, agents, QA, and reviewers.

What is 'vibe coding' and why is it a problem?

Vibe coding is a term coined by Andrej Karpathy in a February 2025 tweet, describing a style where you 'fully give in to the vibes, embrace exponentials, and forget that the code even exists.' It becomes a failure mode when an agent produces large volumes of code (for example, 3,200 lines across 47 files) that cannot be properly reviewed, leading to issues like deprecated APIs, broken integrations, and silently bypassed logic.

What are the six ways agents fail at scale?

The six failure modes are: (1) the one-shot hero, where the agent tries to build everything in one window and loses coherence; (2) premature victory, where the agent declares completion with major pieces missing; (3) cross-session amnesia, where new windows have no memory of prior decisions; (4) fake done, where the agent treats a 200 response as a working integration; (5) self-judgment, where the same agent that implemented the feature decides whether it is complete; and (6) accumulated slop, where individual features ship but architecture and conventions degrade over time.

Which failure modes does SDD address, and which require additional measures?

SDD directly addresses the one-shot hero, premature victory, and cross-session amnesia. It does not on its own solve fake done, self-judgment, or accumulated slop. Those require completion gates (lint, typecheck, tests, diff coverage, faithfulness checks), a separate judge agent (LLM-as-Judge), and architectural conventions encoded in the harness (such as CLAUDE.md files, skills, and code review agents).

What SDD toolkits are available?

Two toolkits mentioned are AWS Kiro, Amazon's spec-driven IDE that launched in preview on July 15, 2025 and reached general availability on November 17, 2025, using Requirements, Design, and Tasks with EARS notation; and GitHub Spec Kit, an MIT-licensed open-source alternative compatible with 30+ AI coding agents, including Claude Code, Copilot, and Gemini CLI, following a Spec, Plan, Tasks, Implement workflow.

About the author

Douglas da Silva

Douglas started as a Senior Full-Stack Developer at Cheesecake Labs and is now Partner and CTO at the company.