Harness Engineering: Why “Done” Isn’t the Agent Saying So

Summary
  • The post defines an agent 'harness' as having two halves: guides (feed-forward controls like CLAUDE.md, PR templates, and specs that steer the agent before it acts) and sensors (feedback controls like linters, tests, review agents, and judge models that detect errors after).
  • It presents Kief Morris's maturity ladder—'in the loop,' 'on the loop,' and 'out of the loop'—arguing most teams remain stuck reviewing every diff by hand, and that moving to 'on the loop' (building the harness instead of inspecting outputs) is the key shift for senior engineers.
  • It describes a five-gate completion cascade (structural, file integrity, sufficiency, faithfulness, and judge LLM) ordered from cheap to expensive, and argues the executor agent must never grade itself—requiring a separate judge model, with Sonnet 4.6 used for implementation and Opus 4.7 for judging.
  • It advocates a flywheel where every rejected PR generates a classified record, reviewed in a weekly 30-minute retro to add new guides, skills, or gates, so the harness compounds and prevents repeat failures.

In November 2025, Justin Young at Anthropic published a post on what they had learned running long-running coding agents. One observation in that post stuck with me. He described a failure mode where Claude, after some progress on a project, would look at the state of the work, see that things had been built, and declare the job done. The verb was “declare.” The agent ran a curl, got a 200 back, called the integration finished, and moved on.

Five months later, Birgitta Böckeler at Thoughtworks published the cleanest writeup I have read of what we should build around the model to stop that from happening. She called the layer the harness, and she split it into two halves: guides (feed-forward controls that anticipate the agent’s behavior and steer it before it acts) and sensors (feedback controls that observe the result and help it self-correct).

Two months before Böckeler’s piece, Kief Morris had published the loop framework that gave the layer its strategic shape.

What the harness actually is

If you say “harness” to most engineers in 2026, they will nod and tell you they are doing it. Then you ask what they mean and you get a list of three things: linters, tests, CI. That is not a harness. That is a small slice of the sensor layer with no guides and no judgment.

Böckeler’s framing is the one I have adopted internally at Cheesecake Labs. The harness has two halves.

Guides are feed-forward

They shape what the agent does before it acts. The CLAUDE.md file at the project root, the PR template the agent reads before opening a pull request, the spec it must implement against, the skill that encodes how this team does database migrations, the architectural conventions written as plain English that the agent reads on every session.

Guides are cheap and they compound — and they are also the part of the harness most teams under-invest in, because none of it ships a feature on its own.

Sensors are feedback

The sensors observe what the agent did and tell it whether it worked. Linters and type checkers, yes, but also test suites that actually run, separate review agents that read the diff, faithfulness checks that compare the implementation back to the spec, hooks on commit and on PR open, and the judge model that grades the work against the acceptance criteria. Sensors are how the agent learns it was wrong.

Böckeler’s sharpest point is that you need both. Sensors alone leave you with an agent that keeps making the same mistake because nothing told it the right rule upstream. Guides alone leave you with an agent that follows the rules but never finds out whether they produced the right outcome. A real harness is the closed loop between the two.


The maturity ladder. Where most teams sit.

Morris’s in / on / out of the loop framing is the most useful diagnostic tool I have for talking to engineering leaders right now. I ask them where their team sits and the answer tells me what to invest in next.

In the loop

The engineer reviews every line the agent produces. They are the gatekeeper on the innermost loop, where code gets generated. Morris’s words: “the challenge when we insist on being too closely involved in the process is that we become a bottleneck.”

Most teams I see live here: they use Claude Code, they ship features, and their senior engineers spend half their day reviewing AI-generated diffs by hand. The agent went faster and the reviewer became the constraint.

On the loop

The engineer designs and maintains the mechanisms that produce and validate the agent’s work. Morris again: “Rather than personally inspecting what the agents produce, we can make them better at producing it.”

This is where harness engineering becomes a real category of work. You stop fixing individual bad PRs and start fixing the system that produced them. The senior engineer’s job shifts from line reviewer to harness builder.

Out of the loop

The harness is mature enough that the agent runs largely autonomously and the human audits aggregate outputs. Morris calls this the natural home for what people loosely term “vibe coding,” but only when the harness is strong enough to keep vibe-coded output safe. Without that harness, “out of the loop” is just shipping bugs faster.

Flywheel

There is a fourth rung Böckeler implies but does not name: the harness improving from its own outputs. Failed gates become CLAUDE.md updates, rejected PRs become new tests, and the harness compounds. This is where the leverage is.

The leap from in to on is the single biggest career move of the next two years for senior engineers, and the single biggest architecture move for engineering leaders. The next leap, from on to out, requires the harness to actually be good.

Completion gates. The cascade.

This is the part of the harness most teams have not built and most need. The premise is that “done” is not the agent saying so — it is the system proving it. A task moves through a cascade of checks, cheap filters first, before it is accepted, with each gate catching a specific failure mode and returning a classified reason and an actionable fix when it fails.

The cascade I run at Cheesecake Labs has five gates. None of these are individually novel. The order and the framing are mine, built on top of Böckeler’s guides-and-sensors model and Anthropic’s failure-mode taxonomy.

Gate 1: Structural

Lint, typecheck, unit tests. Cheap, deterministic. Catches what Anthropic calls the “marks complete without verification” failure: the agent ran a curl, got a 200, declared the integration working. In my experience, roughly 30 to 50% of first-shot agent PRs fail this gate. That is a healthy signal. It means the gate is doing its job.
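As a rough illustration of this gate, here is a minimal sketch that runs a list of checks and reports the first one that fails. The function name and the tuple layout are hypothetical, and the example command set assumes ruff, mypy, and pytest are the tools a project uses; substitute whatever the repository already runs in CI.

```python
import subprocess
import sys


def run_structural_gate(commands):
    """Run each (name, argv) check in order.

    Returns (True, None) if every command exits with status 0,
    otherwise (False, name_of_first_failing_check).
    """
    for name, argv in commands:
        completed = subprocess.run(argv, capture_output=True, text=True)
        if completed.returncode != 0:
            return False, name
    return True, None


# Hypothetical command set. It assumes ruff, mypy, and pytest are installed;
# none of these choices come from the original post.
EXAMPLE_COMMANDS = [
    ("lint", ["ruff", "check", "."]),
    ("typecheck", ["mypy", "."]),
    ("unit_tests", ["pytest", "-q"]),
]

if __name__ == "__main__":
    passed, failed_check = run_structural_gate(EXAMPLE_COMMANDS)
    print("structural gate passed" if passed else f"structural gate failed at: {failed_check}")
    sys.exit(0 if passed else 1)
```

Returning the name of the failing check gives the agent a classified reason to act on, rather than a bare non-zero exit code.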

Gate 2: File integrity

Critical files untouched. Did the agent silently delete tests to make them pass? Rewrite an API contract instead of conforming to it? Modify a config file outside the change scope? These are the failures that ship as silent regressions. A simple allow-list on which files a task is permitted to touch catches most of them. The fix takes one line of YAML. Not enough teams have it.
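One way the allow-list check could look, as a sketch rather than the implementation used at Cheesecake Labs. The function name, the pattern lists, and the two classification labels are all hypothetical; in practice the patterns would be loaded from the task's configuration and the changed files from the diff.

```python
from fnmatch import fnmatchcase


def check_file_integrity(changed_files, allowed, protected=()):
    """Return a list of (path, classification) violations.

    A file is a violation if it matches a protected pattern, or if it
    matches none of the allowed patterns. Note that fnmatch's "*" also
    matches "/" so "src/billing/*" covers nested directories too.
    """
    violations = []
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in protected):
            violations.append((path, "protected_file_modified"))
        elif not any(fnmatchcase(path, pattern) for pattern in allowed):
            violations.append((path, "outside_allow_list"))
    return violations


# Hypothetical task scope: the agent may touch billing code and the api
# directory, but the API contract file itself is protected.
changed = ["src/billing/invoice.py", "api/openapi.yaml", "config/settings.yaml"]
violations = check_file_integrity(
    changed,
    allowed=["src/billing/*", "api/*"],
    protected=["api/openapi.yaml"],
)
for path, classification in violations:
    print(f"{classification}: {path}")
```

Checking protected patterns before allowed ones means a contract or test file stays off-limits even when a broad allow pattern would otherwise cover it.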

Gate 3: Sufficiency

Does the diff actually cover the scope of the spec? An agent that ships half the feature and declares the rest “follow-up work” is the most common failure mode I see in plan mode workflows. The fix is mechanical: every task in tasks.md has acceptance criteria, every PR maps to one task, and the gate verifies the diff touched what the task said it would.
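The mechanical check can be sketched as follows. The post does not specify the format of tasks.md, so this assumes each task declares a list of expected path patterns; that field and the function name are hypothetical.

```python
from fnmatch import fnmatchcase


def missing_scope(expected_patterns, changed_files):
    """Return the expected patterns that no changed file satisfies.

    An empty result means the diff touched every area the task declared.
    It does not prove the change is correct, only that nothing the task
    promised was skipped entirely.
    """
    return [
        pattern
        for pattern in expected_patterns
        if not any(fnmatchcase(path, pattern) for path in changed_files)
    ]


# Hypothetical task: an export feature that must ship with tests.
expected = ["src/export/*", "tests/export/*"]
changed = ["src/export/csv.py"]
gaps = missing_scope(expected, changed)
if gaps:
    print("sufficiency gate failed, untouched scope:", gaps)
```

Here the agent implemented the feature but touched nothing under tests/export, which is exactly the "follow-up work" pattern the gate is meant to surface.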

Gate 4: Faithfulness

Does the implementation actually do what the design said it would do? This is the gate most teams skip. The mechanic is borrowed from RAG evaluation tools like RAGAS and adapted: compute a semantic similarity between the diff and the design markdown. Cheap, embedding-based, runs in seconds. It is a filter, not a verdict. If the similarity is below a threshold, the PR fails before it ever pays for a judge model.
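To make the filter concrete, here is a minimal sketch of a thresholded cosine-similarity check. Two loud assumptions: the embed function below is a stand-in that uses word counts, not a real embedding model, so that the example runs without external dependencies; and the 0.3 threshold is an arbitrary placeholder, not a value from the post. A real gate would swap in an embedding model and tune the threshold on past PRs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts.

    This is NOT a semantic embedding. Replace it with a call to a real
    embedding model; only the vector type would change.
    """
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def faithfulness_gate(design_text, diff_text, threshold=0.3):
    """Return (passed, score). The threshold is a hypothetical default."""
    score = cosine_similarity(embed(design_text), embed(diff_text))
    return score >= threshold, score


design = "retry failed webhook deliveries with exponential backoff"
unrelated_diff = "rename css color variables"
passed, score = faithfulness_gate(design, unrelated_diff)
print(f"passed={passed} score={score:.2f}")
```

Returning the score alongside the pass/fail flag keeps the gate a filter rather than a verdict: borderline scores can be logged and used later to adjust the threshold.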

Gate 5: Judge LLM

A separate model, ideally a different one, reads the spec, the design, the tests, and the diff, and produces an explicit accept or reject with reasons. Zheng et al. (2023) showed strong LLM judges agree with humans more than 80% of the time, “the same level of agreement between humans.”

The Agent-as-a-Judge work from Meta in 2024 extended this specifically to coding agents and found it “dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline.” The judge runs only on PRs that already passed gates 1 through 4, so it is the most expensive gate, but the one you pay for least often.

The order matters because the judge is expensive in tokens and latency — you only pay for it on PRs that have already cleared the structural, integrity, sufficiency, and faithfulness checks. By the time the judge runs, it is evaluating work that is plausibly correct, not work that already failed lint.
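The cheap-first ordering can be sketched as a short-circuiting runner. Everything named here (GateResult, run_cascade, the two toy gates, the task dictionary keys) is hypothetical and only illustrates the shape: each gate returns a classified reason and an actionable fix, and the first failure stops the cascade before a more expensive gate is paid for.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: Optional[str] = None  # classified failure reason
    fix: Optional[str] = None     # actionable hint for the agent


Gate = Callable[[dict], GateResult]


def run_cascade(task: dict, gates: list[Gate]) -> list[GateResult]:
    """Run gates in the given order and stop at the first failure."""
    results = []
    for gate in gates:
        result = gate(task)
        results.append(result)
        if not result.passed:
            break
    return results


def structural_gate(task: dict) -> GateResult:
    # Toy check: a real gate would run the linter, type checker, and tests.
    if task.get("lint_errors", 0) > 0:
        return GateResult(
            "structural", False, "lint_error", "Run the linter and fix the reported errors."
        )
    return GateResult("structural", True)


def judge_gate(task: dict) -> GateResult:
    # Placeholder for the expensive model call; always accepts in this sketch.
    return GateResult("judge", True)


failed_run = run_cascade({"lint_errors": 2}, [structural_gate, judge_gate])
for result in failed_run:
    print(result.gate, result.passed, result.reason)
```

In the failing run the judge gate is never invoked, which is the cost argument in code: list the gates from cheapest to most expensive and let the loop do the rest.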


LLM-as-Judge. Never let the executor grade itself.

I want to underline the structural argument here because it is the move most teams resist most stubbornly.

The agent that implemented the feature does not get to decide whether it is done. Same conflict of interest as a developer reviewing their own PR. The executor agent has an incentive (built into how it was prompted) to converge on “complete.” If you let it grade itself, it will grade itself generously. The only fix is a separate evaluator.

In practice this means two agents, ideally two different models. At Cheesecake Labs we run Sonnet 4.6 for implementation and Opus 4.7 for judging. The judge sees only the spec, the design, the tests, and the diff. It does not see the executor’s reasoning. It does not see the chat history. It produces a structured verdict: accept, reject (with classified reasons), or request clarification (with specific questions).
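A sketch of the two structural pieces described above: the restricted input the judge receives and the shape of its verdict. The class and function names are hypothetical, and the actual model call is deliberately left out, since it depends on the provider SDK in use.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    CLARIFY = "request_clarification"


@dataclass
class Verdict:
    decision: Decision
    reasons: list[str] = field(default_factory=list)    # classified reasons for a reject
    questions: list[str] = field(default_factory=list)  # specific questions for a clarification

    def validate(self) -> None:
        """Reject verdicts need reasons; clarification requests need questions."""
        if self.decision is Decision.REJECT and not self.reasons:
            raise ValueError("a reject verdict needs at least one classified reason")
        if self.decision is Decision.CLARIFY and not self.questions:
            raise ValueError("a clarification request needs at least one question")


# The only artifacts the judge is allowed to see.
JUDGE_FIELDS = ("spec", "design", "tests", "diff")


def build_judge_input(artifacts: dict) -> dict:
    """Keep only the judge-visible fields and drop everything else,
    including the executor's reasoning and chat history."""
    missing = [name for name in JUDGE_FIELDS if name not in artifacts]
    if missing:
        raise ValueError(f"missing judge inputs: {missing}")
    return {name: artifacts[name] for name in JUDGE_FIELDS}


artifacts = {
    "spec": "...", "design": "...", "tests": "...", "diff": "...",
    "executor_reasoning": "...", "chat_history": "...",
}
judge_input = build_judge_input(artifacts)
print(sorted(judge_input))
```

Building the judge input from an explicit field list, rather than removing known-bad keys, means any new artifact the executor produces stays hidden from the judge by default.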

The rejection rate on first pass is between 15 and 25% depending on the team and the spec quality. That is not a sign that the executor is bad. It is a sign that the judge is doing its job. Without the judge, that 15 to 25% of work was shipping as “done” and getting caught later, either in QA, in the next sprint, or in production.

The 2025 DORA report, State of AI-assisted Software Development, puts the wider point most directly: “AI doesn’t fix a team. It amplifies what’s already there.” If your “done” definition was already loose, AI ships more loose-definition work faster. The judge tightens the definition.

Where the harness learns

The fourth rung, the flywheel, is where this work compounds. Most teams never get there because they treat each rejected PR as a one-off. The PR gets fixed, merged, and forgotten.

The pattern that gets you to the flywheel is mechanical and unglamorous. Every rejected PR generates a record: what failed, why, what the fix was. Every week, the team reviews those records and asks one question. Is there a guide we could add that would have prevented this? A CLAUDE.md entry, a skill, a new gate, a new test. If yes, add it. Commit it. Now the next PR cannot fail in the same way.
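The record and the weekly review can be sketched in a few lines. The field names, the classification labels, and the PR identifiers below are hypothetical; the idea is only that failures recurring under the same classification are the strongest candidates for a new guide or gate.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectionRecord:
    pr: str              # PR identifier (placeholder values below)
    gate: str            # which gate rejected it
    classification: str  # classified failure reason
    what_failed: str
    fix: str


def retro_agenda(records, min_repeats=2):
    """Return (classification, count) pairs seen at least min_repeats
    times, most frequent first."""
    counts = Counter(record.classification for record in records)
    return [(name, n) for name, n in counts.most_common() if n >= min_repeats]


week = [
    RejectionRecord("pr-a", "file_integrity", "deleted_tests",
                    "removed a failing test", "restored test, fixed the code"),
    RejectionRecord("pr-b", "sufficiency", "partial_scope",
                    "export shipped without tests", "added the missing tests"),
    RejectionRecord("pr-c", "file_integrity", "deleted_tests",
                    "removed a flaky test", "restored test, stabilised fixture"),
]
for classification, count in retro_agenda(week):
    print(f"{classification}: {count} rejections this week")
```

In this toy week, deleted tests show up twice, so the retro question becomes whether a CLAUDE.md rule or a protected-file pattern would have prevented both.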

Run a thirty-minute weekly harness retro on the projects where the harness is mature enough to support it. The first month it feels like overhead. By month three it feels like the most leveraged thirty minutes on the calendar. The cost of building the harness is paid down by the harness itself.

Closing thought

The discourse on agentic coding is converging fast. Anthropic’s harness post, Birgitta Böckeler’s guides and sensors, Kief Morris’s loop positions, the SWE-Bench Pro gap, the DORA 2025 finding. The framing is settling. The model is not the bottleneck. The harness is.

On the loop is where you change the harness that produced the artifact. That is the line from Morris’s piece I keep coming back to. The senior engineer who was the bottleneck in the in-the-loop world becomes the most leveraged person in the company in the on-the-loop world.

The job description changes. The output of a great senior engineer is no longer code. It is the system that makes the next hundred features ship correctly with much less of their time.

At Cheesecake Labs, we help engineering organizations move from “we use Claude Code” to “we built the harness that lets us trust Claude Code.” Gates, judges, classified failure logs: the unglamorous infrastructure that turns agentic coding into a delivery system.

If your senior engineers are spending most of their week reviewing agent diffs by hand, talk with us. The fix is usually a harness fix, and it pays back in weeks.

FAQ

What is a harness in the context of agentic coding?

The harness is the layer built around the model to prevent agents from declaring work done without verification. Per Birgitta Böckeler's framing, it has two halves: guides (feed-forward controls that anticipate the agent's behavior and steer it before it acts, such as CLAUDE.md files, PR templates, specs, skills, and architectural conventions) and sensors (feedback controls that observe the result and help it self-correct, such as linters, type checkers, test suites, review agents, faithfulness checks, commit/PR hooks, and judge models). A real harness is the closed loop between the two.

What are the three positions in Kief Morris's loop framework, and where do most teams sit?

In the loop: the engineer reviews every line the agent produces, becoming the bottleneck. On the loop: the engineer designs and maintains the mechanisms that produce and validate the agent's work. Out of the loop: the harness is mature enough that the agent runs largely autonomously and the human audits aggregate outputs. Most teams sit 'in the loop,' with senior engineers spending half their day reviewing AI-generated diffs by hand.

What are the five completion gates in the cascade described in the post?

Gate 1 Structural: lint, typecheck, unit tests. Gate 2 File integrity: ensures critical files are untouched and changes stay within an allow-list. Gate 3 Sufficiency: verifies the diff covers the scope of the spec and maps to acceptance criteria. Gate 4 Faithfulness: uses embedding-based semantic similarity between the diff and the design markdown as a filter. Gate 5 Judge LLM: a separate model reads the spec, design, tests, and diff and produces an explicit accept or reject with reasons. The order moves from cheap deterministic checks to the most expensive judge model.

Why should the executor agent not grade its own work?

The agent that implemented the feature has an incentive built into how it was prompted to converge on 'complete,' creating the same conflict of interest as a developer reviewing their own PR. The fix is a separate evaluator, ideally a different model. In practice this means running one model for implementation and another for judging, where the judge sees only the spec, the design, the tests, and the diff — not the executor's reasoning or chat history — and produces a structured verdict: accept, reject with classified reasons, or request clarification.

How does the harness improve over time (the flywheel)?

Every rejected PR generates a record of what failed, why, and what the fix was. Each week, the team reviews those records and asks whether a guide could have prevented the failure — a CLAUDE.md entry, a skill, a new gate, or a new test. If yes, it is added and committed, so the next PR cannot fail in the same way. A thirty-minute weekly harness retro is recommended on projects where the harness is mature enough to support it.

About the author.

Douglas da Silva

Douglas started as a Senior FullStack Developer at Cheesecake Labs and is now Partner and CTO at the company.