Scaling AI: A 6-Month Path from Champions to Company-Wide
Douglas da Silva | May 27, 2026
In November 2025, Justin Young at Anthropic published a post on what the company had learned from operating long-running coding agents. One observation in that post stuck with me. He described a failure mode where Claude, after some progress on a project, would look at the state of the work, see that things had been built, and declare the job done. The verb was “declare.” The agent ran a curl, got a 200 back, called the integration finished, and moved on.
Five months later, Birgitta Böckeler at Thoughtworks published the cleanest writeup I have read of what we should build around the model to stop that from happening. She called the layer the harness, and she split it into two halves: guides (feed-forward controls that anticipate the agent’s behavior and steer it before it acts) and sensors (feedback controls that observe the result and help it self-correct).
Two months before Böckeler’s piece, Kief Morris had published the loop framework that gave the layer its strategic shape.
If you say “harness” to most engineers in 2026, they will nod and tell you they are doing it. Then you ask what they mean and you get a list of three things: linters, tests, CI. That is not a harness. That is a small slice of the sensor layer with no guides and no judgment.
Böckeler’s framing is the one I have adopted internally at Cheesecake Labs. The harness has two halves.
Guides shape what the agent does before it acts: the CLAUDE.md file at the project root, the PR template the agent reads before opening a pull request, the spec it must implement against, the skill that encodes how this team does database migrations, and the architectural conventions written in plain English that the agent reads on every session.
Guides are cheap and they compound. They are also the part of the harness most teams under-invest in, because none of it ships a feature on its own.
The sensors observe what the agent did and tell it whether it worked. Linters and type checkers, yes, but also test suites that actually run, separate review agents that read the diff, faithfulness checks that compare the implementation back to the spec, hooks on commit and on PR open, and the judge model that grades the work against the acceptance criteria. Sensors are how the agent learns it was wrong.
Böckeler’s sharpest point is that you need both. Sensors alone leave you with an agent that keeps making the same mistake because nothing told it the right rule upstream. Guides alone leave you with an agent that follows the rules but never finds out whether they produced the right outcome. A real harness is the closed loop between the two.
Morris’s in / on / out of the loop framing is the most useful diagnostic tool I have for talking to engineering leaders right now. I ask them where their team sits and the answer tells me what to invest in next.
In the loop: the engineer reviews every line the agent produces. They are the gatekeeper on the innermost loop, where code gets generated. Morris’s words: “the challenge when we insist on being too closely involved in the process is that we become a bottleneck.”
Most teams I see live here: they use Claude Code, they ship features, and their senior engineers spend half their day reviewing AI-generated diffs by hand. The agent went faster and the reviewer became the constraint.
On the loop: the engineer designs and maintains the mechanisms that produce and validate the agent’s work. Morris again: “Rather than personally inspecting what the agents produce, we can make them better at producing it.”
This is where harness engineering becomes a real category of work. You stop fixing individual bad PRs and start fixing the system that produced them. The senior engineer’s job shifts from line reviewer to harness builder.
Out of the loop: the harness is mature enough that the agent runs largely autonomously and the human audits aggregate outputs. Morris calls this the natural home for what people loosely term “vibe coding,” but only when the harness is strong enough to keep vibe-coded output safe. Without that harness, “out of the loop” is just shipping bugs faster.
There is a fourth rung Böckeler implies but does not name. I call it the flywheel: the harness improving from its own outputs. Failed gates become CLAUDE.md updates, rejected PRs become new tests, and the harness compounds. This is where the leverage is.
The leap from in to on is the single biggest career move of the next two years for senior engineers, and the single biggest architecture move for engineering leaders. The next leap, from on to out, requires the harness to actually be good.
The gate cascade is the part of the harness most teams have not built and most need. The premise is that “done” is not the agent saying so; it is the system proving it. A task moves through a cascade of checks, cheapest filters first, before it is accepted. Each gate catches a specific failure mode and, when it fails, returns a classified reason and an actionable fix.
The cascade I run at Cheesecake Labs has five gates. None of these are individually novel. The order and the framing are mine, built on top of Böckeler’s guides-and-sensors model and Anthropic’s failure-mode taxonomy.
Gate 1, structural: lint, typecheck, unit tests. Cheap and deterministic. It catches what Anthropic calls the “marks complete without verification” failure: the agent ran a curl, got a 200, declared the integration working. This gate fails roughly 30 to 50% of first-shot agent PRs in my experience. That is a healthy signal. It means the gate is doing its job.
Gate 2, file integrity: critical files stay untouched. Did the agent silently delete tests to make them pass? Rewrite an API contract instead of conforming to it? Modify a config file outside the change scope? These are the failures that ship as silent regressions. A simple allow-list on which files a task is permitted to touch catches most of them. The fix takes one line of YAML. Not enough teams have it.
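A minimal sketch of such an allow-list check, in Python. The function name, the glob patterns, and the separate list of protected files are assumptions made for illustration; the article only says that a task carries an allow-list of files it may touch.

```python
from fnmatch import fnmatchcase


def check_file_integrity(changed_files, allowed_globs, protected_globs=()):
    """Return (passed, violations) for the files a task's diff touched.

    A file is a violation if it matches a protected pattern, or if it
    matches none of the patterns the task is allowed to touch.
    """
    violations = []
    for path in changed_files:
        if any(fnmatchcase(path, g) for g in protected_globs):
            violations.append((path, "protected file modified"))
        elif not any(fnmatchcase(path, g) for g in allowed_globs):
            violations.append((path, "outside task allow-list"))
    return (not violations, violations)


# Hypothetical task: may touch billing code and tests, but not the contract test.
passed, violations = check_file_integrity(
    changed_files=["src/billing/invoice.py", "config/prod.yaml"],
    allowed_globs=["src/billing/*", "tests/*"],
    protected_globs=["tests/test_contract.py"],
)
```

Here the config file falls outside the allow-list, so the gate fails and names the offending path, which gives the agent a classified reason instead of a bare rejection.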
Gate 3, sufficiency: does the diff actually cover the scope of the spec? An agent that ships half the feature and declares the rest “follow-up work” is the most common failure mode I see in plan mode workflows. The fix is mechanical: every task in tasks.md has acceptance criteria, every PR maps to one task, and the gate verifies the diff touched what the task said it would.
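One way to make that check mechanical is sketched below. It assumes each task declares the paths it expects to change; the `Task` shape and the `expected_paths` field are hypothetical, since the article does not specify the layout of tasks.md.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Task:
    task_id: str
    expected_paths: list  # glob patterns the task said it would touch (assumed field)


def check_sufficiency(task, changed_files):
    """Return (passed, missing): expected patterns that no changed file matched."""
    missing = [
        pattern
        for pattern in task.expected_paths
        if not any(fnmatchcase(path, pattern) for path in changed_files)
    ]
    return (not missing, missing)


# Hypothetical task that promised an endpoint and its tests.
task = Task("example-task", ["src/api/refunds.py", "tests/test_refunds*.py"])
passed, missing = check_sufficiency(task, ["src/api/refunds.py"])
```

The diff above ships the endpoint without the promised tests, so the gate fails and reports which part of the task is still uncovered.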
Gate 4, faithfulness: does the implementation actually do what the design said it would do? This is the gate most teams skip. The mechanic is borrowed from RAG evaluation tools like RAGAS and adapted: compute a semantic similarity between the diff and the design markdown. Cheap, embedding-based, runs in seconds. It is a filter, not a verdict. If the similarity is below a threshold, the PR fails before it ever pays for a judge model.
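The shape of that filter can be sketched as follows. The `embed` function here is a deliberate stand-in that counts word tokens so the example runs without any model; a real gate would call an embedding model instead. The threshold value is also an assumption, not a number from the article.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. Swap in a real embedding model in practice."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def faithfulness_gate(diff_text, design_text, threshold=0.3):
    """Return (passed, score). A filter only: passing does not mean correct."""
    score = cosine_similarity(embed(diff_text), embed(design_text))
    return score >= threshold, score


design = "add retry with exponential backoff to the payment client"
on_topic = "def retry payment client with exponential backoff"
off_topic = "rename logo asset file"

passed_on, score_on = faithfulness_gate(on_topic, design)
passed_off, score_off = faithfulness_gate(off_topic, design)
```

The off-topic diff shares no vocabulary with the design and is rejected before any judge tokens are spent, while the on-topic diff moves on to the next gate.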
Gate 5, judge LLM: a separate model, ideally a different one, reads the spec, the design, the tests, and the diff, and produces an explicit accept or reject with reasons. Zheng et al. (2023) showed strong LLM judges agree with humans more than 80% of the time, “the same level of agreement between humans.”
The Agent-as-a-Judge work from Meta in 2024 extended this specifically to coding agents and found it “dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline.” The judge runs only on PRs that already passed gates 1 through 4, so it is the most expensive gate but you pay it the least often.
The order matters because the judge is expensive in tokens and latency — you only pay for it on PRs that have already cleared the structural, integrity, sufficiency, and faithfulness checks. By the time the judge runs, it is evaluating work that is plausibly correct, not work that already failed lint.
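The ordering argument can be sketched as a short-circuiting runner. The `GateResult` shape, the gate functions, and the PR dictionary are all hypothetical; the point is only that a failure at a cheap gate stops the cascade before the expensive one is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: Optional[str] = None  # classified failure reason
    fix: Optional[str] = None     # actionable next step for the agent


def run_cascade(gates: list[Callable[[dict], GateResult]], pr: dict) -> list[GateResult]:
    """Run gates in the given order and stop at the first failure."""
    results = []
    for gate in gates:
        result = gate(pr)
        results.append(result)
        if not result.passed:
            break  # later, more expensive gates never run
    return results


judge_calls = []  # records each time the (simulated) paid judge is invoked


def structural(pr):
    if pr["lint_errors"]:
        return GateResult("structural", False, "lint_failure", "fix the reported lint errors")
    return GateResult("structural", True)


def judge(pr):
    judge_calls.append(pr["id"])  # stands in for a paid model call
    return GateResult("judge", True)


failed = run_cascade([structural, judge], {"id": "example-pr", "lint_errors": 2})
```

With two lint errors the cascade returns a single failed result carrying its reason and fix, and the judge is never invoked.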
I want to underline the structural argument here because it is the move most teams resist most stubbornly.
The agent that implemented the feature does not get to decide whether it is done. Same conflict of interest as a developer reviewing their own PR. The executor agent has an incentive (built into how it was prompted) to converge on “complete.” If you let it grade itself, it will grade itself generously. The only fix is a separate evaluator.
In practice this means two agents, ideally two different models. At Cheesecake Labs we run Sonnet 4.6 for implementation and Opus 4.7 for judging. The judge sees only the spec, the design, the tests, and the diff. It does not see the executor’s reasoning. It does not see the chat history. It produces a structured verdict: accept, reject (with classified reasons), or request clarification (with specific questions).
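A sketch of the two constraints described above: the judge's input is restricted to four artifacts, and its output is parsed into a structured verdict. The JSON field names and the `request_clarification` string are assumptions for illustration, and no model call is made here.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    CLARIFY = "request_clarification"


@dataclass
class Verdict:
    decision: Decision
    reasons: list = field(default_factory=list)    # classified reasons for a reject
    questions: list = field(default_factory=list)  # specific questions for a clarification


JUDGE_INPUT_KEYS = ("spec", "design", "tests", "diff")


def build_judge_input(artifacts):
    """Keep only what the judge may see; executor reasoning and chat history are dropped."""
    return {key: artifacts[key] for key in JUDGE_INPUT_KEYS}


def parse_verdict(raw):
    """Parse the judge's JSON reply and enforce that each verdict is actionable."""
    data = json.loads(raw)
    decision = Decision(data["decision"])  # raises ValueError on an unknown decision
    reasons = data.get("reasons", [])
    questions = data.get("questions", [])
    if decision is Decision.REJECT and not reasons:
        raise ValueError("a reject verdict must carry classified reasons")
    if decision is Decision.CLARIFY and not questions:
        raise ValueError("a clarification request must carry specific questions")
    return Verdict(decision, reasons, questions)


judge_input = build_judge_input({
    "spec": "...", "design": "...", "tests": "...", "diff": "...",
    "executor_reasoning": "...", "chat_history": "...",
})
verdict = parse_verdict('{"decision": "reject", "reasons": ["missing_error_handling"]}')
```

Rejecting a reply that says "reject" without reasons keeps the judge honest in the same way the gates are: every failure has to come back classified.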
The rejection rate on first pass is between 15 and 25% depending on the team and the spec quality. That is not a sign that the executor is bad. It is a sign that the judge is doing its job. Without the judge, that 15 to 25% of work was shipping as “done” and getting caught later, either in QA, in the next sprint, or in production.
DORA’s 2025 report, State of AI-assisted Software Development, puts the wider point most directly: “AI doesn’t fix a team. It amplifies what’s already there.” If your “done” definition was already loose, AI ships more loose-definition work faster. The judge tightens the definition.
The fourth rung, the flywheel, is where this work compounds. Most teams never get there because they treat each rejected PR as a one-off. The PR gets fixed, merged, and forgotten.
The pattern that gets you to the flywheel is mechanical and unglamorous. Every rejected PR generates a record: what failed, why, what the fix was. Every week, the team reviews those records and asks one question. Is there a guide we could add that would have prevented this? A CLAUDE.md entry, a skill, a new gate, a new test. If yes, add it. Commit it. Now the next PR cannot fail in the same way.
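The record-keeping side of this can be very small. The sketch below assumes each rejection is logged with a failure class; the record fields, the class names, and the `min_count` cutoff are hypothetical. It surfaces the failure classes that recur, which are the natural agenda for the weekly review.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectionRecord:
    pr: str             # PR identifier
    gate: str           # which gate rejected it
    failure_class: str  # what failed, as a classified reason
    fix: str            # what the fix was


def retro_candidates(records, min_count=2):
    """Return (failure_class, count) pairs that recur, most frequent first."""
    counts = Counter(record.failure_class for record in records)
    return [(cls, n) for cls, n in counts.most_common() if n >= min_count]


# Hypothetical week of rejections.
log = [
    RejectionRecord("pr-a", "file_integrity", "deleted_tests", "restored tests"),
    RejectionRecord("pr-b", "sufficiency", "scope_gap", "implemented remaining task"),
    RejectionRecord("pr-c", "file_integrity", "deleted_tests", "restored tests"),
]
candidates = retro_candidates(log)
```

A class that shows up more than once is a candidate for a new guide, such as a CLAUDE.md entry stating that tests may not be deleted to make a suite pass.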
Run a thirty-minute weekly harness retro on the projects where the harness is mature enough to support it. The first month it feels like overhead. By month three it feels like the most leveraged thirty minutes on the calendar. The cost of building the harness is paid down by the harness itself.
The discourse on agentic coding is converging fast. Anthropic’s harness post, Birgitta Böckeler’s guides and sensors, Kief Morris’s loop positions, the SWE-Bench Pro gap, the DORA 2025 finding. The framing is settling. The model is not the bottleneck. The harness is.
On the loop is where you change the harness that produced the artifact. That is the line from Morris’s piece I keep coming back to. The senior engineer who was the bottleneck in the In-the-loop world becomes the most leveraged person in the company in the On-the-loop world.
The job description changes. The output of a great senior engineer is no longer code. It is the system that makes the next hundred features ship correctly with much less of their time.
At Cheesecake Labs, we help engineering organizations move from “we use Claude Code” to “we built the harness that lets us trust Claude Code.” Gates, judges, classified failure logs, the unglamorous infrastructure that turns agentic coding into a delivery system.
If your senior engineers are spending most of their week reviewing agent diffs by hand, talk with us. The fix is usually a harness fix, and it pays back in weeks.

The harness is the layer built around the model to prevent agents from declaring work done without verification. Per Birgitta Böckeler's framing, it has two halves: guides (feed-forward controls that anticipate the agent's behavior and steer it before it acts, such as CLAUDE.md files, PR templates, specs, skills, and architectural conventions) and sensors (feedback controls that observe the result and help it self-correct, such as linters, type checkers, test suites, review agents, faithfulness checks, commit/PR hooks, and judge models). A real harness is the closed loop between the two.
In the loop: the engineer reviews every line the agent produces, becoming the bottleneck. On the loop: the engineer designs and maintains the mechanisms that produce and validate the agent's work. Out of the loop: the harness is mature enough that the agent runs largely autonomously and the human audits aggregate outputs. Most teams sit 'in the loop,' with senior engineers spending half their day reviewing AI-generated diffs by hand.
Gate 1 Structural: lint, typecheck, unit tests. Gate 2 File integrity: ensures critical files are untouched and changes stay within an allow-list. Gate 3 Sufficiency: verifies the diff covers the scope of the spec and maps to acceptance criteria. Gate 4 Faithfulness: uses embedding-based semantic similarity between the diff and the design markdown as a filter. Gate 5 Judge LLM: a separate model reads the spec, design, tests, and diff and produces an explicit accept or reject with reasons. The order moves from cheap deterministic checks to the most expensive judge model.
The agent that implemented the feature has an incentive built into how it was prompted to converge on 'complete,' creating the same conflict of interest as a developer reviewing their own PR. The fix is a separate evaluator, ideally a different model. In practice this means running one model for implementation and another for judging, where the judge sees only the spec, the design, the tests, and the diff — not the executor's reasoning or chat history — and produces a structured verdict: accept, reject with classified reasons, or request clarification.
Every rejected PR generates a record of what failed, why, and what the fix was. Each week, the team reviews those records and asks whether a guide could have prevented the failure — a CLAUDE.md entry, a skill, a new gate, or a new test. If yes, it is added and committed, so the next PR cannot fail in the same way. A thirty-minute weekly harness retro is recommended on projects where the harness is mature enough to support it.
Douglas started as a Senior Full-Stack Developer at Cheesecake Labs and is now Partner and CTO at the company.