
Harness Engineering

In late August 2025, a team at OpenAI made the first commit to an empty git repository. Five months later, that repository contained roughly a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over 1,500 pull requests had been opened and merged. The product had internal daily users and external alpha testers.

None of the code was written by a human.

Three engineers drove the work. They wrote no code themselves. Every line was generated by Codex, OpenAI's coding agent. They averaged 3.5 merged pull requests per engineer per day, and the throughput increased as the team grew to seven. The product shipped, deployed, broke, and got fixed. What the engineers did, from the first commit onward, was something other than programming.

Ryan Lopopolo, writing about the experiment in February 2026, put it plainly: "The primary job of our engineering team became enabling the agents to do useful work." When the agent got stuck, the fix was almost never "try harder." The engineers would step into the task and ask: what capability is missing, and how do we make it both legible and enforceable for the agent?

That question turns out to be the entire discipline.

Where the term came from

Mitchell Hashimoto, co-founder of HashiCorp, appears to have crystallized the concept first. In a post describing his six-stage AI adoption journey, Stage 5 is "Engineer the Harness." His definition is concrete: every time you discover an agent has made a mistake, you take the time to engineer a solution so that it can never make that mistake again. Not a better prompt. Not a more detailed instruction. A structural fix to the environment the agent works in.

OpenAI's February 2026 post formalized the practice at scale. A team running Codex against a million-line codebase had accumulated enough operational experience to describe what the discipline actually consists of: context management, architectural enforcement, feedback loops, entropy control. The post is titled "Harness engineering: leveraging Codex in an agent-first world," though as Birgitta Böckeler pointed out, it only mentions "harness" once in the text. The substance mattered more than the naming.

Böckeler, a Distinguished Engineer at Thoughtworks with over twenty years in software delivery, published her response on martinfowler.com six days later. Her contribution was taxonomic. She broke the OpenAI team's practices into three distinct categories and connected them to existing engineering concepts. She also flagged a significant gap: the OpenAI post focused on internal quality and maintainability but said almost nothing about verification of functionality and behavior.

Six weeks after that, Anthropic published a deeper engineering treatment. Prithvi Rajasekaran from their Labs team described a multi-agent architecture that addressed the failure modes Böckeler had identified. Where OpenAI described what a mature harness looks like at rest, Anthropic showed how to build one that catches bugs before humans see them.

Kai Wang, writing for the AI Builder Club, read all three and compressed the convergence into a single line: "It is the control system that keeps agents inside a structured, testable, and reproducible workflow."

Four sources, four vantage points, one architecture emerging.

The three components

Böckeler's taxonomy is the cleanest way to parse what a harness actually contains. Three components, each addressing a different failure mode.

Context engineering

From the agent's point of view, anything it can't access in-context while running effectively doesn't exist. Knowledge that lives in Google Docs, Slack threads, or people's heads is invisible to the system. Repository-local, versioned artifacts are all the agent can see. Code, markdown, schemas, executable plans.

The OpenAI team learned this early. Progress was slower than expected at the start, and the reason had nothing to do with model capability. The environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals.

Their first instinct was a comprehensive AGENTS.md. It failed in predictable ways. Context is a scarce resource inside a model's working memory. A giant instruction file crowds out the task, the code, and the relevant docs. The agent either misses key constraints or starts optimizing for the wrong ones. Too much guidance becomes non-guidance. When everything is marked important, nothing is. And a monolithic manual rots instantly. Agents can't distinguish what's still true from what's stale.

So they inverted the approach. AGENTS.md became a table of contents: roughly 100 lines, serving as a map with pointers to deeper sources of truth. The real knowledge base lives in a structured docs/ directory:

AGENTS.md          ← ~100 lines, the map
ARCHITECTURE.md    ← top-level domain map, package layering
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

This enables what the team calls progressive disclosure. Agents start with a small, stable entry point and learn where to look next, rather than being overwhelmed up front. Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define operating principles. Plans are treated as first-class artifacts: ephemeral lightweight plans for small changes, execution plans with progress and decision logs for complex work. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

The principle extends well beyond documentation. The team made the application bootable per git worktree, so Codex could launch one instance per change. They wired the Chrome DevTools Protocol into the agent runtime, enabling Codex to take DOM snapshots, navigate the UI, and validate its own changes visually. They built a local observability stack with logs queryable via LogQL, metrics via PromQL, and traces via TraceQL. Each stack is ephemeral, scoped to a worktree, torn down when the task completes. Prompts like "ensure service startup completes in under 800ms" or "no span in these four critical user journeys exceeds two seconds" become tractable because the agent can measure the system it's changing.

The team regularly saw individual Codex runs work on a single task for upwards of six hours. Often while the humans were sleeping.

Context engineering is the practice of making the entire operational surface legible to the agent. What the code should do, yes, but also what the code does do right now. UI behavior, performance metrics, runtime state, error logs. If the agent can't observe it, the agent can't reason about it.

Architectural constraints

Context tells the agent what to do. Constraints prevent it from doing the wrong thing.

The OpenAI team built their application around a rigid architectural model. Each business domain is divided into a fixed set of layers: Types, Config, Repo, Service, Runtime, UI. Dependency directions are strictly validated. Cross-cutting concerns enter through a single explicit interface called Providers. A limited set of permissible edges. Anything else is disallowed.

These constraints are enforced mechanically. Custom linters and structural tests validate dependency directions, naming conventions, file size limits, structured logging requirements, and platform-specific reliability rules. Because the lints are custom, the team writes error messages that inject remediation instructions directly into agent context. When a lint fails, the agent doesn't just know that something is wrong. It knows what to do about it.
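The shape of such a lint can be sketched in a few lines. Only the layer names come from the article; the allowed-edge rule (a layer may only import from layers below it), the import convention, and the remediation wording are illustrative assumptions.

```python
"""Sketch of a structural lint whose error messages carry remediation
instructions, in the spirit of the custom linters described above."""
import re
from pathlib import Path

# Layering from the article; assumption: each layer may only import from
# layers that come earlier in this list.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

# Assumed import convention: `from app.<layer>.<module> import ...`
IMPORT_RE = re.compile(r"^from app\.(\w+)\.(\w+) import", re.MULTILINE)

def lint_layering(path: str, source: str) -> list[str]:
    """Return remediation-style error messages for illegal dependency edges."""
    errors: list[str] = []
    my_layer = next((p for p in Path(path).parts if p in RANK), None)
    if my_layer is None:
        return errors
    for match in IMPORT_RE.finditer(source):
        dep_layer = match.group(1)
        if dep_layer in RANK and RANK[dep_layer] > RANK[my_layer]:
            # The message tells the agent what to do, not just what is wrong.
            errors.append(
                f"{path}: layer '{my_layer}' may not import from '{dep_layer}'. "
                f"Remediation: move the shared logic into '{my_layer}' or a "
                f"lower layer, or expose it through the Providers interface."
            )
    return errors
```

When this fires, the failing agent run sees the remediation sentence directly in its context, which is the point: the lint output doubles as instruction.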

The philosophy: enforce invariants, not implementations. They require Codex to parse data shapes at the boundary, but don't specify which library to use. The model tends to reach for Zod. That's its choice. They care about dependency direction and module boundaries. Within those boundaries, the agent has significant freedom in how solutions are expressed.

Böckeler connected this to a broader observation about the AI coding landscape. Much of the early hype assumed LLMs would give us unlimited flexibility in the target runtime. Generate in any language, any pattern, without constraints. The LLM will figure it out. What the OpenAI team demonstrated is the opposite: increasing trust and reliability required constraining the solution space. Specific architectural patterns. Enforced boundaries. Standardized structures. You trade the "generate anything" flexibility for maintainability you can actually verify.

The team favored what they described as "boring" technology. Dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies described as boring tend to be easier for agents to model: composable, stable APIs, well-represented in training data. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. One example: rather than pulling in a generic p-limit-style concurrency package, they had Codex implement their own map-with-concurrency helper. Tightly integrated with their OpenTelemetry instrumentation. 100% test coverage. Behaves exactly the way their runtime expects.
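Their helper is TypeScript and wired into OpenTelemetry; a minimal Python/asyncio version shows only the core shape of such a map-with-concurrency utility. The signature is an assumption.

```python
"""Sketch of a map-with-concurrency helper: apply an async function to
every item while never running more than `limit` calls at once."""
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def map_with_concurrency(
    items: Iterable[T],
    fn: Callable[[T], Awaitable[R]],
    limit: int,
) -> list[R]:
    """Results come back in input order; the semaphore caps parallelism."""
    sem = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with sem:
            return await fn(item)

    # gather preserves the order of its arguments, so output aligns with input.
    return await asyncio.gather(*(run(i) for i in items))
```

A helper this small is trivial to fully cover with tests, which is exactly the property that made reimplementation cheaper than depending on an opaque upstream package.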

This is the kind of rigid architecture most teams postpone until they have hundreds of engineers. With coding agents, it's an early prerequisite. The constraints are what allow speed without decay.

In a human-first workflow, rules like these might feel pedantic or constraining. With agents, they become multipliers. Encoded once, applied everywhere at once.

Entropy management

Left alone, a codebase maintained by agents drifts. Codex replicates patterns that already exist in the repository, including suboptimal ones. Over time, this leads to what the OpenAI team experienced firsthand: the accumulation of "AI slop."

Initially, humans cleaned up the mess manually. The team spent every Friday cleaning up agent-generated drift. Twenty percent of the working week, consumed by janitorial work on code no human had written.

It didn't scale.

The fix was to encode what they call "golden principles" directly into the repository and build a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible for future agent runs. Two examples: prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and validate data boundaries or rely on typed SDKs rather than probing data shapes through trial and error. On a regular cadence, background Codex tasks scan for deviations, update quality grades, and open targeted refactoring pull requests. Most can be reviewed in under a minute and automerged.

Technical debt, the team observed, is like a high-interest loan. Almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. Bad patterns get caught on a daily basis, rather than spreading through the codebase for days or weeks before anyone notices.

This functions as garbage collection for code quality. And like garbage collection, it's not something you do once. It's something you run continuously, because entropy is a constant pressure.

Böckeler raised a question that follows naturally from this: can harness techniques be applied to existing applications, or do they only work for codebases built with a harness in mind? For older codebases, full of non-standardized patterns and accumulated entropy, retrofitting a harness may not be worth the effort. She compared it to running a static code analysis tool on a codebase that's never had one, then drowning in alerts. There may be two distinct futures for application maintenance: pre-AI and post-AI, with different economics governing each.

The multi-agent pattern

Böckeler's gap observation was specific: the OpenAI post described measures for long-term internal quality and maintainability, but said little about verifying that the software actually works correctly. Anthropic's March 2026 post addressed this directly.

Prithvi Rajasekaran had been working on two problems at Anthropic Labs: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. Both had earlier iterations. Both hit ceilings. Decomposing the failures, he identified two persistent failure modes.

Context decay. The longer a task runs, the more the agent drifts from its original intent. As the context window fills, models lose coherence. Some exhibit what Anthropic calls "context anxiety," where they begin wrapping up work prematurely as they approach what they believe is their context limit. With Claude Sonnet 4.5, summarizing earlier conversation in place (compaction) wasn't sufficient. The agent needed full context resets: clearing the window entirely and starting a fresh session with a structured handoff artifact carrying the previous agent's state and next steps.
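The article only says the handoff artifact carries the previous agent's state and next steps; the field names below are assumptions, but the mechanism can be sketched as a small serializable record that seeds the fresh session.

```python
"""Sketch of a structured handoff artifact for a full context reset."""
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    goal: str                                            # original intent, restated
    completed: list[str] = field(default_factory=list)   # work already done
    open_issues: list[str] = field(default_factory=list) # known problems
    next_steps: list[str] = field(default_factory=list)  # where to pick up

    def to_prompt(self) -> str:
        """Serialize as the opening context of the fresh session."""
        return ("Resume this task from the handoff below:\n"
                + json.dumps(asdict(self), indent=2))
```

The key design property is that the artifact is the *only* thing that crosses the reset boundary, so whatever is not written down is genuinely gone.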

Self-evaluation blindness. When asked to evaluate their own work, agents consistently respond by praising it. Even when the quality is obviously mediocre to a human observer. The pattern is most pronounced on subjective tasks like frontend design, where there is no binary pass/fail check equivalent to a test suite. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own output. But even on tasks with verifiable outcomes, agents exhibit poor judgment when assessing work they've produced themselves.

The solution draws on the structural insight behind Generative Adversarial Networks: separate generation from evaluation. The separation doesn't immediately eliminate leniency on its own. The evaluator is still an LLM inclined to be generous toward LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work. Once that external feedback exists, the generator has something concrete to iterate against.

Rajasekaran built a three-agent architecture.

Planner. Takes a one-to-four-sentence prompt and expands it into a full product specification. Instructed to be ambitious about scope and to focus on product context and high-level technical design rather than detailed implementation. The reasoning: if the planner specifies granular technical details upfront and gets something wrong, errors cascade into the downstream build. Better to constrain the deliverables and let the agents figure out the path.

Generator. Implements the spec one feature at a time, working in sprints. Uses a React, Vite, FastAPI, and SQLite stack (later PostgreSQL). Self-evaluates at the end of each sprint before handing off to QA. Has git for version control.

Evaluator. Uses Playwright to drive the running application the way a user would. Tests UI features, API endpoints, and database states. Grades each sprint against both the bugs it finds and a set of criteria covering product depth, functionality, visual design, and code quality. Each criterion has a hard threshold. If any one falls below it, the sprint fails and the generator gets detailed feedback on what went wrong.
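The hard-threshold rule is simple enough to sketch directly. The criterion names match the article; the numeric thresholds and the treatment of found bugs as automatic failures are illustrative assumptions.

```python
"""Sketch of hard-threshold sprint grading: any bug or any criterion
below its threshold fails the sprint and produces detailed feedback."""

THRESHOLDS = {
    "product_depth": 7,
    "functionality": 8,
    "visual_design": 6,
    "code_quality": 7,
}

def grade_sprint(scores: dict[str, int], bugs: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, feedback). Feedback goes back to the generator."""
    feedback = [f"BUG: {b}" for b in bugs]
    for criterion, minimum in THRESHOLDS.items():
        score = scores.get(criterion, 0)
        if score < minimum:
            feedback.append(f"FAIL {criterion}: scored {score}, threshold {minimum}")
    return (not feedback, feedback)
```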

Before each sprint, the generator and evaluator negotiate a sprint contract: they agree on what "done" looks like before any code is written. The generator proposes what it will build and how success will be verified. The evaluator reviews the proposal to ensure the right thing is being built. They iterate until they agree. Communication is handled via files. One agent writes; the other reads and responds.

Getting the evaluator to perform at this level required real calibration work. Out of the box, Claude is, by Rajasekaran's own assessment, a poor QA agent. In early runs, he watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It tested superficially rather than probing edge cases. The tuning loop was to read the evaluator's logs, find examples where its judgment diverged from his, and update the prompt to correct for those gaps. It took several rounds before the evaluator graded in a way he found reasonable.

Once calibrated, the evaluator caught real problems. Sprint 3 of a retro video game maker had 27 criteria covering the level editor alone. Some examples of what the evaluator found:

| Contract criterion | Evaluator finding |
| --- | --- |
| Rectangle fill tool allows click-drag to fill area with selected tile | FAIL. Tool only places tiles at drag start/end points instead of filling the region. `fillRectangle` function exists but isn't triggered properly on mouseUp. |
| User can select and delete placed entity spawn points | FAIL. Delete key handler requires both selection and `selectedEntityId` to be set, but clicking an entity only sets `selectedEntityId`. |
| User can reorder animation frames via API | FAIL. `PUT /frames/reorder` route defined after `/{frame_id}` routes. FastAPI matches "reorder" as a `frame_id` integer and returns 422. |
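The route-ordering bug in the last row can be reproduced in miniature. FastAPI resolves routes in definition order, so `PUT /frames/reorder` falls into the `/{frame_id}` handler, where "reorder" fails integer validation. This is a plain-Python simulation of that first-match behavior, not FastAPI itself.

```python
"""Simulation of first-match routing with typed path parameters:
a '{frame_id}' segment matches any value but only validates integers."""

def match(routes: list[str], path: str) -> tuple[str, int]:
    """Return (matched route, status). Non-integer values hitting a
    {frame_id} segment yield 422, mirroring FastAPI's validation error."""
    for route in routes:
        r_parts, p_parts = route.split("/"), path.split("/")
        if len(r_parts) != len(p_parts):
            continue
        ok, status = True, 200
        for r, p in zip(r_parts, p_parts):
            if r == "{frame_id}":
                if not p.isdigit():
                    status = 422  # pattern matched, type validation failed
            elif r != p:
                ok = False
                break
        if ok:
            return route, status
    return "", 404

buggy = ["/frames/{frame_id}", "/frames/reorder"]  # parameterized first: bug
fixed = ["/frames/reorder", "/frames/{frame_id}"]  # static route first: works
```

The fix the evaluator's finding points to is exactly the `fixed` ordering: declare static routes before parameterized ones so the literal path wins.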

These are not surface-level observations. The evaluator identified the specific function that wasn't triggering, the exact conditional logic that was wrong, the route ordering that caused a type mismatch. This is the kind of QA feedback that sends an agent straight to the fix, with no additional investigation required.

The biggest difference showed up in the final product. Rajasekaran compared a solo run (no harness, 20 minutes, $9) against the full harness (6 hours, $200) on the same retro game maker prompt. The solo run produced an application that looked reasonable on first glance. Click through it, though, and the problems emerged. The layout wasted space. The workflow was rigid, with no guidance toward the correct sequence. The actual game was broken: entities appeared on screen, but nothing responded to input. The wiring between entity definitions and the game runtime was disconnected, with no surface indication of where.

The harnessed version had rough edges. A character jumped onto a platform and ended up overlapping with it, which felt intuitively wrong. But the core thing worked. You could play the game.

For the frontend design domain, Rajasekaran built a separate generator-evaluator loop running 5 to 15 iterations per generation. He wrote four grading criteria (design quality, originality, craft, and functionality) and weighted design quality and originality more heavily, because Claude already scored well on craft and functionality by default. On design and originality, it reliably produced bland output. The criteria explicitly penalized generic patterns, pushing the model toward aesthetic risk-taking.

The wording of the criteria shaped outputs in ways he didn't fully anticipate. Including the phrase "the best designs are museum quality" pushed designs toward a particular visual convergence. In one example, a Dutch art museum website went through nine iterations of clean, dark-themed conventional refinement. On the tenth cycle, the agent scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on walls in free-form positions, doorway-based navigation between gallery rooms instead of scroll or click. The kind of creative leap that single-pass generation hadn't produced.

The economics

The cost question is unavoidable. Here are the numbers from Anthropic's experiments:

| Prompt | Harness | Duration | Cost |
| --- | --- | --- | --- |
| Retro game maker | Solo agent | 20 min | $9 |
| Retro game maker | Full harness (v1) | 6 hr | $200 |
| Browser DAW | Simplified harness (v2) | 3 hr 50 min | $124.70 |

Over twenty times the cost. The quality difference is categorical, not incremental. The solo run produces something that looks like a working application. The harnessed run produces something that is one. Features function. Bugs are caught and fixed during the build. The application is tested by an agent that interacts with it the way a user would.

On the other side of the ledger, OpenAI estimated their harness approach built their product in roughly one-tenth the time hand-coding would have taken. Three engineers producing 3.5 merged PRs per day each, with throughput increasing as the team grew to seven.

The interesting principle underneath these numbers comes from Rajasekaran: every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions are worth stress-testing, both because they may be wrong and because they go stale as models improve.

He demonstrated this directly. The first version of the harness was built for Claude Opus 4.5, which needed sprint decomposition to maintain coherence across long tasks. When Opus 4.6 shipped, with better planning, longer sustained agentic work, improved self-correction, and stronger long-context retrieval, the sprint construct became unnecessary overhead. The model could run coherently for over two hours without it. Rajasekaran removed it.

With sprints gone, the evaluator's role shifted too. On Opus 4.5, the evaluator caught meaningful issues across every sprint, because the task was at the edge of what the generator could handle solo. On Opus 4.6, the model's baseline capability increased, so the boundary moved outward. Tasks that had needed the evaluator's check to be implemented coherently were now within what the generator handled well on its own. But for the parts of the build still at the edge of the generator's capability, the evaluator continued to provide real lift.

The cost breakdown for the browser DAW shows where time and money actually sit:

| Agent & Phase | Duration | Cost |
| --- | --- | --- |
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| QA (Round 2) | 6.8 min | $3.09 |
| Build (Round 3) | 10.9 min | $5.88 |
| QA (Round 3) | 9.6 min | $4.06 |

The planner is trivially cheap. The evaluator is cheap per round. Almost all the cost is in the generator building the application. The evaluator adds roughly 8% to total cost while catching bugs that would otherwise require human QA or, worse, ship unfound.

The practical implication: the evaluator is not a fixed yes-or-no decision. It's worth the cost when the task sits beyond what the current model does reliably solo. As models improve, that boundary shifts. Components that were load-bearing last month might be dead weight today.

Rajasekaran framed this as a moving frontier: "The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination."

What this means in practice

All three sources converge on the same structure, and Kai Wang compressed it into a flow:

Constitution + Spec (intent and constraints written down)
  -> Plan + Tasks (with test specs and acceptance criteria upfront)
    -> Generator executes task by task
      -> after each task:
          analyze alignment
          verify tasks
          run tests
      -> if anything fails, send it back and redo
        -> loop until everything passes

Intent into persistent artifacts first. Then enforce validation with an independent evaluator. Loop until everything passes. The specific tools don't matter. The pattern does.
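The flow above can be sketched as a minimal control loop. The function names are placeholders for whatever planner, generator, and evaluator you wire in; the only load-bearing structure is that verification is independent of generation and failures route back as feedback.

```python
"""Sketch of the harness control loop: execute task by task,
verify independently, and redo with feedback until everything passes."""
from typing import Callable

def harness_loop(
    tasks: list[str],
    execute: Callable[[str], str],           # generator: prompt -> artifact
    verify: Callable[[str, str], list[str]], # evaluator: (task, artifact) -> failures
) -> dict[str, str]:
    done: dict[str, str] = {}
    for original in tasks:
        prompt = original
        while True:
            artifact = execute(prompt)
            failures = verify(original, artifact)
            if not failures:
                done[original] = artifact
                break
            # Feedback loop: failures go back to the generator verbatim.
            prompt = original + "\nFix: " + "; ".join(failures)
    return done
```

A real deployment would add a retry budget and escalation to a human, but the skeleton is the whole pattern: generate, verify, loop.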

Chad Fowler, writing in "Relocating Rigor," made the connection to existing engineering discipline explicit. Generative systems only work if invariants are explicit rather than implicit. Interfaces must be real contracts, not incidental boundaries. Evaluation must be ruthless. Failures must be loud and immediate.

Look at what the three harness components actually are. Context engineering is documentation: structured, discoverable, version-controlled knowledge that an agent can navigate. Architectural constraints are linting and testing: mechanical enforcement of invariants that keep the codebase coherent. Entropy management is tech debt hygiene: continuous cleanup that prevents drift from compounding into crisis.

These are practices that have existed for decades. The harness didn't invent them. What changed is that when humans wrote the code, these practices competed for attention with the typing. Developers knew documentation mattered, knew linting helped, knew tech debt compounds. But the immediate pressure was always to produce the next feature. The infrastructure of quality was perpetually second priority, perpetually under-invested. It's hard to argue for better linter rules when the sprint board is full.

When agents write the code, the typing disappears as a bottleneck. What remains is everything that was always supposed to surround it. The documentation. The architectural rules. The feedback loops. The cleanup processes. The environment that makes good output possible.

Böckeler made a prediction worth watching. Most organizations have two or three main tech stacks. She imagined a future where teams pick from a set of harnesses for common application topologies, the way they pick from service templates today. Harnesses with custom linters, structural tests, context documentation, and observability integration built in. Start from a harness, shape it over time for the application's specifics.

That this team worked on their harness for five months, she noted, shows this isn't something you jump into for quick results. Building a harness is building the infrastructure that makes everything else possible. It compounds.

Kai Wang's summary holds: "You are no longer writing code. You are writing constraints, acceptance criteria, and feedback loops. Writing prompts is just the beginning. The real job is making sure agents stay controlled and actually work."

The harness is the engineering now.


Sources

  • Ryan Lopopolo, "Harness engineering: leveraging Codex in an agent-first world" (February 2026): https://openai.com/index/harness-engineering/
  • Birgitta Böckeler, "Harness Engineering" (February 2026): https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
  • Prithvi Rajasekaran, "Harness design for long-running application development" (March 2026): https://www.anthropic.com/engineering/harness-design-long-running-apps
  • Kai Wang, "Harness Engineering: What Engineers Do Now" (March 2026): AI Builder Club
  • Mitchell Hashimoto, "My AI Adoption Journey" (2025): https://mitchellh.com/writing/my-ai-adoption-journey
  • Chad Fowler, "Relocating Rigor" (2026): https://aicoding.leaflet.pub/3mbrvhyye4k2e