
Archive

Harness Engineering: A Composable Architecture

Three engineers at OpenAI produced a million lines of code last year. None of it was written by hand. What made that work was the structure around the model, not the model itself: a pipeline where quality gates check every transition and failures loop back as structural fixes rather than prompt patches.

Tatsunori Hashimoto named the discipline: when an agent makes a mistake, engineer a structural fix so it can never make that mistake again. Birgitta Böckeler, writing on martinfowler.com, distinguished what steers agents before they act from what corrects them after, mapping a taxonomy of guides and sensors. Anthropic's multi-agent research showed the shape at a different scale: separate generation from evaluation, make the evaluator skeptical, loop until everything passes. Four groups, different problems, the same skeleton.

That skeleton is a pipe. A signal enters, gets transformed through stages, and produces an artifact, with quality checked at every seam. The scientific method follows this shape; so does the OODA loop (observe, orient, decide, act). The structure is older than software; it is the basic form of structured inquiry.
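The pipe reduces to a few lines of code. What follows is an illustrative sketch, not any of the cited teams' actual harnesses; the stage and gate functions are toy stand-ins for generation and skeptical evaluation, and all names are invented for the example.

```python
from typing import Callable

Stage = Callable[[str], str]  # transforms the work-in-progress artifact
Gate = Callable[[str], bool]  # quality check at the seam after a stage

def run_pipe(signal: str, stages: list[tuple[Stage, Gate]], max_loops: int = 3) -> str:
    """Push a signal through transforming stages; a failed gate loops
    the artifact back through the same stage instead of giving up."""
    artifact = signal
    for stage, gate in stages:
        for _ in range(max_loops):
            artifact = stage(artifact)
            if gate(artifact):
                break  # seam is clean; move on to the next stage
        else:
            raise RuntimeError("gate still failing after max_loops")
    return artifact

# Toy pipeline: a 'generate' stage appends work, a skeptical gate checks it.
result = run_pipe(
    "spec",
    [
        (lambda a: a + "->code", lambda a: a.endswith("code")),
        (lambda a: a + "->tested", lambda a: "tested" in a),
    ],
)
```

The point of the sketch is the `else` branch: failure is structural, surfaced at the seam where it occurred, rather than silently passed downstream.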

But if harness engineering names only the pipe, then what has been named is not new. The field needed explicit terminology for the age of agents, and the terminology is valuable. The question is whether an architecture exists underneath: something an engineer can compose and configure, something that explains how the system that produces runs gets better over time. The pipe comes first, because you need the skeleton before you can see what has grown around it.

Harness Engineering

In late August 2025, a team at OpenAI made the first commit to an empty git repository. Five months later, that repository contained roughly a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over 1,500 pull requests had been opened and merged. The product had internal daily users and external alpha testers.

None of the code was written by a human.

Three engineers drove the work. They wrote no code themselves. Every line was generated by Codex, OpenAI's coding agent. They averaged 3.5 merged pull requests per engineer per day, and the throughput increased as the team grew to seven. The product shipped, deployed, broke, and got fixed. What the engineers did, from the first commit onward, was something other than programming.

Ryan Lopopolo, writing about the experiment in February 2026, put it plainly: "The primary job of our engineering team became enabling the agents to do useful work." When the agent got stuck, the fix was almost never "try harder." The engineers would step into the task and ask: what capability is missing, and how do we make it both legible and enforceable for the agent?

That question turns out to be the entire discipline.

The Phoenix Test

"Voting With Fire" made the case for the phoenix: a codebase designed for regeneration rather than permanence, where specifications and acceptance tests constitute the durable layer and application code is disposable. But it left a question open that the archaeological evidence answers with uncomfortable precision. Which specifications? Which tests? What, exactly, needs to survive the fire — and what looks durable but will burn with the structure it serves?

Luke Kemp's evidence across two hundred cases of civilizational collapse reveals a variable that determines everything, a variable that has nothing to do with the knowledge itself and everything to do with the social arrangement that carries it. The same knowledge, held differently, has radically different survival properties. The critical question is never what the knowledge is. It is who holds it, how it travels, and whether it serves the community or the palace.

Voting with Fire

In Goliath's Curse, the historian Luke Kemp renames civilization. He calls it Goliath: a collection of dominance hierarchies in which some individuals dominate others to control energy and labor. Named for the Bronze Age warrior — imposing in stature, reliant on violence, surprisingly fragile. Across two hundred case studies spanning five millennia, Kemp documents the same structural dynamic. As hierarchical societies age, inequality concentrates, decision-making deteriorates, and the system grows brittle. Complexity scientists call it critical slowing down. A healthy system absorbs shocks and recovers quickly. An extractive one recovers more slowly from each successive disturbance, like an aging body that takes longer to heal from each injury, until eventually a shock that the system would have once absorbed tips it into collapse.

The curse is internal. Goliaths don't die from external assault. They hollow themselves out. The wealth pump transfers resources upward; the exchange between rulers and ruled grows more unequal; elites compete for shrinking returns; and the population that once sustained the structure loses both the incentive and the capacity to defend it. Then drought comes, or invasion, or rebellion, and the system that looked permanent proves to have been perched on a knife's edge for decades.

Software engineers will recognize this dynamic because they live inside it.

DISCOVER

Software development has two scoreboards and a process between them. The first measures deployment: how fast code ships, how often it breaks, how quickly you recover. The second measures the organization: how fast signals reach the right person, how autonomous the response is. Comprehension — understanding the system well enough to decide and act correctly — sits between them.
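The first scoreboard is concrete enough to compute. A minimal sketch, assuming nothing more than a flat list of deployment records; the record shape and the numbers are invented for illustration.

```python
# Hypothetical deployment log: (day, failed?, minutes_to_restore_if_failed)
deploys = [
    (1, False, 0), (2, True, 45), (3, False, 0),
    (5, False, 0), (6, True, 30), (8, False, 0),
]
days_observed = 8

# How fast code ships, how often it breaks, how quickly you recover.
deploy_frequency = len(deploys) / days_observed
failures = [d for d in deploys if d[1]]
change_failure_rate = len(failures) / len(deploys)
mean_time_to_restore = sum(d[2] for d in failures) / len(failures)  # minutes
```

The second scoreboard resists this treatment: signal latency and response autonomy live in conversations and org charts, not deployment logs, which is why the essay treats comprehension as the hidden term between the two.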

The progression follows a logic that Simon Wardley mapped in his work on technology evolution: every capability that becomes commodity accelerates everything that depends on it, exposing the next constraint beneath.

The Decision Loop

DORA measures how fast code flows from commit to production. MOVE measures how fast the organization senses, decides, and acts. Both track real performance. Both share an assumption: that someone, somewhere, understood the system well enough to make the right call.

That assumption has a cost, and it's larger than most organizations realize.

The Replacement Rate

Two guys in the jungle. A tiger charges. One kneels to tighten his shoelaces. The other yells: "You can't outrun a tiger!" First guy: "I don't have to outrun the tiger. I only have to outrun you."

Thorsten Ball used this joke recently to make a point about AI and the average software engineer. The joke is more precise than he may have intended. It contains, in five sentences, both a correct economic model and a game-theoretic trap. The model: your value isn't absolute; it's relative to the next-best alternative. The trap: when everyone tightens their shoes, the tiger catches someone anyway, and the race never ends.

Sports analytics formalized this intuition decades ago. The framework is called VORP: Value Over Replacement Player.
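At its core, VORP is a subtraction, not a rating: production is measured against the freely available replacement-level baseline, not against zero. A minimal sketch of that idea, with invented numbers; real VORP formulas vary by sport and stat.

```python
def vorp(player_rate: float, replacement_rate: float, playing_time: float) -> float:
    """Value Over Replacement Player: production above the
    replacement-level baseline, scaled by playing time."""
    return (player_rate - replacement_rate) * playing_time

# A star producing 0.30 units/game over a 0.18 replacement level across
# 80 games, versus a solid starter at 0.22 over 60 games.
star = vorp(0.30, 0.18, 80)     # ≈ 9.6
starter = vorp(0.22, 0.18, 60)  # ≈ 2.4
```

The subtraction is what makes the tiger joke precise: when the replacement rate rises, your measured value falls even if your own production never changes.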

The Phantom Limb Economy

There are more employed musicians in the United States today than at any point since 1850. Over 221,000 of them, according to the US Census Bureau. The number gets cited with comforting regularity every time new technology threatens creative work. Phonograph? Musicians survived. Radio? Still here. Streaming? More than ever.

The data is real enough; it just doesn't tell you what you think it does. Arun Panangatt took the 221,000 figure apart, and what sits inside it undermines the argument the number is usually recruited to make.

MOVE: Metrics for the AI-Native Organization

We spent a decade measuring how fast teams ship code. Now the question is how fast the whole organization senses, decides, and acts.

MOVE measures what DORA cannot — how effectively an organization operates when intelligent systems participate in execution. Any organization can buy AI. MOVE asks whether AI changed how the organization operates.

The AI Capability Map: An Expanded Inventory

You don't get to opt out of commodity AI. That's what "commodity" means: not "cheap" or "boring" but "compulsory." Ivan Illich saw this pattern with electricity, automobiles, schools. The moment something becomes a utility, non-participation becomes deviance. Prasad Prabhakaran's recent Wardley map of enterprise AI capabilities plots where different technologies sit on the evolution axis. The map is useful. But its most important insight is implicit: everything in the Commodity column is no longer a choice.

What follows is an expanded inventory: the original categories, what's missing from each, and the harder question of what the categories themselves fail to capture. The act of mapping shapes what gets mapped. The categories we use determine the investments we make. And some capabilities don't fit the Genesis-to-Commodity axis at all.