# Voting with Fire
In Goliath's Curse, the existential-risk researcher Luke Kemp renames civilization. He calls it Goliath: a collection of dominance hierarchies in which some individuals dominate others to control energy and labor. Named for the Bronze Age warrior — imposing in stature, reliant on violence, surprisingly fragile. Across two hundred case studies spanning five millennia, Kemp documents the same structural dynamic. As hierarchical societies age, inequality concentrates, decision-making deteriorates, and the system grows brittle. Complexity scientists call this critical slowing down. A healthy system absorbs shocks and recovers quickly. An extractive one recovers more slowly from each successive disturbance, like an aging body that takes longer to heal from each injury, until a shock the system would once have absorbed tips it into collapse.
The curse is internal. Goliaths don't die from external assault. They hollow themselves out. The wealth pump transfers resources upward; the exchange between rulers and ruled grows more unequal; elites compete for shrinking returns; and the population that once sustained the structure loses both the incentive and the capacity to defend it. Then drought comes, or invasion, or rebellion, and the system that looked permanent proves to have been perched on a knife's edge for decades.
Software engineers will recognize this dynamic because they live inside it.
A codebase begins as a direct expression of intent: small, legible, responsive to change. As it ages, it accumulates. Dependencies multiply. Workarounds compensate for workarounds. Every new feature taxes the existing architecture, and every accommodation makes the next change harder. The system grows extractive: more effort goes toward maintaining the structure than producing value. Velocity decays. Incident recovery stretches. The valley of stability, the basin that pulls a resilient system back after each shock, grows shallower. Engineers who once built the system now serve it, their labor consumed by the reproduction of complexity that benefits no one except its own continuation.
This is Goliath's Curse applied to software. The codebase doesn't fail because of a single shock: a framework end-of-life, a key engineer leaving, a compliance mandate. It fails because it hollowed itself out long before the shock arrived.
## Lamentation Literature
Kemp identifies a genre he calls lamentation literature: texts written by displaced elites mourning the collapse of the old order. The archetype is the Admonitions of Ipuwer, an Egyptian text from roughly 2000 BCE, in which a scribe laments the fall of the Old Kingdom. The scribe spends almost no time on war or famine. His real horror is that servants now speak freely, the poor own tombs, women who once had nothing possess luxury goods, as though the dissolution of social hierarchy were itself a more grievous catastrophe than the dissolution of the state.
Every mature codebase generates its own lamentation literature. "We can't possibly rewrite this." "You don't understand the complexity." "It would take years to reproduce what we have." These statements present themselves as engineering assessments, sober acknowledgments of accumulated value. Read through Kemp's lens, they look different. They are the voice of invested incumbency: architects, senior engineers, team leads whose status derives from knowledge of the existing system. The complexity they defend is real. Whether it serves the engineers who work in it or the structure that produced it is a separate question, and the one that matters.
Two independent lineages of practice are converging on an answer. The first comes from spec-driven development, which has crystallized into an explicit engineering position. Ankit Jain, writing in Latent Space, states it without qualification: "Specs become the source of truth. Code becomes an artifact of the spec." StrongDM's internal rule: code must not be written by humans; code must not be reviewed by humans. The second lineage comes from eval-driven development: Hamel Husain and Shreya Shankar's insistence that the bottleneck in AI-assisted development is evaluation, not generation. If you can't measure whether the output is correct, you can't iterate. Write the evaluations first; the implementation follows.
These lineages meet in a specific place. In early 2026, OpenAI disclosed that a new product, built over five months and comprising a million lines of code, contained no manually written code. The engineering team's contribution was the environment: repository structure, architectural constraints, agent-accessible documentation, mechanical enforcement. Code was generated by agents. The environment, the harness within which generation occurred and the specifications against which output was verified, was the human artifact. The distinction between these two kinds of work is the hinge on which everything turns.
## Striking the Match
Kemp draws a distinction that matters more than any other in his book: the difference between Goliath falling on its own and being deliberately killed.
The evidence is archaeological. At Teotihuacan in central Mexico, sometime around 200 CE, a pyramid was built over the bodies of two hundred human sacrifices. Within fifty years, the city's inhabitants desecrated the pyramid in a termination ritual. No new pyramids were built. No more people were sacrificed. Wealth and energy were channeled instead into apartment complexes with internal drainage, plastered walls, and income inequality under half that of modern Denmark. When inequality later crept back through gated communities and isolated trade routes, the citizens responded with systematic destruction: over 150 buildings burned, every temple and pyramid demolished. Not by invaders. Not by enemies. By the people who lived there, choosing to destroy the structures that had come to dominate them.
At Tiwanaku in Bolivia, the monolithic Gateway of the Sun was toppled and storage jars were smashed in what bears the hallmarks of revolution: a deliberate dismantling of the physical infrastructure through which the ruling class had organized ritual life and concentrated surplus. The communities that formed afterward replicated old ceremonies in open, collective spaces while leaving the elite architecture in rubble. At Taosi in China, the city wall was razed, elite quarters were retrofitted into stone-tool workshops, royal tombs were destroyed, and the former palace was occupied by commoners surrounded by rubbish pits. The city lived on for two hundred more years. Archaeologists labelled the period "anarchy." Kemp notes that the city grew.
Voting with fire. Choosing to destroy the accumulated structure and rebuild from what actually matters.
The parallel in software has a name. A conversation with my engineering team crystallized what I'd been circling for months: the metric that actually matters is how well you can automate recreating the product from scratch. I call it the phoenix test. Feed your specifications and acceptance tests to agents. Destroy the codebase. Measure what returns.
The test works as a diagnostic. When the regenerated system breaks, the failure is legible: a specification that didn't capture the vendor API's actual behavior, a business rule that lived only in a senior engineer's memory, practical wisdom, what James C. Scott calls metis, embedded in code that no document had touched. One concrete instantiation: Playwright acceptance tests that completely describe the business cases for the UI. Tests that function as executable specifications, in a language agents can consume and target, capturing business behavior without prescribing implementation. The durable artifact is the eval. The code is what the eval verifies and discards.
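The distinction between specifying behavior and prescribing implementation can be made concrete. Below is a minimal Python sketch standing in for the Playwright suites the text describes: `Cart`, `DictCart`, `ListCart`, and the accumulation rule are all hypothetical names invented for illustration, not part of any real system.

```python
# Sketch: an executable specification that constrains behavior, not structure.
# The same spec runs against two implementations with entirely different shapes.
from typing import Protocol


class Cart(Protocol):
    """The behavioral surface the spec cares about; nothing else is fixed."""
    def add(self, sku: str, qty: int) -> None: ...
    def total_items(self) -> int: ...


def spec_cart_accumulates(make_cart) -> None:
    """Business rule: adding the same SKU twice accumulates quantity."""
    cart = make_cart()
    cart.add("SKU-1", 2)
    cart.add("SKU-1", 3)
    assert cart.total_items() == 5


class DictCart:
    """Implementation A, standing in for the old codebase."""
    def __init__(self):
        self._items: dict[str, int] = {}

    def add(self, sku, qty):
        self._items[sku] = self._items.get(sku, 0) + qty

    def total_items(self):
        return sum(self._items.values())


class ListCart:
    """Implementation B, standing in for a regenerated phoenix."""
    def __init__(self):
        self._lines: list[tuple[str, int]] = []

    def add(self, sku, qty):
        self._lines.append((sku, qty))

    def total_items(self):
        return sum(qty for _, qty in self._lines)


# Both implementations satisfy the same spec; the spec outlives either one.
spec_cart_accumulates(DictCart)
spec_cart_accumulates(ListCart)
```

The spec is the durable artifact here: either implementation can burn, and the rule it encodes still knows what must come back.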
Kemp's evidence on this point is unambiguous. The depth of destruction determines the quality of renewal. Greece after the Late Bronze Age collapse lost its writing system, its palace economies, and its cities, but kept the cultural memory and the particular geography — islands, mountains, a coastline fractured into natural harbors — that enabled communities to exit dominant structures and self-govern. It adopted an entirely new alphabet from the Phoenicians, as if the old notation system itself carried too much of the palace bureaucracy's DNA to be worth preserving. What emerged, over centuries, was the democratic polis — the city-state, participatory governance, the cultural flourishing we now treat as foundational to Western civilization. Kemp's formula: "the deeper the fall, the more likely it is that we'll rise again as democratic equals." The Assyrians, who weathered the same Bronze Age collapse with their monarchy intact, maintained their kings. The societies that burned everything got Athens.
Incremental refactoring is the Assyrian model. It preserves the existing architecture's assumptions: its hierarchies, its accumulated compromises, its structural debt. The phoenix, the regeneration from specifications, is the Greek model. The chance to emerge with something structurally better. But the chance materializes only if the specifications encode what actually serves the people who use the system, rather than what the old structure happened to accumulate.
## What Survives the Fall
Kemp's most important empirical finding: "The people and culture of collapsed states often persist after the fall." Rome's language, law, and institutions survived the empire by millennia. Quechua is still spoken by ten million people. The structure is fragile; Kemp's two hundred case studies leave no doubt. The knowledge is a different matter entirely, more durable than the arrangement that once housed it.
In a spec-driven world, the specification, the domain model, the eval suite are the cultural memory of the codebase. They encode what the system does, why it does it, and how to verify that any implementation meets the standard. What burns when the codebase burns is the arrangement: the particular accumulation of dependencies, workarounds, architectural decisions, and compromises that constituted this implementation. The knowledge persists; the extraction apparatus doesn't.
Intellectual honesty requires engaging with what might genuinely be lost, and here Scott's concept of metis cuts deepest: the practical, context-dependent knowledge that accrues in a living codebase but exists in no specification, no architecture diagram, no product document. The workaround for the vendor API that lies about its response format. The function that is technically wrong but compensates for a bug in a dependency that will never be fixed, and that three engineers over five years have independently considered removing before discovering why it exists. The naming convention encoding institutional memory of a failed migration three years ago. The comment that says // DO NOT CHANGE — see incident #4471.
Some of this is genuine metis, practical knowledge that no specification captures because no specification knew to ask, while some is what Kemp would recognize as the residue of extraction: compromises that serve the structure rather than the people who work within it, workarounds whose original justification evaporated years ago but whose removal would require understanding a dependency chain that nobody alive in the organization fully maps. The phoenix test distinguishes between the two. What breaks when you regenerate tells you what the specs didn't know. What works better tells you what the old system was carrying that served no one.
Post-collapse societies made this distinction with remarkable clarity. The people who left Tiwanaku replicated ceremonies that served the community and left elite monuments in rubble. The Cherokee replaced a centralized priesthood, the Ani-Kutani, with individual healers and conjurers, while the Huhugam stopped burying elites in special coffins and began cremating all the dead equally, a transformation in which the ritual technology persisted but the hierarchy it had served was deliberately excised. The pyramid cities of Cahokia, Moundville, and the Huhugam dispersed into smaller communities, and the people reorganized religion, governance, and food production specifically to prevent the old hierarchy from re-emerging. These weren't failures of memory. They were acts of curation, deliberate choices about which knowledge to carry forward and which to leave in the ruins.
Curation requires preparation. Every successful case of deliberate destruction in Kemp's evidence shares a common feature: the communities that rebuilt had done specific, often invisible work before the fire. The question that matters for software engineering is not whether deliberate destruction can produce better systems (the evidence is clear) but what a codebase designed for regeneration looks like before the match is struck.
## The Preconditions
Kemp's most consistent finding about successful fire: it requires remembered alternatives. The Teotihuacan citizens who burned 150 buildings had spent generations in the pre-pyramid city, where no artwork shows one human subjugating another. San Lorenzo's residents who smashed the great stone heads still carried hunter-gatherer traditions in living memory. The people at every successful site of deliberate destruction already knew what the non-hierarchical version looked like, because they had lived it or their parents had.
You cannot phoenix a codebase that has no specifications. The match cannot be struck if the cultural memory doesn't exist yet. Most engineering teams are in exactly this position: their knowledge lives in the code. The business rules are embedded in conditionals. The vendor API's actual behavior, as opposed to its documented behavior, is encoded in workarounds. The domain model exists as a network of implicit assumptions distributed across thousands of files. Burning the code would burn the knowledge.
The first precondition for the phoenix is writing the specifications while the system still works. Not as documentation of what the code does. That is archaeology performed on a living city, and it misses the same things post-hoc archaeology always misses. Executable specifications: Playwright acceptance tests that describe what the system owes its users, API contracts that define boundaries, schema definitions that capture the data model, performance thresholds expressed as verifiable assertions. Written to describe the behavior the code must produce. A description of code encodes the implementation's assumptions; a description of behavior stands independent of any particular implementation, which is what makes it survivable.
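A "performance threshold expressed as a verifiable assertion" might look like the following sketch. The 50 ms budget, the `lookup` function, and the workload size are illustrative assumptions, not figures from the text; the point is that the threshold lives as an assertion a machine can check, not as prose in a wiki.

```python
# Sketch: a non-functional requirement written as an executable assertion,
# so it survives the implementation it currently measures.
import time


def lookup(table: dict, key: str) -> str:
    """Stand-in for whatever hot-path operation the budget governs."""
    return table.get(key, "")


def spec_lookup_latency_budget() -> None:
    """Spec: 10,000 lookups must complete within a 50 ms budget."""
    table = {f"k{i}": f"v{i}" for i in range(10_000)}
    start = time.perf_counter()
    for i in range(10_000):
        lookup(table, f"k{i}")
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 50, f"latency budget exceeded: {elapsed_ms:.1f} ms"


spec_lookup_latency_budget()
```

Any regenerated implementation inherits the same budget automatically; the assertion neither knows nor cares how `lookup` is built.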
The second precondition is exit options. Kemp shows that fire only succeeds when people can leave: Greece's islands and mountains, the open plains beyond Cahokia's palisade walls. In software, exit options mean the team is not hostage to the codebase. Every piece of metis that exists only in code or only in an engineer's head is a chain binding the organization to the current implementation. Extracting that knowledge into specifications is the engineering equivalent of building roads out of the city before you set it alight.
The third precondition is networks. Neanderthals lived in small, isolated groups; when crisis hit, they couldn't recover. Homo sapiens maintained sprawling networks that functioned as genetic and cultural safety nets. The engineering parallel is interoperability: a system built on documented contracts, standard protocols, and explicit boundaries can be reassembled from components. A system built on implicit assumptions and undocumented integrations can only be maintained, never regenerated. The difference between a network and an island is the difference between a codebase that can rise from its specifications and one that dies with its implementation.
## The Termination Ritual
In Mesoamerica, the practice of defacing, smashing, and burning ritual objects has a name: the termination ritual. It severs the cosmic connection between a site and a deity. The plumed serpent statues at Teotihuacan weren't merely destroyed — they were ritually defaced, their authority formally revoked. This wasn't vandalism. It was liturgy: a community formally declaring that the old order's legitimacy had ended.
The phoenix test is the termination ritual applied to software. The metric: how well can you automate recreating the product from scratch? This is not a thought experiment. It is a measurable quantity: the percentage of business behavior that your specification suite can regenerate without human intervention. Run it on a module. Run it on a feature. Run it on the entire application. The gap between what returns and what the production system does is the direct measurement of accumulated Goliath: the complexity that serves the structure rather than the users.
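Assuming the acceptance suite reports per-specification pass/fail results, the phoenix score reduces to a simple ratio. The spec names and results below are hypothetical, a sketch of the bookkeeping rather than a real harness.

```python
# Sketch: the phoenix test as a number — the fraction of specified behavior
# that a regenerated system brings back without human intervention.

def phoenix_score(results: dict[str, bool]) -> float:
    """Fraction of specifications the regenerated system satisfies (0.0-1.0)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical run: a regenerated checkout module after one iteration.
run = {
    "checkout/guest-purchase": True,
    "checkout/saved-card": True,
    "checkout/vendor-api-retry": False,  # metis the specs never captured
    "checkout/tax-rounding": False,      # rule that lived in one engineer's head
}

score = phoenix_score(run)
failures = [name for name, passed in run.items() if not passed]
print(f"phoenix score: {score:.0%}")  # each failure names the next spec to write
```

The failing entries are the output that matters: each one is a gap between the cultural memory and the production system, now visible and nameable.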
When the phoenix test fails, the failure is its most valuable output. A specification that didn't capture the vendor API's actual behavior surfaces a piece of metis that no one knew was missing. A business rule that lived only in a senior engineer's memory becomes visible because the regenerated system lacks it. The function that compensates for a dependency bug that will never be fixed, the comment that says // DO NOT CHANGE. Each failure reveals itself as either genuine practical wisdom that the specs must learn to encode, or residue of extraction that can finally be released.
The critical insight is cadence. The termination ritual is not a once-in-a-decade dramatic rewrite. The Lakota assembled authoritarian command structures for every bison hunt and dissolved them when the hunt ended — not once, but seasonally, as a regular practice. The phoenix test works at the same rhythm. Run it against individual features weekly. Run it against modules monthly. Each run reveals gaps. Each gap, once encoded into specifications, makes the next run more complete. The specifications get richer. The code gets more disposable. The system learns what it actually needs by repeatedly discovering what it can regenerate.
## The Harness as Cultural Memory
Rome's language, law, and institutions survived the empire by millennia. The people who left Tiwanaku replicated old ceremonies in open, collective spaces while leaving elite monuments in rubble. The pattern holds across every case: the infrastructure fell while the cultural knowledge outlasted it by centuries.
What survives the phoenix is the harness: the environment within which code is generated, verified, and deployed.
OpenAI's experiment demonstrated what the harness can look like at scale. Their engineering team's primary contribution was not code but environment: a structured repository treated as the system of record, where all knowledge lives as versioned, agent-readable artifacts. Architectural constraints enforced mechanically through custom linters. Progressive disclosure: a short map pointing to structured documentation, not a thousand-page instruction manual. Recurring agents that scan for deviations and enforce quality. The result: a million lines of code, hundreds of users, three to seven engineers steering, zero lines written by hand.
Their most important discovery: early progress was slow not because the agent was incapable, but because the environment was underspecified. Every fix pointed in the same direction: identifying what capability was missing and making it legible and enforceable for the agent. The discipline they developed (designing environments, specifying intent, building feedback loops) is what makes agents effective. This is harness engineering, and it is the necessary foundation regardless of whether the codebase it serves is permanent or seasonal.
But OpenAI built their harness for permanence. Code accumulates. The harness maintains it. In Kemp's terms, this is the Assyrian model: weathering each crisis with the monarchy intact. The architecture stays. The complexity compounds. The harness prevents collapse but not the gradual hollowing-out that Goliath's Curse predicts.
The phoenix harness is designed differently. Its purpose is not to sustain the code but to survive the code's destruction. The key structural move is distinguishing between two kinds of artifacts in the repository: the durable and the disposable. The durable artifacts — specifications, acceptance tests, architectural constraints, domain models, quality invariants — are the cultural memory. They are maintained with the same rigor that production code receives today, because in a phoenix architecture, they are the product. The disposable artifacts — application code, configuration, infrastructure definitions, internal tooling — are generated from the durable artifacts and can be regenerated at will.
The hierarchy of care inverts. Specifications are reviewed with scrutiny. Code is reviewed by agents. The human contribution migrates from writing implementations to curating the knowledge that generates them, in the same way post-collapse communities maintained ceremonies in open spaces while leaving palaces in rubble. The ceremonies persisted because they served the community. The palaces were rebuilt as each generation saw fit.
In concrete terms, the durable layer of a phoenix repository might contain:
| Layer | Artifacts | What it specifies |
|---|---|---|
| Behavioral | Playwright acceptance tests, user journey specs | What the system does for users (pass/fail) |
| Contractual | OpenAPI specs, schema definitions, integration contracts | External boundaries the system must honor (exact) |
| Structural | Architectural linters, dependency direction rules, quality thresholds | How the system is constrained (invariants only) |
| Knowledge | Design decisions, product principles, domain model, encoded metis | Why decisions were made and what the domain means |
| Orchestration | Generation workflows, CI templates, eval harnesses | How to build and verify (the process itself) |
Everything outside this structure, the application code, the configuration, the built assets, is generated and regenerable. This table is the cultural memory. It is the language, law, and ceremony that survive the fall.
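One way to make the durable/disposable split mechanical is a path convention the harness can enforce. The directory names below are illustrative assumptions, not a standard layout; the point is that "what burns" should have an executable answer.

```python
# Sketch: classifying repository paths into cultural memory (durable) and
# regenerable output (disposable). Path prefixes are invented conventions.

DURABLE_PREFIXES = ("specs/", "contracts/", "constraints/", "knowledge/", "harness/")


def is_durable(path: str) -> bool:
    """Durable artifacts are maintained by humans; everything else regenerates."""
    return path.startswith(DURABLE_PREFIXES)


repo = [
    "specs/checkout.spec.ts",            # behavioral layer
    "contracts/orders.openapi.yaml",     # contractual layer
    "constraints/dependency-rules.json", # structural layer
    "knowledge/domain-model.md",         # knowledge layer
    "harness/generate.yaml",             # orchestration layer
    "src/checkout/cart.py",              # disposable: generated from specs/
    "config/deploy.yaml",                # disposable: generated from harness/
]

to_burn = [p for p in repo if not is_durable(p)]
print(to_burn)  # only the disposable paths survive the filter
```

A rule this blunt is the software equivalent of knowing, before the fire, which buildings are ceremonies and which are palaces.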
What the specifications encode matters as much as whether they exist. A codebase carries what Pierre Bourdieu called habitus: durable dispositions that actors internalize to the point of unconsciousness. Naming conventions, architectural patterns, the implicit choices about what goes where. These instincts reproduce the existing structure, and they are what the lamentation literature defends. The feel of the system.
When agents regenerate from specifications, the implementation carries a different habitus, shaped by the agent's training rather than the original team's practice. The phoenix may look and feel different from the original. File structures, component patterns, and internal architecture may all reorganize. This is not a failure of specification. It is the mechanism through which the phoenix escapes the old structure's accumulated habits, and it must be understood by anyone commissioning the fire: the new city will serve the same people, answer the same needs, pass the same acceptance tests. It will not replicate the old city's floor plan.
This produces a tension that the specifications must resolve. Specify too much (folder layout, component naming, internal architecture) and the phoenix reproduces the old system deterministically, rising as the same bird. Specify too little and it fails its users. Post-collapse societies resolved this by specifying ceremonies, what to perform and what community needs to serve, while leaving architecture free. They did not rebuild the same palaces.
| Specify (outward-facing) | Leave free (inward-facing) |
|---|---|
| User behavior (acceptance tests) | Folder structure |
| API contracts (OpenAPI, GraphQL) | Framework choice |
| Data schemas and integration points | Component design |
| Performance and quality thresholds | Internal architecture |
| Accessibility and security requirements | Naming conventions |
The phoenix should pass all the tests and honor all the contracts. It may use a different framework, a different internal architecture, different conventions. The variation is diagnostic: it reveals which choices in the old system were structurally necessary and which were habits the system carried for its own sake.
## Counter-Dominance
The Cherokee encoded oral traditions warning each generation against centralized priesthoods. The Huhugam redesigned burial practices so no one could accumulate status through death. The Puebloans told the story of the Great Gambler, a figure who tried to own everyone's property and was overthrown by the disenfranchised, as a cultural vaccine against the re-emergence of hierarchy. These weren't passive memories. They were active immune systems: practices specifically designed to detect and destroy dominance before it could consolidate.
Every codebase, even one generated from specifications, accumulates entropy. Patterns drift. Workarounds appear. Abstractions calcify. OpenAI discovered this directly: their team spent twenty percent of every Friday cleaning up "AI slop," suboptimal patterns that agents replicated because they already existed in the repository. Their solution was to encode what they called "golden principles," opinionated mechanical rules, and deploy recurring agents that scan for deviations, update quality grades, and open targeted refactoring pull requests. Human taste captured once, enforced continuously on every line of code.
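A "golden principle" enforced mechanically might look like the sketch below: a dependency-direction check built on Python's `ast` module. The layer names and the specific rule, that `domain` code must never import from `infrastructure` or `adapters`, are illustrative assumptions, not OpenAI's actual linters.

```python
# Sketch: one opinionated rule, captured once and enforced on every line —
# a counter-dominance agent in miniature.
import ast

# layer -> top-level modules that layer must never import
FORBIDDEN = {"domain": {"infrastructure", "adapters"}}


def check_dependency_direction(layer: str, source: str) -> list[str]:
    """Return violation messages for imports that point the wrong way."""
    banned = FORBIDDEN.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root in banned:
                violations.append(f"{layer} imports {node.module} (line {node.lineno})")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in banned:
                    violations.append(f"{layer} imports {alias.name} (line {node.lineno})")
    return violations


good = "from domain.orders import Order\n"
bad = "from infrastructure.db import session\n"
print(check_dependency_direction("domain", good))  # []
print(check_dependency_direction("domain", bad))   # one violation reported
```

Run on every commit, a check like this is taste made tireless: the rule never forgets, never defers to seniority, and never grows attached to the code it polices.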
This is counter-dominance engineering: the practice of building active immune systems into the codebase to detect and prevent the re-accumulation of structural complexity. But the phoenix model offers something that garbage collection within a permanent codebase cannot. When accumulated entropy exceeds a threshold, when the gap between what the phoenix test regenerates and what the production system runs grows too wide, you regenerate. The specifications have been continuously enriched by every failure the previous phoenix tests revealed. The architectural constraints have been tightened by every drift the counter-dominance agents detected. The domain model has absorbed every piece of metis that the termination ritual surfaced. And so the next generation of code, risen from these richer specifications, starts cleaner than any refactoring could have achieved, because the knowledge had already been extracted from the code into the specifications before the fire was lit.
## The Seasonal Codebase
The most radical insight in Kemp's work is not about collapse. It's about what came before.
For roughly 290,000 of the 300,000 years Homo sapiens has existed, the default social mode was oscillation. The Lakota, Cheyenne, and Shoshone built authoritarian command structures for bison hunts: scouts, police, strict hierarchies of decision-making enforced with real consequences for anyone who broke formation and spooked the herd. When the hunt ended, the structures dissolved completely and authority returned to its baseline, because the hierarchy had been understood from the start as instrumental to a specific purpose rather than as a permanent ordering of social life.
"Disassembling a hierarchy is neither scary nor remarkable," Kemp writes, "if that hierarchy was never meant to be permanent."
The permanent complex society — the city, the empire, the Goliath — is a ten-thousand-year experiment that, on the timescale of the species, amounts to an anomaly, and the evidence suggests it has been a costly one. Kemp documents that egalitarian settlements in southwest Asia averaged over three thousand years of persistence, and that longevity declined as hierarchy increased. The more complex the structure, the shorter its lifespan.
The permanent codebase — the monolith that accumulates over years, that engineers maintain and extend and refactor and deprecate — is software's Goliath. The arrangement we've treated as the natural order. But the convergence of spec-driven development, eval engineering, and generative AI points toward something different: that the code itself could become seasonal. Assembled from specifications for current need, regenerated when requirements shift, and never treated as a monument to be preserved, because the implementation, like the bison-hunt hierarchy, was understood from the start as instrumental to a purpose rather than as a permanent arrangement of institutional life.
Seasonal implies a rhythm. The question is how long each phoenix should live. The archaeological evidence suggests two cadences operating simultaneously. The Lakota's oscillation was regular: assemble for the hunt, dissolve when it ends, repeat with the seasons. The larger fires (Teotihuacan, Taosi, the North American pyramid cities) were event-driven: triggered when inequality re-accumulated past a threshold the community would tolerate.
A phoenix codebase operates on both. Individual features and modules are regenerated continuously, tested against their specifications on a rolling basis, the way a forest replaces individual trees at different ages. Full system regeneration is triggered by signals: the phoenix test score declining across domains, a major shift in requirements that makes the current architecture structurally wrong, or a leap in agent capability that means regeneration would produce better output from the same specifications.
| Cadence | Scope | Trigger |
|---|---|---|
| Continuous | Individual features | Phoenix test score per feature declines |
| Monthly | Modules and domains | Accumulated entropy detected by counter-dominance agents |
| Quarterly | Architecture review | Gap analysis between spec coverage and production behavior |
| Event-driven | Full regeneration | Major requirement change, platform shift, or agent capability leap |
Each generation of code lives until the specifications outgrow it. Each generation of specifications is richer than the last, because every phoenix test that revealed a gap produced a new specification to fill it. The system improves by improving the knowledge from which the code is generated.
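The cadence table can be read as a trigger function. Assuming the counter-dominance agents emit a handful of numeric signals, a decision sketch might look like the following; every threshold and field name here is invented for illustration.

```python
# Sketch: mapping observed signals to the widest regeneration cadence they trip.
from dataclasses import dataclass


@dataclass
class Signals:
    feature_phoenix_score: float  # continuous: per-feature regeneration rate
    module_entropy: float         # monthly: drift flagged by linting agents
    spec_coverage_gap: float      # quarterly: spec coverage vs production behavior
    platform_shift: bool          # event-driven: requirement or capability leap


def regeneration_scope(s: Signals) -> str:
    """Widest scope whose trigger fires; thresholds are illustrative."""
    if s.platform_shift:
        return "full"
    if s.spec_coverage_gap > 0.20:
        return "architecture-review"
    if s.module_entropy > 0.10:
        return "module"
    if s.feature_phoenix_score < 0.95:
        return "feature"
    return "none"


print(regeneration_scope(Signals(0.90, 0.02, 0.05, False)))  # feature
```

The function is trivial on purpose: the decision to burn should be as mechanical and as legible as the linters that guard the code between fires.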
This is not disposability as nihilism, not the accelerationist's "move fast and break things" dressed in archaeological clothing. It's disposability as the oldest form of resilience our species knows. The Lakota didn't lack the capacity to maintain permanent hierarchies. They chose otherwise, because they understood that permanent structures accumulate the interests of those who sit atop them. The seasonal structure serves the hunt. The permanent structure serves itself.
The eval suite endures. The domain model endures. The acceptance tests that encode business behavior endure. The harness — the environment, the constraints, the maps that enable agents to generate reliable code — endures, carrying forward in executable form the accumulated understanding of what the system owes its users. These are the cultural memory: the language, the law, the ceremonies that outlast the fall of any particular implementation.
What we've been calling "the codebase" is the palace, the pyramid, the city wall. It can be destroyed. It can be rebuilt from specifications that encode what actually matters. And what emerges from the fire, if the specifications are honest, if the evals capture what the system owes its users rather than what the old structure happened to accumulate, may be freer, more responsive, and more structurally sound than anything the old Goliath could have produced.
Teotihuacan's citizens burned 150 buildings and built apartment complexes more equal than Denmark. They knew, before they struck the match, what the city was for.
## Sources
- Luke Kemp, Goliath's Curse: What the Rise and Fall of Empires Tells Us About the Modern World (Allen Lane, 2025)
- James C. Scott, Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed (Yale University Press, 1998)
- Hamel Husain and Shreya Shankar, "Your AI Product Needs Evals" (2025): https://hamel.dev/blog/posts/evals-faq/
- Ankit Jain, "How to Kill the Code Review," Latent Space (2026)
- Ryan Lopopolo, "Harness Engineering: Leveraging Codex in an Agent-First World," OpenAI (2026)