The Ghost of Verification
A surgical checklist, a CI pipeline, a red team, and a free press walk into a bar. They are all the same thing.
“Trust, but verify.” — Russian proverb, famously adopted by Ronald Reagan
In 2001, in a surgical ICU at Johns Hopkins Hospital in Baltimore, a critical-care specialist named Peter Pronovost watched another patient die of a central-line infection. The patient was a middle-aged man. The infection was a bloodstream infection caused by the catheter — a tube inserted into a large vein in his chest. The death was preventable. Not in the loose sense that most deaths are theoretically preventable with better luck or better technology. Preventable in the specific, infuriating sense that everyone in the room already knew exactly how to prevent it.
The five steps that prevent catheter-related bloodstream infections had been published for years: wash your hands, clean the patient’s skin with chlorhexidine, put sterile drapes over the entire patient, wear a sterile mask and gown and gloves, and put a sterile dressing over the insertion site once the line is in. Five steps. None of them controversial. None of them expensive. None of them requiring new equipment or new skills. The problem wasn’t knowledge. It was compliance. On any given day, at one of the best hospitals on earth, at least one of these five steps was skipped in roughly a third of catheter insertions. Doctors forgot. Doctors were rushed. Doctors thought they’d done it when they hadn’t. The steps were simple. Doing them every single time, across thousands of procedures, was not.
So Pronovost did something that felt almost insultingly simple. He put the five steps on a sheet of paper. A checklist. One column, five rows, a checkbox next to each. He then did something far less simple: he convinced the hospital administration to give nurses the authority to stop a procedure if a doctor skipped a step. In the hierarchical culture of American medicine, where surgeons occupied a status somewhere between airline captains and minor deities, this was roughly equivalent to giving a flight attendant authority to override the captain mid-takeoff. It was a political intervention disguised as a bureaucratic one.
The results were staggering. Over the next eighteen months at Johns Hopkins, the ten-day line-infection rate dropped from 11% to zero. Not “significantly reduced.” Zero. Pronovost estimated that the checklist had prevented forty-three infections and eight deaths — in a single ICU, in a single hospital, in a single eighteen-month period. Michigan later adopted the checklist statewide through the Keystone Initiative. Within three months, catheter infections dropped 66%. Within eighteen months, the program had saved an estimated 1,500 lives and $175 million in healthcare costs.
Atul Gawande documented the story in The Checklist Manifesto and it has since become one of the most cited examples of a low-cost, high-impact intervention in modern medicine. But here’s what’s strange about it: the hero of this story isn’t the surgeon. It isn’t the nurse. It isn’t even Pronovost, though he deserved every award he received. The hero is the sheet of paper. A stupid, boring, unimaginative sheet of paper that asked five questions and demanded five answers before a needle went into a patient’s chest. The sheet of paper had no medical knowledge. It couldn’t adapt to unusual cases. It couldn’t exercise clinical judgment. All it could do was ask the same five questions, in the same order, every single time, and refuse to go away until someone answered them.
Now hold that image — a checklist in an operating room — and jump to an entirely different world.
At Bloomberg, Matthew Scherer wrote about how software teams running AI systems discovered that the hardest part of deploying machine learning in production wasn’t training the model. It was verifying that the model still worked the way you thought it did after each update. They built elaborate testing pipelines — automated checks that ran every time code changed, confirming that outputs matched expectations, that latency stayed within bounds, that edge cases didn’t produce garbage.
In data engineering, Great Expectations became one of the fastest-growing open-source libraries by doing nothing more than letting teams write assertions about their data: this column should never be null, this value should always be between 0 and 1, this table should have exactly as many rows today as it did yesterday, plus or minus 5%. Data contracts — verification for datasets.
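To make the idea concrete, here is a minimal sketch of those three assertions in plain Python. It does not use the Great Expectations API. The helper names (`expect_not_null`, `expect_between`, `expect_row_count_near`, `validate`) and the sample rows are hypothetical, chosen only to mirror the sentences above.

```python
from typing import Callable, Optional

# An expectation takes the rows and returns None on success or a failure message.
Expectation = Callable[[list], Optional[str]]


def expect_not_null(column: str) -> Expectation:
    """This column should never be null."""
    def check(rows):
        missing = sum(1 for r in rows if r.get(column) is None)
        return None if missing == 0 else f"{column}: {missing} null value(s)"
    return check


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Every non-null value should lie in [low, high]."""
    def check(rows):
        bad = [r[column] for r in rows
               if r.get(column) is not None and not (low <= r[column] <= high)]
        return None if not bad else f"{column}: {len(bad)} value(s) outside [{low}, {high}]"
    return check


def expect_row_count_near(previous: int, tolerance: float = 0.05) -> Expectation:
    """Row count should match yesterday's, plus or minus `tolerance`."""
    def check(rows):
        drift = abs(len(rows) - previous)
        if drift <= tolerance * previous:
            return None
        return f"row count {len(rows)} differs from {previous} by more than {tolerance:.0%}"
    return check


def validate(rows, expectations):
    """Run every expectation and collect the failure messages."""
    return [msg for msg in (e(rows) for e in expectations) if msg is not None]


# Hypothetical sample data with two deliberate violations.
today = [
    {"user_id": 1, "score": 0.7},
    {"user_id": 2, "score": 1.4},      # violates the [0, 1] range
    {"user_id": None, "score": 0.2},   # violates not-null
]
failures = validate(today, [
    expect_not_null("user_id"),
    expect_between("score", 0.0, 1.0),
    expect_row_count_near(previous=3),
])
```

Each expectation is declared separately from the data it checks, which is the whole trick: the claim exists before the dataset arrives.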
In security, Bruce Schneier has spent decades arguing in Schneier on Security that the fundamental problem isn’t building secure systems. It’s verifying that systems remain secure over time, against adversaries who are actively trying to circumvent your defenses. His phrase is memorable: “Security is a process, not a product.”
Different worlds. Different stakes. Different vocabularies. But the same ghost is haunting all of them. The ghost is verification.
The Oddities
Before we define what verification is, notice some oddities about it.
First, the hardest problems in computer science don’t resist verification because we lack the tools. They resist it because the problem itself is poorly specified. Leslie Lamport, who won the Turing Award for his work on distributed systems, has pointed out that formal verification — the mathematical kind, where you prove a program correct — is only as good as your specification. You can formally verify that a program does what the spec says. You cannot formally verify that the spec captures what you actually wanted. There is always a gap between the formal model and reality, and that gap is where bugs live.
Second, verification has a jagged frontier. Andrej Karpathy described GPT-4 and similar models as having a “jagged intelligence” — brilliant at some tasks, bafflingly incompetent at others, with no obvious pattern to which is which. Verification has the same shape. Some systems are trivially verifiable (does the bridge hold the weight? load it and measure). Others are almost impossible (does this hiring algorithm discriminate? against what baseline? measured how? over what time frame?). The boundary between the verifiable and the unverifiable is itself poorly understood.
Third, there’s a political dimension. Ivan Zhao, the founder of Notion, has spoken about how software tools shape the epistemology of organizations — what counts as knowledge, what counts as evidence, what questions are even possible to ask. Verification is not neutral. Choosing what to verify is choosing what matters. Choosing not to verify is, very often, choosing not to know.
This cuts in dark directions too. Rwanda’s government systematically suppressed independent verification mechanisms — press freedom, judicial independence, international monitoring — in the years before and during the 1994 genocide. The absence of verification didn’t create the violence, but it removed every institutional mechanism that might have slowed or exposed it earlier. Verification isn’t just a technical mechanism. It’s an immune system. And when you suppress it, you’re not just losing information. You’re losing the capacity to course-correct.
The pattern is consistent across political contexts. Authoritarian regimes don’t suppress verification mechanisms by accident. They do it systematically, because independent verification is structurally incompatible with unchecked power. A free press, an independent judiciary, an international monitoring body — these are all verification mechanisms applied to governance. Remove them and you don’t just lose information. You lose the system’s ability to detect and correct its own errors. You lose the immune system.
So what is verification, really?
Verification vs. Validation
The cleanest distinction comes from ISO 9000 and systems engineering. It sounds pedantic, but it’s load-bearing:
Verification asks: Did we build the thing right? This is truth relative to a specification. Did the surgeon check all five boxes on the checklist? Did the CI pipeline pass all tests? Did the bridge meet the load-bearing requirements in the structural analysis?
Validation asks: Did we build the right thing? This is truth relative to reality. Does the checklist actually reduce infections? Does the test suite catch the bugs that matter in production? Does the bridge serve the community that needs it?
Verification is necessary but not sufficient for validation. You can verify perfectly against a flawed spec and end up with a system that works exactly as designed and is completely useless. The Tacoma Narrows Bridge was verified — the calculations were correct. It was not validated — the model didn’t account for aeroelastic flutter, and the bridge tore itself apart in a 40 mph wind.
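A toy illustration of the same gap in code, as a sketch only. The spec below says a sort routine must return a list of the same length in non-decreasing order, and a useless implementation passes it. The function names are hypothetical, and `meets_intent` is a stand-in: real validation checks against the world, not against a second predicate.

```python
def meets_spec(inp: list, out: list) -> bool:
    """The written spec: same length as the input, in non-decreasing order."""
    return len(out) == len(inp) and all(a <= b for a, b in zip(out, out[1:]))


def useless_sort(xs: list) -> list:
    """Passes the spec without sorting anything: repeats the smallest element."""
    return [min(xs)] * len(xs) if xs else []


def meets_intent(inp: list, out: list) -> bool:
    """What we actually wanted: the spec, plus 'the output rearranges the input'."""
    return meets_spec(inp, out) and sorted(inp) == sorted(out)


data = [3, 1, 2]
result = useless_sort(data)               # [1, 1, 1]
verified = meets_spec(data, result)       # True: built "right" relative to the spec
validated = meets_intent(data, result)    # False: not the right thing
```

The check did its job perfectly. The spec was missing a clause, and no amount of re-running the check would have revealed that.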
This distinction matters because most of the systems we’ll examine in this essay operate at the verification layer. The surgical checklist verifies process compliance. The CI pipeline verifies code behavior against test cases. The financial audit verifies books against accounting standards. Validation — the deeper question of whether the spec itself is right — requires a different set of mechanisms, ones we’ll return to later.
For now, let’s stay with verification. Because even at this layer, the mechanism is more interesting than it first appears.
The Three-Step Dance
Every verification system, no matter how different it looks on the surface, performs the same three-step dance:
Step 1: Statement Specification. You declare what should be true. This is the checklist item, the test assertion, the compliance requirement, the hypothesis. It has to be falsifiable — if you can’t imagine a world where the statement fails, it isn’t a real statement, it’s a wish.
Step 2: Solution Specification. You build or define the thing that should satisfy the statement. This is the surgery, the code, the financial books, the bridge design. It’s the artifact being verified.
Step 3: The Test. You run a procedure that compares the artifact against the statement. This is the moment the checklist gets checked, the CI pipeline runs, the auditor opens the books, the load test begins.
The three steps are deceptively simple. The power comes from their relationship. The statement constrains what counts as success before the solution is built. This is the crucial asymmetry: verification is forward-looking in its specification and backward-looking in its execution. You define the criteria first, then check them after. If you define the criteria after seeing the result, you’re not verifying — you’re rationalizing.
Netflix’s chaos engineering practice makes this explicit. Before they kill a server, before they inject latency, before they simulate a region failure — they write a hypothesis: “We believe that injecting 200ms of latency between Service A and Service B will not increase error rates above 0.1%.” The hypothesis comes first. The experiment comes second. If they ran the experiment first and then asked “was that okay?” they’d be doing post-hoc storytelling, not verification.
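The ordering can be enforced in code rather than left to discipline. The sketch below is not Netflix's tooling. `Hypothesis`, `run_experiment`, and the stubbed fault injection are hypothetical names, and the observed error rate is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the criteria cannot be edited after the fact
class Hypothesis:
    description: str
    metric: str
    max_allowed: float


def run_experiment(hypothesis: Hypothesis, inject_fault) -> dict:
    """Require the hypothesis up front, then run the fault and compare."""
    observed = inject_fault()  # expected to return {metric_name: value}
    value = observed[hypothesis.metric]
    return {
        "hypothesis": hypothesis.description,
        "observed": value,
        "held": value <= hypothesis.max_allowed,
    }


# Step 1: state the claim before touching the system.
h = Hypothesis(
    description="200ms of latency between Service A and Service B "
                "keeps the error rate at or below 0.1%",
    metric="error_rate",
    max_allowed=0.001,
)


# Step 2: the experiment. A stub stands in for real fault injection.
def fake_latency_injection() -> dict:
    return {"error_rate": 0.0004}  # invented observation


# Step 3: compare the observation against the pre-declared threshold.
report = run_experiment(h, fake_latency_injection)
```

Because `run_experiment` cannot be called without a `Hypothesis`, and the hypothesis cannot be mutated once created, the "was that okay?" version of the experiment is not expressible.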
The Three Properties
Across every domain I’ve examined — from medicine to software to finance to national security — verification systems that actually work share three properties. Remove any one and the system degrades, sometimes catastrophically.
1. An Object of Truth. There must be a clear, falsifiable statement that defines what “correct” means. Not a vague aspiration. Not “the system should be reliable” or “the code should be clean.” Something you can point to and ask: is this true or false? In data engineering, it’s a data contract: “this column contains only ISO 8601 timestamps and none of them are in the future.” In aviation, it’s a checklist item: “flaps set to 15 degrees.” In identity systems, it’s a cryptographic assertion: “this signature was produced by the private key that matches this public key.” The object of truth is the thing that makes verification possible. Without it, you have opinions.
2. Independence. The person or system doing the verification must be structurally independent from the person or system that built the artifact. This is the deepest principle. Pronovost didn’t just create a checklist — he gave nurses the authority to enforce it, creating a verification agent independent of the surgeon. In software, the CI server is independent of the developer who wrote the code. In finance, the auditor is independent of the company being audited. In journalism, the fact-checker is independent of the reporter. When builder and verifier are the same entity, verification collapses into self-certification, and self-certification is worth exactly nothing when the pressure is on.
3. Repeatability. The check must produce the same result every time it’s run against the same artifact. A verification step that passes on Tuesday and fails on Thursday — with nothing having changed — isn’t verification, it’s noise. Repeatability is what distinguishes a verification system from an opinion. It’s why automated tests are more trustworthy than manual reviews for catching regressions: the test doesn’t get tired, doesn’t get bored, doesn’t think “I checked this last time and it was fine.” Repeatability also means the check is woven into the lifecycle, not performed once and forgotten. A security audit that happens once a year is a snapshot. A security scan that runs on every commit is verification.
| Property | What it means | In practice |
|---|---|---|
| Object of Truth | A clear, falsifiable statement | "timestamps never lie in the future" |
| Independence | Builder ≠ Verifier | "the CI server, not the developer" |
| Repeatability | Same check tomorrow = same answer | "woven into the lifecycle, not a one-off" |
These three properties — object of truth, independence, repeatability — form a kind of diagnostic framework. When a verification system fails, you can almost always trace the failure back to one of these properties being absent or compromised. Enron’s auditors weren’t independent. Theranos’s blood tests weren’t repeatable. The pre-2008 credit ratings had no real object of truth — the models were calibrated to historical data that didn’t include the scenario that actually happened.
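The diagnostic can be written down as a short routine. This is a sketch under simplifying assumptions: independence is reduced to "builder and verifier have different names", and repeatability to "five runs agree". `VerificationSetup`, `diagnose`, and the example setups are all hypothetical.

```python
import itertools
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerificationSetup:
    claim: str                        # object of truth, in words
    check: Callable[[object], bool]   # the test of that claim
    builder: str                      # who produced the artifact
    verifier: str                     # who runs the check


def diagnose(setup, artifact, passing_example, failing_example, runs=5) -> list:
    """Return the names of the properties that appear to be missing."""
    problems = []
    # Object of truth: the check must accept a good case and reject a bad one.
    if not (setup.check(passing_example) and not setup.check(failing_example)):
        problems.append("object of truth")
    # Independence: builder and verifier must be different parties.
    if setup.builder == setup.verifier:
        problems.append("independence")
    # Repeatability: the same artifact must get the same verdict on every run.
    if len({setup.check(artifact) for _ in range(runs)}) != 1:
        problems.append("repeatability")
    return problems


def in_unit_interval(values) -> bool:
    return all(0.0 <= v <= 1.0 for v in values)


# A check that alternates True/False regardless of input. Its first two
# verdicts happen to look falsifiable; repeated runs expose it.
_verdicts = itertools.cycle([True, False])
def alternating_check(values) -> bool:
    return next(_verdicts)


claim = "every score is between 0 and 1"
setups = {
    "sound": VerificationSetup(claim, in_unit_interval, "developer", "ci-server"),
    "self-certified": VerificationSetup(claim, in_unit_interval, "developer", "developer"),
    "flaky": VerificationSetup(claim, alternating_check, "developer", "ci-server"),
    "wish": VerificationSetup("the system should be reliable", lambda _: True,
                              "developer", "ci-server"),
}
results = {
    name: diagnose(s, artifact=[0.2, 0.9], passing_example=[0.5], failing_example=[1.5])
    for name, s in setups.items()
}
```

Real failures are rarely this tidy, but the questions are the same ones: can the check fail, is the checker someone else, and does it give the same answer twice?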
The Ghost Across Domains
Here’s where it gets interesting. Once you see the three-property pattern, you start seeing it everywhere. The same mechanism — the same ghost — appears in wildly different places, wearing different clothes but performing the same function.
Aviation. The pre-flight checklist is one of the oldest and most successful verification systems in existence. It dates back to 1935, when Boeing’s Model 299 (the prototype that would become the B-17 Flying Fortress) crashed on a demonstration flight because the pilot forgot to release a gust lock. The plane was deemed “too much airplane for one man to fly.” The response wasn’t to simplify the plane. A group of test pilots wrote a checklist instead. The modern version has a clear object of truth (each item on the checklist), independence (the first officer checks the captain’s work and vice versa, using a challenge-and-response protocol), and repeatability (the same checklist, every flight, forever). The NTSB has documented case after case where skipping a checklist item contributed to an accident. The checklist works not because pilots are incompetent but because even competent humans forget things under stress, and aviation decided decades ago that designing systems around this reality was better than pretending it didn’t exist.
Formal Methods. In software, formal verification uses mathematical proofs to show that a system satisfies its specification. Tools like TLA+ (created by Lamport), Coq, and Isabelle allow engineers to write precise specifications and then prove or exhaustively model-check, rather than merely test, that a design or program conforms. Amazon Web Services has used TLA+ to verify the design of DynamoDB, S3, and other critical services, finding subtle bugs that would have been virtually impossible to catch through testing alone. The object of truth is the formal spec, independence comes from the mathematical framework itself, and repeatability follows from the fact that checking a proof is deterministic.
Finance. The “three lines of defense” model used by banks and regulators is explicitly a verification architecture. The first line is the business unit (the builder). The second line is risk management and compliance (the verifier). The third line is internal audit (the verifier of the verifier). Each line is structurally independent of the others. The model has its problems — the 2008 financial crisis demonstrated that independence can be undermined by shared incentive structures — but the architecture itself is a textbook application of verification principles.
Scientific Peer Review. The peer review system in academic publishing is, at its core, a verification mechanism. The object of truth is the scientific claim. Independence is maintained by anonymous reviewers who have no stake in the outcome. Repeatability comes from the expectation that the experimental methods are described in enough detail for replication. The replication crisis — the discovery that many published results fail to replicate — is, in verification terms, a failure of the repeatability property. The object of truth existed. Independence existed (imperfectly). But repeatability was not enforced, and so the system gradually filled with claims that couldn’t survive a second test.
Evidence-Based Management. The movement to bring experimental rigor to business decision-making, championed by scholars like Jeffrey Pfeffer and Robert Sutton in Hard Facts, Dangerous Half-Truths, and Total Nonsense, is an attempt to import verification into a domain that has historically run on intuition and authority. The object of truth is a measurable business outcome. Independence comes from separating the decision-maker from the measurement. Repeatability comes from running experiments rather than relying on one-off case studies.
Red Teaming. In security, red teaming is adversarial verification. You hire a team whose entire job is to break your system, and you give them independence (they don’t report to the team that built the system) and repeatability (they test regularly, not once). The object of truth is the security posture — the claim that the system is resistant to a defined set of attacks. The red team’s job is to falsify that claim. If they can’t break it, the claim gets stronger. If they can, you learn something before an actual adversary does. Schneier’s point about security as a process is, in this light, a point about the necessity of continuous, independent, repeatable verification.
AI Agent Verification. This is the newest domain, and arguably the most urgent. As AI agents gain the ability to take actions in the world — booking flights, writing code, executing trades, managing infrastructure — the verification problem becomes acute. How do you verify that an agent is doing what you intended when the agent’s behavior is non-deterministic, its reasoning is partially opaque, and its capability boundary is unknown? The object of truth is the agent’s mandate or instruction set — but unlike a checklist item or a test assertion, the mandate is often expressed in natural language, which is inherently ambiguous. Independence requires that the monitoring system not be the agent itself, a principle that current agent architectures frequently violate when they ask the model to evaluate its own outputs. Repeatability is the hardest property to maintain: the same prompt to the same model can produce different outputs on different runs. The alignment research community has been grappling with versions of this problem for years, often without using the word “verification” — but the structural challenge is identical to what Pronovost faced in the ICU. The system is too complex for any one actor to guarantee correctness by talent alone. You need a ghost.
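One way to picture the three properties applied to an agent is the sketch below. It assumes the natural-language mandate has already been restated as explicit limits, which is the hard part and is not solved here. `Action`, `Mandate`, `verify`, and `gated_execute` are hypothetical names, and the proposals are invented stand-ins for three runs of the same prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str          # e.g. "book_flight"
    amount_usd: float


@dataclass(frozen=True)
class Mandate:
    """The instruction, restated as limits a program can check."""
    allowed_kinds: frozenset
    max_amount_usd: float


def verify(action: Action, mandate: Mandate) -> list:
    """Independent check: it reads the mandate and never consults the agent."""
    violations = []
    if action.kind not in mandate.allowed_kinds:
        violations.append(f"action kind '{action.kind}' is outside the mandate")
    if action.amount_usd > mandate.max_amount_usd:
        violations.append(
            f"amount {action.amount_usd} exceeds limit {mandate.max_amount_usd}")
    return violations


def gated_execute(proposals, mandate):
    """Check every proposal, since identical prompts may yield different ones."""
    executed, blocked = [], []
    for action in proposals:
        (blocked if verify(action, mandate) else executed).append(action)
    return executed, blocked


mandate = Mandate(allowed_kinds=frozenset({"book_flight"}), max_amount_usd=500.0)

# Invented outputs from three runs of the same prompt.
proposals = [
    Action("book_flight", 420.0),
    Action("book_flight", 980.0),
    Action("execute_trade", 100.0),
]
executed, blocked = gated_execute(proposals, mandate)
```

The agent's output is non-deterministic, but the verifier is not: the same action against the same mandate always gets the same verdict. Repeatability moves from the thing being checked to the thing doing the checking.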
| Domain | Object of Truth | Independence | Repeatability |
|---|---|---|---|
| Medicine | Five checklist items | Nurse reads, surgeon acts | Every surgery |
| Software | Test assertions | CI server runs, not developer | Every merge |
| Finance | COSO/COBIT controls | Internal audit (3rd line) | Every quarter |
| AI / RLHF | Human preference ratings | Separate reward model | Every training batch |
| Military | After Action Review criteria | Red team vs. planners | Every operation |
| Politics | Auditable public claims | Free press, independent judiciary | Every election cycle |
The point of this survey isn’t encyclopedic coverage. It’s pattern recognition. These domains differ in their stakes, their vocabularies, their cultures, and their failure modes. But they share an underlying mechanism: a system that generates claims about the world and a separate system that checks those claims, repeatedly, with a clear standard for what counts as correct. The ghost is the same ghost.
Reconsidering the Checklist
Now let’s go back to the operating room and complicate the story.
Pronovost’s checklist is verification, not validation. It verifies that the surgical team followed the protocol. It does not, by itself, validate that the protocol is the right one. The checklist checks compliance with a spec. But who checks the spec?
This is where the story gets interesting. Over time, hospitals didn’t just use the checklist. They studied it. Researchers tracked infection rates before and after adoption. They compared hospitals that used the checklist against hospitals that didn’t. They ran randomized controlled trials. They published the results in peer-reviewed journals. This second layer — studying whether the checklist itself works — is validation.
And validation, it turns out, requires its own verification. The clinical trials that validated the checklist had to be verified: Were the data collected correctly? Were the statistical methods appropriate? Were the control groups properly matched? Verification and validation aren’t a one-time sequence. They’re a recursive loop. You verify the artifact against the spec. You validate the spec against reality. Then you verify the validation study against its own spec. It’s verification all the way down — or at least, it should be.
This recursion shows up in software too. The CI pipeline verifies code against tests. But who verifies the tests? Code review, where a human (independent of the test author) reads the test and asks: does this actually test what we care about? And who validates that the test suite covers the scenarios that matter in production? Observability systems, incident retrospectives, error tracking — all of which are themselves verification mechanisms applied to the broader question of whether the verification layer is doing its job.
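Code review is the human answer to "who verifies the tests?" A mechanical complement, which this essay does not otherwise cover, is mutation testing: break the code on purpose and see whether the tests notice. The sketch below is a hand-rolled illustration with hypothetical names, not a real mutation-testing tool.

```python
def clamp(x: float, low: float, high: float) -> float:
    """Artifact under test: restrict x to the interval [low, high]."""
    return max(low, min(x, high))


def clamp_mutant(x: float, low: float, high: float) -> float:
    """Deliberately broken copy: ignores the lower bound."""
    return min(x, high)


def weak_suite(fn) -> bool:
    """Only exercises the upper bound, so it cannot see the mutant's bug."""
    return fn(5, 0, 10) == 5 and fn(15, 0, 10) == 10


def strong_suite(fn) -> bool:
    """Exercises both bounds."""
    return weak_suite(fn) and fn(-5, 0, 10) == 0


def suite_catches(suite, original, mutant) -> bool:
    """A suite earns trust if it passes the original and fails the mutant."""
    return suite(original) and not suite(mutant)


weak_ok = suite_catches(weak_suite, clamp, clamp_mutant)       # False
strong_ok = suite_catches(strong_suite, clamp, clamp_mutant)   # True
```

Both suites are green against the real code. Only one of them would turn red if the code were wrong, and that difference is invisible until something checks the checker.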
The CI/CD pipeline is, in this light, a secular ritual. It doesn’t guarantee correctness. It doesn’t even guarantee that the tests are good. What it guarantees is discipline — a structured moment where the system pauses and asks: do the claims still hold? The ritual is the value, because the ritual creates the habit, and the habit creates the culture, and the culture is what catches the bug at 2 AM when nobody is thinking clearly and the temptation to ship without checking is strongest.
There’s something almost religious about this. Every major spiritual tradition has rituals of examination — the Catholic examination of conscience, the Jewish practice of cheshbon hanefesh (accounting of the soul), the Buddhist practice of mindful review. These aren’t verification in the technical sense, but they share the structural insight: you need a recurring, structured, honest encounter with the question “am I doing what I think I’m doing?” Left to our own devices, we drift. We tell ourselves stories. We round up. The ritual — whether it’s a prayer or a pipeline — is the mechanism that interrupts the drift.
The deepest failure mode in any verification system is not a false positive or a false negative. It’s the erosion of the ritual itself. When the team starts merging with failing tests “just this once.” When the auditor starts rubber-stamping reports because the client is important. When the nurse stops enforcing the checklist because the surgeon is senior and intimidating. The ghost doesn’t die in a dramatic failure. It dies quietly, through accumulated exceptions, each one individually reasonable, collectively fatal.
The Thesis
Here’s what I think all of this adds up to.
We don’t get truth in complex systems by being right. We get it by making it as easy as possible to prove ourselves wrong. The surgical checklist doesn’t make the surgeon smarter. The CI pipeline doesn’t make the developer smarter. The red team doesn’t make the security architecture smarter. What they do is create a structural condition — small, stubborn, independent, repeatable — where errors become visible before they become catastrophic.
This is a specific, testable claim about how complex systems maintain integrity over time. It says that the critical variable isn’t the quality of the initial design, or the talent of the people, or the sophistication of the tools. The critical variable is the presence or absence of verification mechanisms with three properties: an object of truth (something falsifiable to check against), independence (the checker is not the builder), and repeatability (the check runs every time, not just once).
When those three properties hold, systems self-correct. When any one of them is missing, systems drift. They drift slowly at first (an ignored test, a rubber-stamped audit, a suppressed inspection) and then fail the way Hemingway’s character went bankrupt: gradually, then suddenly.
The ghost of verification is the structural immune system of every complex endeavor that works. It’s the mechanism by which aviation went from one of the most dangerous forms of transportation to one of the safest. It’s the mechanism by which software teams ship changes to millions of users multiple times per day without everything catching fire. It’s the mechanism that, when absent, lets Enron happen, lets Theranos happen, lets the replication crisis happen, lets the 2008 financial crisis happen.
It is not glamorous. It is a sheet of paper in an operating room, a green checkmark on a pull request, a reviewer’s red ink on a manuscript, a penetration tester’s report on a CISO’s desk. It is the most boring superpower in existence. And it is, I believe, the most underappreciated structural pattern in how human systems maintain contact with reality.
The pattern also explains why some institutions degrade and others don’t. The institutions that last — the ones that maintain quality over decades, that survive leadership changes and market shifts and technological disruption — are almost always the ones with deeply embedded verification mechanisms. They’re the hospitals that never stopped enforcing the checklist, the engineering organizations that never started merging with broken builds, the financial institutions that never let the auditors get cozy with the clients. The verification mechanisms are not separate from the institution’s character. They are the character, expressed as structure rather than intention.
And this is the most important point: character expressed as structure is more durable than character expressed as intention. Good intentions fade. Good structures persist. A hospital full of well-intentioned doctors who sometimes forget to wash their hands will have a higher infection rate than a hospital full of average doctors with a strictly enforced checklist. The checklist doesn’t replace good intentions. It survives the moments when good intentions aren’t enough.
Try It Yourself
Pick a domain from the table above. See how the three properties manifest: where the object of truth lives, how independence is maintained, and what makes the check repeatable. Then try removing one property in your mind and watch what breaks. The point isn’t to memorize a framework. It’s to develop the instinct for spotting verification’s presence and, more importantly, its absence.
Coda
If there is a rule of thumb buried in all of this, it’s the one I keep returning to:
Genius is overrated. The real superpower is designing your work so that, if you’re wrong, something small and stubborn will tell you — early, clearly, and every single time.
Not a dashboard. Not a meeting. Not a quarterly review. Something that runs on its own, that doesn’t care about your feelings, that checks the same five things every single time with the bored persistence of a nurse holding a clipboard. Something independent of the person who built the thing. Something repeatable enough that it catches the drift before the drift becomes a disaster.
The surgeon doesn’t need to be brilliant. The pilot doesn’t need to be infallible. The developer doesn’t need to write perfect code. The organization doesn’t need to hire only geniuses. What they need is a ghost — a small, persistent, independent mechanism that asks the same boring questions, every time, and refuses to let you proceed until you answer them honestly.
Build the ghost into the system, and the system gets better over time. Remove the ghost, and the system runs on faith. Faith works great right up until the moment it doesn’t.
A surgical checklist, a CI pipeline, a red team, and a free press walk into a bar. The bartender says: “What’ll it be?” They all answer the same thing: “The truth. Verified.”