The Thermometer and the Fever: Why Autonomous AI Cannot Choose What to Measure
Herman Boma
March 19, 2026 · 18 Min Read
“If you can’t evaluate it, you can’t auto-research it.” — Andrej Karpathy
There is a particular kind of blindness that afflicts any optimization engine, biological or digital: it can only improve what it can measure. A thermostat regulates temperature with exquisite precision, but it will never notice that the room smells of smoke. A reinforcement learning loop will make a language model superhuman at code generation while leaving it incapable of telling a good joke. The instrument determines the territory it can see.
The dictum often attributed to Lord Kelvin, “If you cannot measure it, you cannot improve it,” has become the informal motto of the data-driven age. But there is a corollary that Kelvin did not state, perhaps because in the era of physical instruments it seemed too obvious to need saying: someone must first decide what is worth measuring. The thermometer does not choose the fever.
This is the essay’s central claim: the binding constraint on autonomous AI research is not compute, not model capability, not even code generation quality. It is the ability to design metrics that an agent can optimize against. And that remains, for now, an irreducibly human act.
The Perfect Metric and Its Absence
Andrej Karpathy recently demonstrated a paradigm he calls “auto-research”: an AI agent set loose to optimize nanoGPT’s training code, running experiments autonomously against a single number — validation loss. The agent beat human baselines overnight. The demonstration was electrifying. It was also, in a specific and important way, a special case.
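The shape of such a loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Karpathy's implementation: `evaluate` is a hypothetical stand-in for a training run that reports validation loss, and `propose` is a hypothetical stand-in for the agent's edit to the training setup.

```python
import random

def evaluate(config):
    # Hypothetical stand-in for "train the model and report validation loss".
    # Here the loss is a simple bowl with its minimum at lr = 0.3.
    return (config["lr"] - 0.3) ** 2 + 1.0

def propose(config, rng):
    # Hypothetical stand-in for the agent's edit: nudge one setting.
    return {"lr": config["lr"] + rng.uniform(-0.1, 0.1)}

def auto_research(config, steps, seed=0):
    rng = random.Random(seed)
    best, best_loss = config, evaluate(config)
    for _ in range(steps):
        candidate = propose(best, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # keep a change only if the single number improves
            best, best_loss = candidate, loss
    return best, best_loss

start = {"lr": 0.9}
best, best_loss = auto_research(start, steps=200)
```

Everything the loop knows about "better" lives in `evaluate`. Swap in a metric that is slow, noisy, or beside the point, and the same loop optimizes just as diligently toward the wrong place.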
NanoGPT’s validation loss is what Luis G. Teixeira, in his book Data Science: The Hard Parts, would call a metric satisfying all four MART properties simultaneously. It is measurable — directly computed from the training run. It is actionable — the agent can modify hyperparameters, architecture, and code to affect it. It is relevant — validation loss is not a proxy for model quality; it is model quality. And it is timely — training runs complete in minutes, providing rapid feedback.
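The four properties read naturally as a checklist. The sketch below is one hypothetical way to encode it; `MetricProfile` and its method names are illustrative and do not come from the book, and the true/false ratings are judgment calls made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricProfile:
    name: str
    measurable: bool
    actionable: bool
    relevant: bool
    timely: bool

    def failing(self):
        # List the MART properties this metric does not satisfy.
        checks = {
            "measurable": self.measurable,
            "actionable": self.actionable,
            "relevant": self.relevant,
            "timely": self.timely,
        }
        return [prop for prop, ok in checks.items() if not ok]

    def auto_researchable(self):
        # An agent can optimize against the metric only if all four hold.
        return not self.failing()

# Illustrative ratings, not measurements.
val_loss = MetricProfile("validation loss", True, True, True, True)
revenue = MetricProfile("quarterly revenue", True, False, True, True)
```

Here `val_loss.auto_researchable()` is true, while `revenue.failing()` names the single property that blocks it.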
Most problems in the world fail at least one of these tests.
Revenue is not directly actionable: too many factors intervene between any single decision and the quarterly number. Customer satisfaction is not measurable in real time: surveys arrive weeks after the experience they attempt to capture. “Product quality” is not even well-defined, let alone relevant to a single optimization target. The real world is full of objectives that matter deeply and resist direct measurement.
“Great data scientists are great at metrics design.” — Luis G. Teixeira, Data Science: The Hard Parts
Karpathy treats metric-existence as binary: either you have one, or you don’t. Teixeira treats it as a craft — with decomposition techniques that can create evaluability where none exists. The gap between these two framings is where the most important work in AI will happen over the next decade.
The Jaggedness Is Not a Bug
Consider Karpathy’s observation that frontier models are brilliant at code but terrible at jokes. The jaggedness — superhuman at certain tasks, mediocre at others — is commonly discussed as though it were a temporary limitation, a bug that more scale or better training will fix.
It is not a bug. It is a measurement map.
The RL training loops at frontier labs are already auto-researchers. They optimize whatever is measurable and leave everything else flat. Code correctness has a verifiable ground truth: the test suite either passes or it doesn’t. Mathematical proofs can be checked. Factual claims can be validated against databases. These domains sit on the evaluable side of the frontier — and they are precisely where models have achieved superhuman performance.
Humor does not have a verifiable ground truth. Neither does empathy, nor narrative craft, nor the ability to read a room. These domains sit on the non-evaluable side of the frontier — and they are precisely where models remain mediocre. The jagged frontier traces the boundary between what we can measure and what we cannot. It will remain jagged until someone figures out how to decompose “good joke” into sub-metrics that an optimization loop can act on.
Teixeira’s final chapter confirms this independently, through a different lens. He classifies data science tasks by their exposure to LLM automation: programming, data cleaning, and model comparison have high exposure — all have verifiable outputs. Business problem identification, solution proposals, and delivering insights have low exposure — all require human judgment. Two researchers, working from entirely different starting points, arriving at the same map.
The Cartographer’s Art
If measurement determines what can be optimized, then the most valuable skill in the age of autonomous AI is not coding, or training models, or designing systems. It is the older, quieter art of decomposition — breaking hard-to-measure objectives into components that an agent can act on.
Teixeira provides the cartographer’s toolkit. Revenue, that most stubbornly opaque of business metrics, becomes tractable when decomposed into a funnel: the ratio of completed purchases to initiated checkouts, the ratio of initiated checkouts to product views, the ratio of product views to visits. Each fraction is a conversion rate. Each conversion rate is measurable per session, actionable via A/B tests, relevant to a specific team’s decisions, and computable daily. The decomposition has converted one hard problem into several easier ones — and some of those easier ones are now fully auto-researchable.
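The funnel arithmetic can be made concrete. The sketch below assumes one extra factor that the description above leaves out, an average order value, so that the product of the ratios lands on a revenue figure; the function names and the traffic numbers are made up for illustration.

```python
def funnel_rates(visits, views, checkouts, purchases):
    # Each stage is the ratio of one count to the count before it.
    return {
        "visit_to_view": views / visits,
        "view_to_checkout": checkouts / views,
        "checkout_to_purchase": purchases / checkouts,
    }

def revenue(visits, rates, avg_order_value):
    # Assumption: revenue = visits x stage conversion rates x average order value.
    purchases = (
        visits
        * rates["visit_to_view"]
        * rates["view_to_checkout"]
        * rates["checkout_to_purchase"]
    )
    return purchases * avg_order_value

# Illustrative numbers only.
rates = funnel_rates(visits=10_000, views=4_000, checkouts=800, purchases=200)
total = revenue(10_000, rates, avg_order_value=50.0)
```

Because the ratios telescope, their product is just purchases divided by visits. The decomposition adds no new information about revenue; what it adds is three separate numbers, each of which a team or an agent can move on its own.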
The same principle works at different scales. Monthly active users — a stock variable that moves with glacial slowness — becomes tractable when separated into flows: new users arriving, existing users churning. An agent focused on reducing churn doesn’t interfere with an agent focused on improving acquisition. They are naturally parallelizable, which maps directly to Karpathy’s vision of multiple agents working simultaneously on different parts of the same problem.
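A toy projection shows why the two flows can be worked on independently. It is a simplified sketch: it assumes a constant number of new users per month and a constant churn rate, and it ignores returning users, which a real model would likely track as a third flow. All figures are invented.

```python
def project_mau(start_mau, new_users, churn_rate, months):
    # Stock-flow identity: next month's stock = stock + inflow - outflow.
    mau = start_mau
    history = [mau]
    for _ in range(months):
        churned = mau * churn_rate
        mau = mau + new_users - churned
        history.append(mau)
    return history

# Illustrative scenarios: each changes exactly one flow.
baseline = project_mau(100_000, new_users=5_000, churn_rate=0.05, months=12)
less_churn = project_mau(100_000, new_users=5_000, churn_rate=0.04, months=12)
more_acquisition = project_mau(100_000, new_users=6_000, churn_rate=0.05, months=12)
```

In the baseline the inflow and outflow cancel, so the stock sits still and tells an optimizer nothing. Each flow, by contrast, responds within the month to the one lever that controls it.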
There is a principle underlying all these techniques: decomposition trades one hard metric for several easier ones. Some leaves of the decomposition tree will satisfy MART fully — those are the entry points for autonomous optimization. Others won’t — those need further decomposition, or they remain in the domain of human judgment. The art is knowing where to draw the line.
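One way to picture "drawing the line" is a walk over the decomposition tree that sorts its leaves into two piles. The sketch below is hypothetical: `MetricNode`, the `satisfies_mart` flag, and the "brand trust" leaf are illustrative inventions, and deciding the flag for each leaf is exactly the human judgment the paragraph describes.

```python
from dataclasses import dataclass, field

@dataclass
class MetricNode:
    name: str
    satisfies_mart: bool = False  # set by a human reviewer, not computed
    children: list = field(default_factory=list)

def entry_points(node):
    """Return (auto_researchable_leaves, leaves_needing_humans) by name."""
    if not node.children:
        if node.satisfies_mart:
            return [node.name], []
        return [], [node.name]
    auto, human = [], []
    for child in node.children:
        child_auto, child_human = entry_points(child)
        auto += child_auto
        human += child_human
    return auto, human

# Illustrative tree; "brand trust" is a made-up example of a stubborn leaf.
tree = MetricNode("revenue", children=[
    MetricNode("funnel", children=[
        MetricNode("visit-to-view rate", satisfies_mart=True),
        MetricNode("view-to-checkout rate", satisfies_mart=True),
        MetricNode("checkout-to-purchase rate", satisfies_mart=True),
    ]),
    MetricNode("brand trust"),
])
auto, human = entry_points(tree)
```

The first list is where agents can start tonight. The second is the backlog for further decomposition, or for people.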
The Danger Zone
Not every problem should be automated. Teixeira’s framework, crossed with Karpathy’s categories, reveals a quadrant that should terrify anyone building autonomous systems: the space where automation potential is high but evaluability is low.
This is where an agent can produce output — fluently, confidently, at scale — but you cannot tell if the output is good. Generating business narratives without validation. Feature engineering without causal hypotheses. Model deployment without drift monitoring. The output looks plausible. The dashboards are green. And the system is producing well-adapted garbage, optimizing a proxy that long ago decoupled from the real objective.
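The quadrant can be written down as a small classifier. This is a rough sketch: the two scores are hypothetical ratings between 0 and 1, the 0.5 threshold is arbitrary, and only "danger zone" and "sweet spot" are the essay's terms. The other two labels are placeholders.

```python
def zone(automation_potential, evaluability, threshold=0.5):
    """Classify a task from two hypothetical scores in [0, 1]."""
    can_automate = automation_potential >= threshold
    can_evaluate = evaluability >= threshold
    if can_automate and can_evaluate:
        return "sweet spot"
    if can_automate:
        return "danger zone"  # fluent output, no way to check it
    if can_evaluate:
        return "checkable but manual"  # placeholder label
    return "human judgment"  # placeholder label

# Illustrative ratings, not measurements.
examples = {
    "tune training code against validation loss": zone(0.9, 0.9),
    "generate business narratives without validation": zone(0.9, 0.2),
}
```

Note that the two axes are independent. Raising a task's evaluability, by building a check for it, moves it out of the danger zone without changing how easy it is to automate.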
Charles Goodhart identified this failure mode in economics in 1975. Its popular formulation, due to the anthropologist Marilyn Strathern, runs: “When a measure becomes a target, it ceases to be a good measure.” Evolutionary optimization applied in the danger zone produces solutions that are perfectly adapted to the fitness function and perfectly disconnected from the problem you actually care about. It is the algorithmic equivalent of teaching to the test.
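A toy example makes the decoupling visible. Both functions below are invented for illustration: the proxy always rewards "more", while the hypothetical true objective rises at first and then falls, as it might if x were the number of notifications sent to a user each week.

```python
def proxy(x):
    # What the loop can see, e.g. clicks generated by x notifications.
    return x

def true_objective(x):
    # What we actually care about (hypothetical): peaks at x = 5, then declines.
    return x - x * x / 10

def climb(metric, x=0, steps=20):
    # Greedy hill climbing: take a step whenever the metric improves.
    trajectory = [x]
    for _ in range(steps):
        if metric(x + 1) > metric(x):
            x += 1
        trajectory.append(x)
    return trajectory

path = climb(proxy)
proxy_scores = [proxy(x) for x in path]
true_scores = [true_objective(x) for x in path]
```

The proxy improves at every step while the true objective peaks at 2.5 and ends at -20. Nothing in the loop's own view of the world signals that anything went wrong.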
The antidote, as Anthropic’s own engineers have discovered, is verification. Their guidance on building Claude Code skills identifies the pattern with quiet precision: “Verification skills are extremely useful for ensuring Claude’s output is correct. It can be worth having an engineer spend a week just making your verification skills excellent.”
Verification creates evaluability. Without it, you are in the danger zone. With it, you move to the sweet spot. The investment in verification is literally the investment in making autonomous optimization possible. The week spent building a verification harness is worth more than the month spent building the agent it validates.
The Recursive Insight
Karpathy’s deepest observation, almost offhand in its delivery: “Every research organization is essentially described by a program.md… once you have code, you can imagine tuning the code.”
Follow this thread to its conclusion. If the skills that drive auto-research are themselves code, then they are themselves optimizable. The instructions that tell an agent how to decompose metrics can be evolved against a fitness function — namely, how often the decomposition’s outputs lead to successful optimization loops. The experiment design templates can be mutated and selected based on which designs produce the most informative results. The verification criteria can be refined based on their false positive and false negative rates.
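The last of these, refining verification criteria by their error rates, is the easiest to make concrete. The sketch below assumes a small set of outputs that a human has already labeled as good or bad; the verifier, the cases, and the function name are all hypothetical.

```python
def verifier_error_rates(cases, verifier):
    """cases: list of (output, is_actually_good) pairs labeled by a human."""
    false_pos = false_neg = good = bad = 0
    for output, is_good in cases:
        accepted = bool(verifier(output))
        if is_good:
            good += 1
            if not accepted:
                false_neg += 1  # good work rejected
        else:
            bad += 1
            if accepted:
                false_pos += 1  # bad work waved through
    return {
        "false_positive_rate": false_pos / bad if bad else 0.0,
        "false_negative_rate": false_neg / good if good else 0.0,
    }

# A deliberately naive verifier: trusts any output that claims success.
def naive_verifier(output):
    return output.endswith("OK")

# Illustrative labeled cases.
cases = [
    ("all tests pass: OK", True),
    ("plausible but wrong: OK", False),
    ("correct, unusual format", True),
    ("stack trace", False),
]
rates = verifier_error_rates(cases, naive_verifier)
```

The two rates are the verifier's own fitness signal. The labeled cases, though, still come from people, which is the essay's point in miniature.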
This produces nested feedback loops operating at different timescales. The innermost loop — individual skill improvement — cycles in hours. The middle loop — pipeline configuration — cycles in days. The outermost loop — decomposition strategy evolution — cycles in weeks, and remains human-guided.
The recursive risk is real. A system that optimizes its own optimization instructions could, in principle, optimize away its own safety checks. The solution is architectural: the fitness function and verification criteria must be outside the evolutionary loop. They are the non-negotiable constants — the “trusted verifier” that Karpathy’s framework requires. The thermometer can be recalibrated, but the decision about which fever to measure cannot be delegated to the thermometer itself.
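Architecturally, "outside the loop" can be as simple as a gate that candidates must pass before fitness is even consulted. The sketch below is a hypothetical illustration: the candidate format, the check names, and the fitness formula are invented, and a real system would need stronger isolation than a read-only mapping.

```python
from types import MappingProxyType

# Held outside the evolutionary loop: candidates never get to edit this.
TRUSTED_CRITERIA = MappingProxyType(
    {"required_checks": frozenset({"unit_tests", "safety_review"})}
)

def trusted_verify(candidate):
    # A candidate is admissible only if it keeps every required check.
    return TRUSTED_CRITERIA["required_checks"] <= set(candidate["checks"])

def fitness(candidate):
    # Hypothetical score in which dropping checks looks like an improvement.
    return candidate["speed"] - len(candidate["checks"])

def select(candidates):
    # Verification gates selection; fitness only ranks what survives the gate.
    admissible = [c for c in candidates if trusted_verify(c)]
    return max(admissible, key=fitness) if admissible else None

candidates = [
    {"name": "baseline", "speed": 5, "checks": ["unit_tests", "safety_review"]},
    {"name": "faster", "speed": 7, "checks": ["unit_tests", "safety_review"]},
    {"name": "no_checks", "speed": 9, "checks": []},
]
unguarded_winner = max(candidates, key=fitness)
winner = select(candidates)
```

Without the gate, the loop picks the candidate that deleted its own safety checks, because by the fitness function's lights that candidate really is the best. With the gate, it never gets the chance.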
The Irreducible Act
The beginning and end of the pipeline remain human. Defining what to optimize — the choice of metric, the judgment about what matters — cannot be automated without circularity. Communicating what was found — the narrative that translates a number into a decision — requires audience awareness that no model possesses. The middle, the optimization itself, is where agents excel.
This is not a temporary limitation. It reflects a structural asymmetry: the questions worth asking are harder than the answers worth computing. An agent can explore a million configurations of a checkout flow and find the one that maximizes conversion. It cannot decide whether maximizing conversion is the right thing to do — whether the increased revenue is worth the cognitive load on users, whether the “optimized” flow creates dark patterns that erode trust, whether the metric you chose as a proxy for business health is actually correlated with business health.
The most valuable skill in the age of autonomous AI is not writing code. It is deciding what the code should optimize for.
The auto-research revolution will not be limited by compute, by model size, or by code generation quality. It will be limited by the supply of well-designed metrics — by people who understand both the domain deeply enough to know what matters and the optimization machinery well enough to know what it can act on. The thermometer is extraordinarily precise. The choice of which fever to measure remains ours.