WAC*
Wins above Claude and Co. Plus, Google grows up.

A question that people often think about these days is, “how good is AI at doing stuff?” A lot of stuff that people do—like building software, and responding to customer support tickets, and drafting legal documents, and analyzing data—is either stuff that they would rather not do, or stuff that their employer would rather not pay them to do. Large language models seem capable of doing that stuff. So, for better or for worse, whenever a new AI model comes out—which, in 2026, happens roughly every other day[1]—people want to know: Can the new model do the stuff?
Over the last few years, a popular method for answering that question has been to administer a test. Create a long list of questions, about software or support tickets or charts or the law; give the questions to the model; see if it gets the right answers. Can it write code? Can it follow instructions? Demonstrate empathy? See pictures? Do math? Do math when the problems are pictures? Pass a Mensa test? Use a computer? Use the internet? Pass the LSAT? Pass the bar? Resist lying after it passes the LSAT and the bar? Create a test and find out.
Of course, everyone knows these tests aren’t perfect, but what else can you do? And surely, these tests are better than nothing.
A question that people think about a little less these days is, “how good is a person at doing stuff?” It used to be that if a business needed to do stuff—if they needed to debug software or respond to customer support tickets or draft legal documents or make charts—they wouldn’t go shopping for a computer agent; they would create a “job”—that is, a collection of tasks to be done inside of a particular company—and hire someone to do it. And so, people wanted to know: Which people can do those tasks, in that company?
For a while, one popular method for answering that question was to administer a test. The company would invite candidates to their office, ask them to solve logic puzzles or do math or play chess, and see if they got the right answers.
Part of the theory behind these interviews was the same as the theory behind any academic test: Sure, the test isn’t the job, just as the MCAT isn’t an operating table, but performance on the test correlates with performance at the job. People who are good at solving coding puzzles on a whiteboard might be good at debugging production software; people who are good at math puzzles might be good at making useful business charts; people who are good at abstract games of logic might be good at finding market, uh, inefficiencies.
But a bigger part of the theory was, “what else can you do?” Companies only get so much time with job candidates. Nobody would’ve argued that the tests were perfect; some people would’ve argued that the tests were better than nothing.
Fewer people would argue that now.[2] From The Atlantic, thirteen years ago:
“We found that brainteasers are a complete waste of time,” Laszlo Bock, senior vice president of people operations at Google, told the New York Times. “They don’t predict anything. They serve primarily to make the interviewer feel smart.”
And new methods became trendy instead. From Linear, a few years ago:
We believe the only way to build a quality product and business is to hire people we can trust to make good judgments, across all functions and levels. …
The problem is that these types of people, unfortunately, are few and far between. The majority of companies don’t work in the way we do, which leads to fewer people with these kinds of skills. We found that standard interview processes didn’t work well for us. It’s challenging to assess in interviews if someone is truly a builder, has good taste and judgment, can take initiative, and approaches problems productively. … A conventional interview process, often modeled by large companies, doesn’t account for this.
To evaluate if a person is a fit for Linear with the skills to be successful, we bring candidates in for a work trial as the final step in the interview process. A work trial is a paid 2-5 day period where a candidate works with our team on a real project that we plan to implement (or as close as possible to that) with access to relevant internal tools and resources.
In other words, despite many years of test development, a job still cannot be contained in a test. Not only does a job take “good taste and judgment,” but employees also have “relevant internal tools and resources” that test-takers do not. A person who is very familiar with those tools or good at asking questions of those resources, for example, might be great at the job and terrible at the test.
Moreover, a job evaluation cannot be contained in an afternoon. Companies have been interviewing people for jobs for decades, at great expense. And currently, the state-of-the-art method of evaluation is to work with them for a while, in the actual office, on the actual job. When every business has its own quirks—its own culture; its own processes; its own problems; its own coworkers—the only way to test if someone is good at the job, it seems, is for them to do the job, to encounter those quirks, and to see how it all goes.
So, you know. Could AI be bound to the same constraints? At first, when large language models were primarily used inside stateless call-and-response chatbots, you could fill a test with representative examples of what people wanted the chatbot to do. But now, AI is an agent—it is a persistent system that talks to itself, it has memories from prior conversations, and it has access to internal tools and resources. That’s not an employee, but it does a lot of things that employees do. And how do you benchmark that?
A question that a handful of people think about very intensely is, “how good are baseball players at playing baseball?” Millions of people play baseball, and there is a very lucrative market for the few hundred people who are the very best at it, so there is also a very lucrative market for the people who are the very best at appraising baseball players. How do those people decide who is good?
The original answer was basic statistical benchmarks. A pitcher that can get 27 outs before giving up three runs is very good. A hitter that gets three hits in every ten at-bats is very good, as is a batter that can drive in 100 runs in a season, or hit 500 home runs in a career. Baseball is full of bold lines around round numbers.
Over time, people realized that these numbers are too crude, and came up with more advanced statistics. Don’t just count hits; count the number of times people get on base, weighted by which base they reach. Don’t count wins; count baserunners allowed per inning pitched.
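For concreteness, here is a minimal sketch of the arithmetic behind those two generations of statistics. Batting average is the crude count; slugging percentage is one common way of weighting hits by the base reached; WHIP (walks plus hits per inning pitched) is the usual "baserunners per inning" number. The stat line is made up for illustration and does not belong to any real player.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """A hypothetical season stat line (illustrative numbers only)."""
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs


def batting_average(line: BattingLine) -> float:
    # The crude version: every hit counts the same.
    return line.hits / line.at_bats


def slugging_percentage(line: BattingLine) -> float:
    # Hits weighted by the base reached: 1, 2, 3, or 4 total bases.
    total_bases = (
        line.singles + 2 * line.doubles + 3 * line.triples + 4 * line.home_runs
    )
    return total_bases / line.at_bats


def whip(walks: int, hits_allowed: int, innings_pitched: float) -> float:
    # Walks plus hits per inning pitched: baserunners allowed, not wins.
    return (walks + hits_allowed) / innings_pitched


hitter = BattingLine(at_bats=500, singles=100, doubles=30, triples=5, home_runs=15)
print(f"AVG  {batting_average(hitter):.3f}")
print(f"SLG  {slugging_percentage(hitter):.3f}")
print(f"WHIP {whip(walks=50, hits_allowed=150, innings_pitched=200.0):.2f}")
```

Two hitters with the same average can have very different slugging percentages, which is the sense in which the older numbers were too crude.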
But these numbers still had problems, because they were absolutes. Players aren’t evaluated against some arbitrary fixed standard; they’re evaluated against their contemporaries. So the baseball appraisers came up with “wins above replacement,” or WAR, which aggregates everything a player does on a baseball field in an attempt to define how much better (or worse) the player is than the average player that a team could call up from the minor leagues.[3] And:
Over the last two decades, the Wins Above Replacement metric (WAR), which combines the contributions of different performance elements, weighted according to their contributions to team wins and adjusted for the environment in which a player’s statistics were accrued, has become the metric of choice for global evaluation of player performance. WAR is superior to traditional counting stats like [home runs, runs batted in, wins, and strikeouts] because it incorporates many diverse elements of performance that contribute to team wins and adjusts for the environment (league, ballpark, era) in which a given player performed.[4]
Though the exact formula for WAR is complicated—and there are a number of different variants—the idea behind it is simple: How much better is this player than a generic one that would take no effort to identify and relatively little money to pay? It’s a metric that doesn’t measure specific performance; it measures how far something is above an easy baseline.
A question that nearly everyone now thinks about is, “if I YOLO a bunch of MCPs into Claude, will it be able to do my job for me?” For every task, that is the new global competitor: Claude Code (or Codex, or Cowork, or ChatGPT, or Copilot, or Cursor, or Clawdbot, or, I guess, the generalized form, C*), connected to everything. Want to build a better coding robot? It’s your product versus Claude Code integrated with Linear. Want to build a better analytical agent? It’s your product versus Codex connected to Slack, a dbt MCP, and Databricks. Want to build a better automated personal assistant? It’s your product versus Clawdbot, running wild on a Mac Mini. Want to be a product marketer? It’s your launch blog post versus one written by Cowork, connected to Google Drive.
When we built a new analytical benchmark last year, that was the lesson we quickly learned: Nobody cared about the scores or about which models performed the best against some arbitrary set of tasks. They cared about how well different products performed against some system that they’d hacked together over the last two weeks, and had integrated with their relevant internal tools and resources. They cared about the new tool, against the easy—and increasingly default—baseline.
It points to a fundamental tension that now exists in nearly every benchmark: How do you create a standardized benchmark when context—the idiosyncratic stuff that is, almost by definition, impossible to standardize—is the critical thing that makes one tool useful and another tool unusable?
But maybe that is a question that we can all stop thinking about. We don’t benchmark employees, nor do we benchmark traditional software. Perhaps AI software will soon be the same. And instead, the next question that every product—and maybe every employee?—will have to answer, is, how much are you worth above C*?
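As a rough sketch of what answering that might look like, and not a real benchmark: run the product and the default C* setup on the same tasks, and report the difference rather than either raw score. Everything here is hypothetical, including the function name, the task names, and the pass/fail results.

```python
def success_rate(results: dict[str, bool], tasks: list[str]) -> float:
    # Fraction of the given tasks that succeeded (True counts as 1).
    return sum(results[task] for task in tasks) / len(tasks)


def worth_above_c_star(
    product: dict[str, bool], baseline: dict[str, bool]
) -> float:
    """Success rate of the product minus that of the baseline.

    Both arguments map a task name to whether the task succeeded. Only
    tasks that both systems attempted are compared. A positive result
    means the product beats the hacked-together default; zero or below
    means the default is already as good.
    """
    shared = sorted(product.keys() & baseline.keys())
    if not shared:
        raise ValueError("no tasks in common, nothing to compare")
    return success_rate(product, shared) - success_rate(baseline, shared)


# Hypothetical results on four made-up tasks.
product = {
    "fix_bug": True,
    "write_query": True,
    "draft_post": False,
    "triage_ticket": True,
}
baseline = {
    "fix_bug": True,
    "write_query": False,
    "draft_post": False,
    "triage_ticket": True,
}
print(f"worth above C*: {worth_above_c_star(product, baseline):+.2f}")
```

The subtraction is the easy part. The hard part is the baseline itself, because it carries each company's own tools and context, and that is the thing no shared test can standardize.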
CSS
It has become a truism to say that it’s not the model that matters; it’s the harness that matters.
But what is a harness? The normal answer is that it’s a complex set of logical instructions that turn a user’s request into a series of recursive, self-authoring prompts. We speak hyperbolically of harnesses now; harnesses are on the frontier of computer engineering; harnesses are the platform; harnesses are AGI.
But perhaps that is overcomplicating things. Could a harness be a web interface that encourages its users to write better prompts? Could it be a text box with useful suggestions for what to type? Could it just be a text box that is tall enough to type in?
We have joked about this before:
If you’re Google, here’s another idea, if you want a better and cheaper wrapper around Gemini: Make the Google search box bigger.
People already like Google’s AI Mode more than ChatGPT! But one-line boxes are for search! Two-line boxes are for chat! Even if you can AI Mode from google.com, nobody’s gonna chat in a one-line box! So just make it a two-line box! That’s the thoughtfully designed interface that makes Gemini accessible to everyone. Google doesn’t need a $2 billion acquisition. Google just needs some new CSS.
Google changes its search box for the first time in 25 years
For 25 years, Google’s iconic search box was a long, slender bar where people typed in keywords like “World Cup.”
But over the past three years, artificial intelligence allowed people to type in longer, more complex questions like “Who are the top 24 teams in the World Cup and what chance does the United States have of advancing?”
On Tuesday, Google said the A.I. shift had inspired it to overhaul the dimensions of its search bar for the first time since 2001. The box is getting bigger and more interactive so that people can ask even longer questions and upload photographs and videos into queries.
[1] According to one website that appears to keep track of such things, model providers have released 62 models in the last 126 days.
[2] Would anybody? I have no idea. On one hand, Silicon Valley’s current culture seems more enamored with the concept of IQ than ever. For example, Cognition, a leading AI-coding startup, built its early brand around how good its team was at math tests. On the other hand, in its job listings, Cognition specifically says that they “care more about demonstrated capability than credentials. A PhD is one signal among many.” On the third hand, is doing well on very hard math tests a demonstrated capability, or a credential?
[3] That is, WAR is a measure of how much better a player is than a bad major leaguer. WAA, or “wins above average,” is a measure of how much better a player is than an average major leaguer. So same, but different.
[4] For example, nobody hits .300 anymore.
The Laszlo Bock quote is doing the load-bearing work in this argument because it puts a date on when companies stopped pretending interview puzzles correlated with on-the-job output. AI benchmarks are running about fifteen years behind that realization, with SWE-Bench Verified and GDPval producing scores that read like a Mensa pass, not a production result. The wins-above-replacement frame names what's actually being asked, which is what this model does that the prior model could not, and at what unit cost. Until evals can answer that on a per-task basis, benchmarks are mostly a recruiting signal for the labs themselves.
Ah well, I always wonder how the heck companies benchmark the performance of LLMs.
Agree on the limitations of benchmarking. The vendors need those numbers to pitch how good their products are, which is once again just capital at work.
This sort of benchmarking also gives customers a better general idea (e.g., think of car buyers comparing horsepower). It helps customers make more informed purchase decisions as well.
At the macro level, it would also slowly roll into the "sorting" phase like evolution - just like what we discussed in your previous article, Benn.