WAC*

19 hrs ago

Wins above Claude and Co. Plus, Google grows up.

4 Comments

The Laszlo Bock quote is doing the load-bearing work in this argument because it puts a date on when companies stopped pretending interview puzzles correlated with on-the-job output. AI benchmarks are running about fifteen years behind that realization, with SWE-Bench Verified and GDPval producing scores that read like a Mensa pass, not a production result. The wins-above-replacement frame names what's actually being asked, which is what this model does that the prior model could not, and at what unit cost. Until evals can answer that on a per-task basis, benchmarks are mostly a recruiting signal for the labs themselves.

Jimmy Pang

Ah well, I always wonder how the heck companies benchmark the performance of LLMs.

Agreed on the limitations of benchmarking. The vendors need those numbers to pitch how good their products are, which is once again just capital at work here.

This sort of benchmarking also gives customers a better general idea (e.g. think of car owners comparing horsepower). It helps customers make more informed purchase decisions as well.

At the macro level, it would also slowly roll into the "sorting" phase, like evolution, just as we discussed in your previous article, Benn.