4 Comments
Alec Pritzos

The Laszlo Bock quote is doing the load-bearing work in this argument because it puts a date on when companies stopped pretending interview puzzles correlated with on-the-job output. AI benchmarks are running about fifteen years behind that realization, with SWE-Bench Verified and GDPval producing scores that read like a Mensa pass, not a production result. The wins-above-replacement frame names what's actually being asked, which is what this model does that the prior model could not, and at what unit cost. Until evals can answer that on a per-task basis, benchmarks are mostly a recruiting signal for the labs themselves.

Jimmy Pang

Ah well, I always wonder how the heck companies benchmark the performance of LLMs.

Agreed on the limitations of benchmarking. Vendors need benchmarks to pitch how good their products are, which is once again just capital at work.

This sort of benchmarking also gives customers a better general idea (e.g. think of car buyers comparing horsepower). It helps customers make more informed purchase decisions as well.

At the macro level, this would also slowly roll into a "sorting" phase, like evolution, just as we discussed in your previous article, Benn.

Jim Ryan

Yeah, let's see Claude play baseball. If AI is going to replace all of us, does that include professional athletes? Can Claude hit a curveball?

Marco Roy

Once it can, it will do it better than anyone else, and perfectly every time.

But endless perfect pitches & perfect hits make for a pretty boring game.