The Laszlo Bock quote is doing the load-bearing work in this argument because it puts a date on when companies stopped pretending interview puzzles correlated with on-the-job output. AI benchmarks are running about fifteen years behind that realization, with SWE-Bench Verified and GDPval producing scores that read like a Mensa pass, not a production result. The wins-above-replacement frame names what's actually being asked, which is what this model does that the prior model could not, and at what unit cost. Until evals can answer that on a per-task basis, benchmarks are mostly a recruiting signal for the labs themselves.
Ah well, I always wonder how the heck companies benchmark the performance of LLMs.
Agree on the limitation of the benchmarking. The vendors need those to pitch how good their products are, which is once again just the capital working here.
This sort of benchmarking also gives customers a better general idea (e.g. think of car owners comparing horsepower). It helps customers make more informed purchase decisions as well.
At the macro level, it would also slowly roll into the "sorting" phase like evolution - just like what we discussed in your previous article, Benn.
Yeah, let's see Claude play baseball. If AI is going to replace all of us, does that include professional athletes? Can Claude hit a curveball?
Once it can, it will do it better than anyone else, and perfectly every time.
But endless perfect pitches & perfect hits make for a pretty boring game.