A better way to lie with statistics
Statistics, it would seem, has a branding problem.
Over the last fifty years, the best-selling text on the subject has been How to Lie with Statistics. Google “quotes about statistics,”1 and the first result is “There are three types of lies: lies, damn lies, and statistics.” The most common lesson people seem to remember from their high school or college stats class is a cliché warning: “Correlation doesn’t equal causation.”2
Though I appreciate an EPIC TAKEDOWN of a ridiculous Fox News chart as much as anyone, our fixation on misleading statistics and deceptive visualizations distracts us from another way that numbers deceive us, which is far more common, far more direct yet far more subtle, and, oddly, far less discussed: Lying not with statistics, but around statistics, with words.
On July 25, 2019, we narrowly avoided a global disaster when a large asteroid, estimated to be between 200 and 400 feet across, passed within 45,000 miles of Earth.
In 2021, Minnesota Twins first baseman Miguel Sano, in almost a thousand defensive chances, committed only thirteen errors. His fielding percentage of 98.6 percent was nearly perfect.
COVID cases spiked in New York City in April, with the daily case count rising from 650 in early March to nearly 1,700 a month later.
Last January, a California man posted a video on TikTok pledging to exercise every day until Taco Bell brought back the Grilled Stuft Nacho. The clip went viral with hundreds of thousands of views.3
Reels, Facebook’s new TikTok competitor, accounts for only six percent of the time users spend on Facebook and Instagram.
We read sentences like these all the time. By most standards, they’re fair and well-written, and they adhere to Amazon’s4 pop principles of good writing by replacing adjectives and “weasel words” with data. If we were to see one of these claims in a news story or a company deck, we’d likely not only take it at face value; we’d applaud its balanced tone and quantitative rigor.
But they’re not balanced. All of these examples are either confusing, misleading, or outright false—they just do a much better job of hiding it than a deceptive y-axis.
The asteroid was real, and the figures about its size and proximity to Earth are accurate. However, to a layperson (or, at least, to this layperson), the numbers are meaningless on their own. I have no idea if 300 feet across is big for an asteroid (it seems kinda big?) or if 45,000 miles is close (it seems almost incomprehensibly far?). My entire perception of the story is defined by the word “narrowly.” Had it said “we comfortably avoided a global disaster,” I would’ve read it completely differently. In this case, the data is indecipherable, and it serves little purpose other than to provide an aura of scientific certainty that makes me, baselessly, more convinced that whatever narrative the story presents is true.
Miguel Sano’s stats are also accurate, but framing his season as “nearly perfect” and including “only” in the sentence makes it close to an outright lie. Among all first basemen in the major leagues, he had the worst fielding percentage and made the most errors. Still, successfully doing anything nearly 99 percent of the time sounds good, and most of us, including stats-minded baseball fans, would be fooled by the original claim. We wouldn’t interrogate the number; we wouldn’t ask if Sano’s performance was good relative to other first basemen in the major leagues (by this measure, it was bad), relative to other positions (by this measure, it was pretty good), relative to anyone who plays baseball recreationally (by this measure, it was outstanding), or relative to Sano’s prior seasons (by this measure, it was one of the best of his career). Instead, we’d do as we do with most numbers we read but don’t entirely understand: gloss over it, assume the number and the adjectives that describe it are correctly associated, carve a subconscious groove between “Miguel Sano” and “good fielder” in our heads, and move on.
The COVID case figures are also correct. But was it a spike? That’s entirely subjective. It’s not a spike because it’s an increase from a low baseline. It is a spike because case counts are meaningless; you have to adjust for testing rates. It’s not a spike because people are taking far fewer precautions than they were earlier in the pandemic, so some increases are expected. It is a spike because more people are vaccinated; relative to the size of the unvaccinated population, the case count is alarming.
As with Sano’s stats, a superficial read of the original argument would likely make us think we’re in the midst of another surge.5 But unlike Sano’s stats, COVID cases are hard to contextualize. It’s a gray area, and people can easily tilt us to one side or the other by simply saying which one they want us to believe, and in this case, by doing nothing more than using the verb “spiked.”
The TikTok story is just one of many examples where something is described as “going viral,” and some big-sounding number of views or shares is tacked on to prove it. But going viral (appropriately, like “a spike in cases”) is a vague, idiosyncratic term. Was Chris’ post widely viewed? Is hundreds of thousands of views a lot? On one hand, it’s certainly more than your average blog post on the future of SQL as a semantic modeling language gets. On the other hand, some of Chris’ videos now get tens of millions of views. Other TikToks—like slightly hypnotizing ten-second clips of head bobbing and lip syncing—get 686 million views and launch pop careers. And some YouTube videos—like slightly nauseating jingles of cheap Blue’s Clues knockoffs—have been watched ten billion times. So how do we judge Chris’ first Taco Bell TikTok? According to how the Washington Post tells us to.
Finally, the Reels figure is the most challenging of all. It’s true that Reels “only” accounts for six percent of people’s time on Facebook, but there’s no realistic way to know if that’s an impressive number or an abysmal one. What do you benchmark it against? The last time a company that has three billion active users launched a product designed to compete with the most popular website on the planet? While Facebook surely has internal targets, they’re likely inventions of bias and PR-motivated sandbagging. Moreover, Facebook is free to spin the Reels rollout however it wants, just as it did in February, when it declared the product a smashing success. As outside observers, we have little choice but to cautiously accept its position, or to categorically reject it because, of course, that’s what it would say.
The point of all of this isn’t to say that data is worthless, or that all numbers are unredeemable lies. The point is that people often use data to tell a story: we almost went the way of the dinosaurs; Miguel Sano is a good fielder; we can’t shake COVID; a single TikTok video launched a movement; Facebook’s new feature is a flop. In doing so, people are sometimes compelled, by dishonesty, bias, or the simple desire to say something interesting, to twist that story to fit a particular narrative. As readers, we’ve been taught to be vigilant against such transgressions, and to watch out for doctored data, lying charts, and deceitful statistics. But to the talented con man (or, more generously, the clever marketer), these flamboyant crimes are a distraction, a magician’s misdirection from the real, direct action that’s taking place right under our noses: People can spin stories simply by telling us what to think. It’s not the data that gets us, but the adjectives that describe it.
Once you notice this phenomenon, you see it everywhere. Nearly every news story, every blog post, every analyst report, and even every email that references some corporate statistic follows the same pattern: a datapoint, followed by a brief description (or a subtle nudge, like the word “just”) that tells us what it means. Ask yourself, though: Would we come to the same conclusion with the data alone? As often as not, we wouldn’t. That isn’t because the conclusion is wrong, but because, when presented with data on some domain we don’t deeply understand,6 we have no choice but to look for clues and shortcuts to help us make sense of those numbers. Our best shortcuts are typically the words around the data, so we interpret it the way we’re told to. The claim decodes the data, and the data proves the claim.
So what do we do about it? I’m not sure there’s that much we can do. Most people aren’t trying to deceive us; we can’t throw out every number as a conspiracy theory; there’s a fine line between healthy skepticism and tinfoil-hat paranoia. Sometimes, we have to have faith in the system, and accept that the odds that we botch “our own research” are higher than the odds we’re being lied to.
But we should at least acknowledge that we should be as wary of words as we are of charts. As Cassie Kozyrkov put it in her retort to W. Edwards Deming’s famous quote, “without data, you’re just another person with an opinion”: with data, you’re still just another person with an opinion. Though we often assume these opinions are camouflaged in clumsy chart crimes, it’s far more common for them to be spelled out in plain language. And when people tell us their opinions, we should believe them.
Tell me you’re putting together a presentation on data science without telling me you’re putting together a presentation on data science.
It’s so cliché that, after typing “correlation doesn’t eq”, my Google editor suggests the rest for me.
As it happens, I know this California man. Delete this email; close this tab; unsubscribe from this Substack; read the Washington Post article; follow him; join his movement. You won’t be disappointed.
Which raises the question: What is a surge?
Which, for most of us, is every domain, minus maybe one or two.