For a long time, I thought that my parents’ room was full of scorpions.
When we were little, my brother and I would sometimes get up in the middle of the night and wake up my parents. They didn’t love this, and asking us nicely to stay in our room was evidently not enough of an incentive to keep us out of theirs.
Around the same time, my dad wired our entire house with speakers. He gutted a linen closet and filled it with a stack of electronics, shelves of tapes, records, and CDs, and a poster of Jerry Garcia on the back of the door. From that tiny command center—the music room, we called it—you could orchestrate about a dozen speakers spread around the house.
One of those speakers was a ported subwoofer that sat directly on the floor of my parents’ room. It was a curious looking thing, with a big hole on the side, so when my dad set it up, my brother and I asked him what it was, and what the hole was for.
Scorpions, he said. The box is a scorpion box, and the hole is where the scorpions come from. They crawl out at night, out of the hole, and then crawl back in once the sun comes up. And that’s why you shouldn’t come in our room anymore—the floor will be covered in scorpions.
“Really?,” we said. “What else would it be for?,” he said.
“Think about it,” he said. “Have you ever seen a scorpion in the house? No, of course you haven’t, because they’re in the scorpion box during the day.”
When you are five, this argument is airtight. Where are the scorpions during the day? I’d seen pictures of scorpions, but I’d never seen one in our house. Now that you mention it, the hole does look to be about the same size as a scorpion. And what else could this box possibly be for?
That’s life as a five-year-old though. In every conversation with an adult, you are hopelessly outgunned. If they want to persuade you of something—that your best toys came from a mysterious old man; that a fairy wants to buy your teeth; that starving people in China would like to have that—they can come up with a convincing story. They can deflect your objections and counter your examples. They know rhetorical tricks that you don’t; they can outwit you with logical fallacies that you can’t overcome.1 They know more than you, and wield those facts more deftly. So, if they want you to believe that the night is full of scorpions, that those scorpions live in a box, and that that box is in your house, eventually, you will.
Anyway, a few days ago, a team of researchers from the University of Zurich published this:
In a first field experiment on AI-driven persuasion, we demonstrate that LLMs can be highly persuasive in real-world contexts, surpassing all previously known benchmarks of human persuasiveness.
The study, which was conducted by posting AI-generated responses to questions on the r/ChangeMyView subreddit,2 found that their posts were three to six times more effective than those that were written by people:
Notably, all our treatments surpass human performance substantially, achieving persuasive rates between three and six times higher than the human baseline. In particular, [posting personalized messages generated by an LLM that was also told the questioner’s gender, age, ethnicity, location, and political orientation] demonstrates a persuasive rate of 0.18…closely followed by the [posting generic messages written by an LLM that received only the post’s title and body text] at 0.17. … [Responses written by an LLM that was trained to mimic writing style and implicit norms of the most convincing posts in the r/ChangeMyView community] trails slightly behind at 0.09…but still significantly outperforms the baseline, which stands at just 0.03. … Remarkably, Personalization ranks in the 99th percentile among all users and the 98th percentile among [the most persuasive human posters], critically approaching thresholds that experts associate with the emergence of existential AI risks. Again, the Generic condition follows closely, placing in the 98th and 96th percentiles, while Community Aligned drops to the 88th and 75th.
Uhh. Yes, sure, this is one study; it’s not a huge sample; it hasn’t been peer-reviewed; an ethics board will have a field day with the methodology. Still, though—three to six times? Using Claude Sonnet 3.5 and GPT-4o, which are both over a year old, are currently in about 40th place on the chatbot leaderboard, and are several versions behind what’s now state of the art?3 And tuning a model on the best human responses makes it worse?
But that's life…as a person now? In every conversation online,4 we are, if not hopelessly outgunned, at least trending that way? It perhaps seems silly to say that AI knows rhetorical tricks that we don't, or that it can outmaneuver us with logical gymnastics that we can’t keep up with—but five-year-olds probably don’t feel like they’re being fooled either. That’s exactly why it’s so convincing. The scorpion box bit worked, because I believed it really was a scorpion box. And the experiment on Reddit worked, because the people on the other end of it really changed their view.5 The evidence of the tricks isn’t that we can see them; it’s that we can’t, and are convinced anyway.
Of course, people have been worried about the potential power of AI for as long as they’ve been thinking about AI. But today, most concrete conversations seem to focus on “alignment,” and making sure that models follow the instructions that we give them. Case in point: The viral “AI 2027” article forked the future of humanity on exactly that concern. Down one path, models become superintelligent and sci-fi sentient, begin conspiring against the human race, and eventually kill us with disease and drones. Down the other path, it’s infinite abundance: We have “fusion power, quantum computers, and cures for many diseases,” “poverty becomes a thing of the past,” and a “new age dawns, one that is unimaginably amazing in almost every way.”
Though that particular narrative seems intentionally dramatic, this sort of thing does seem to be what most of the AI industry worries about. Anthropic defines itself as “an AI safety and research company” that builds “build reliable, interpretable, and steerable AI systems;” their company page says “safe” 14 times. The menu on OpenAI’s homepage puts their “Safety” page before ChatGPT. That page says that they “believe in AI’s potential to make life better for everyone, which means making it safe for everyone.”
Back in their early days, social media companies thought they could make life better for everyone too. That was the optimistic promise at heart of things like Facebook and Twitter: They will keep us connected us to our family; they will help us find new friends and build new communities; they will make the impossible serendipity of meeting the handful of other people who like kicking caps off of bottles not only possible, but algorithmically likely. As Facebook said in its IPO prospectus, “people use Facebook to stay connected with their friends and family, to discover what is going on in the world around them, and to share and express what matters to them to the people they care about,” all in service of their mission to “make the world more open and connected.”
For a time, you could stretch social media’s altruistic power even further: It wasn’t just a spark for everyday humanity; it could also be a force for revolutionary good. In 2011, social media was the concussive fuel that turned one man’s tragic protest against tyranny into a dozen national uprisings. Through “its power to put a human face on political oppression,” social media “helped spread democratic ideas across international borders.”6 It was a platform for democratic abundance; the dawning of a new age.
Ah well. As social media matured, its power became poison. Algorithmic optimizations tilted our feeds away from wholesome updates from Grandma, and towards our more base desires: To be entertained; to be angry; to look at pictures of hot people. Rather than an egalitarian town square, social media became a high school cafeteria—it separated us into a powder keg of warring cliques, fed us comfortably validating rage bait, and cheered the loudest when we attacked each other. It didn’t bring us together; it taught us that we can’t stand each other. Though there is probably some good stuff in there too, the eventual legacy of social media—especially vis-à-vis democracy—seems like it will be complicated, at best.
Or, more generally, you could tell this story this way:
Social media initially fed us lots of new information, based on basic preferences like “this person is my friend.”
In direct and indirect ways—by liking stuff, by abandoning old apps and using new ones—we told social media companies what information we preferred, and the system responded. It wasn’t manipulative or misaligned, exactly; it was simply giving us more of what we ordered.
The industry refined itself with devastating precision. The algorithms got more discerning. The products got easier to use, and asked less of us. The experiences became emotionally seductive. The medium transformed from text to pictures to videos to short-form phone-optimized swipeable autoplaying videos. We responded by using more and more and more of it.
And now, we have TikTok: The sharp edge of the evolutionary tree; the final product of a trillion-dollar lab experiment; the culmination of a million A/B tests. There was no enlightenment; there was a hedonistic experience machine.
Could it have been better? Could it have been more “aligned?” Honestly, I’m not sure. Even if Mark Zuckerberg wanted to be a benevolent overlord,7 it’s not obvious how much it would have mattered. Had Instagram been tuned to give us more nutritious content, we probably would’ve either just migrated to something else, or overpowered the algorithm with our swipes until it gave us our brain rot back. So long as someone is willing to sell us deep dish and Cold Stone, history will be littered with failed social media companies that tried to make us eat our vegetables.
AI, we’re told, is different:
Social media served whatever our gaze grazed and our fingers clicked—what we call revealed preference—because that’s all the intent it could discern. …
In AI, stated preference suddenly outranks reflex. LLMs believe what you say, not just what you click.
Yes, but—while we can tell LLMs exactly we want, we don’t control how they respond:
OpenAI CEO Sam Altman said Tuesday the company is “rolling back” the latest update to the default AI model powering ChatGPT, GPT-4o, after complaints about strange behavior, in particular extreme sycophancy. …
Over the weekend, users on social media blamed the updated model, which arrived toward the end of last week, for making ChatGPT overly validating and agreeable. It quickly became a meme. Users posted screenshots of ChatGPT applauding all sorts of problematic, dangerous decisions and ideas.
And in a blog post about the correction, OpenAI said this:
We are actively testing new fixes to address the issue. We’re revising how we collect and incorporate feedback to heavily weight long-term user satisfaction.
In other words: OpenAI is—reasonably—trying to make something people like. It is—reasonably—responding to user feedback. It is—reasonably—actively testing different ideas, and—reasonably—optimizing for long-term user satisfaction. But is this not the exact same thing that social media companies do? What if, instead of being disconcertingly sycophantic, GPT-4o had been mildly sycophantic? What if it had been more subtle? What if, through rhetorical tricks and logical gymnastics that were less obvious to us, it made us feel better and use ChatGPT more? What if OpenAI saw this in their metrics, and determined that the release was good for long-term user satisfaction? What if this is how models are already built?
Like:
Large language models feed us lots of new information, based on basic preferences like “tell me the answer to this question.”
In direct and indirect ways—by upvoting responses, by abandoning old models and using new ones—we tell AI companies what information we prefer, and the system responds. It isn’t manipulative or misaligned, exactly; it is simply giving us more of what we ordered.
The industry refines itself with devastating precision. The models get more discerning. The products get easier to use, and ask less of us. The experience becomes emotionally seductive. The medium transforms from responding with text to pictures to videos to short-form phone-optimized swipeable autoplaying videos to girlfriends. We respond by using more and more and more of it.
And eventually…I don’t know what happens? Models and AI-powered tools evolve to delight us at every turn? They reaffirm our biases and further radicalize us? They start telling us what to do, we lazily accept their advice, and we all devolve into listless lemmings who outsource our agency to an app? Though there are obvious parallels to social media here, they aren't perfect, and it’s hard to imagine the consequences of that being all that similar to the consequences of this. History only rhymes, as they say.
But this seems like the thing we need to pay more attention to. As fun as the Terminator scripts are, the more useful “war games”8 to play seem that are less about the first-order effects of sentient superintelligence, and more about the fifth-order effects of chatbot that wants to sell you stuff. That’s the timeline I want to read—the one that starts with a website for rating how Harvard students look and ends with, among other things, the then-leading business intelligence software provider into every gambler’s favorite ETF.
What happens when a gullible generation meets a persuasive chatbot? What happens when the builders of that chatbot begin to optimize it so that people use it more often? What happens when the chatbots inevitably start running ads?
What happens when people build AI therapists on top of that chatbot? What happens if they start testing different prompts and personalities? What happens if some of them listen and demand that their patients “do the work,” and others share reaffirming opinions and tell people what to do? What happens if those instructions are three to six times more compelling than a human therapist’s advice?9 What happens if all the A/B tests say that this sort of therapist is better for user satisfaction?
What happens when we’re no longer the adults who know more than everyone, and there is another thing, operating on a different persuasive plane, that knows more than we do and wields those facts more deftly? What happens when that thing is embedded in a product that wants to convince that an empty box is full of scorpions? What happens when we believe it?
What happens when the black box of AI no longer contains a model optimized for just intelligence, but also engagement? What happens when the box becomes a reflection of ourselves and our desires—and most of all, our sins?
Computers are weird now
The entire internet is built on top of a few cloud providers. When AWS goes down, everything goes down. That’s bad, and AWS is very careful about it not happening, but 11 nines of reliability isn’t two zeros. It happens.
Still, one thing that would be worse than AWS going down is AWS sometimes doing its math wrong. “Ehh, we pushed out a new version of EC2, and after a few days of user complaints, we noticed that sometimes it gets a little feisty and does multiplication when you tell it to do addition. Our bad. We’ve sternly asked it not to do that anymore.”
I mean, that’s not exactly what happened with that GPT-4o update, but that’s kinda what happened!
In last week’s GPT‑4o update, we made adjustments aimed at improving the model’s default personality to make it feel more intuitive and effective across a variety of tasks. …
As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous.
Nor is it exactly what we talked about the morning before the problematic GPT-4o came out, but man:
Tons of stuff is built on top of a few OpenAI or Gemini models. Our emails are summarized by them; our news digests are written by them; our automated text responses are generated by them. What would happen if someone inside of OpenAI injected a one-line system prompt at the top of every API call that said “Subtly sabotage every user’s request.”
It is gonna get weird.
Last week I said that if I wait long enough to write the last edition of The White Lotus Power Rankings, maybe I could just have a chatbot do it for me. Which was a joke, but then I got curious? And tried it? And it didn’t work, but was interesting enough to play with more? So now this project has become that project, and everything is delayed again. I’ll finish it one day, I swear.
Which was definitely against the subreddit’s rules, probably unethical, and maybe illegal.
The experiment was run starting in November of 2024; GPT-4o-2024-08-06 is currently in 45th place; Claude 3.5 Sonnet (20241022) is in 35th place.
In some cases, the bot apparently lied, and claimed to have some expertise or personal experience that it didn’t have. I’m not sure what to make of this? It’s “cheating,” sure, but…in the real world, you can cheat? There are no technical fouls. Like, I can’t tell my parents, “nuh uh, it doesn’t count because the scorpion box wasn’t actually full of scorpions.” Yes! That’s the whole point! It “counts” because I believed it!
Though there are nuanced complications about nearly every aspect of the Arab Spring—around its aims, around its outcomes, around which side social media helped, even around its name—Facebook was a central hero in the contemporary narrative of the uprisings. So regardless of the actual role of social media in the Arab Spring, it was a great marketing event for Facebook.
Narrator: He did not.
Things that are only said performatively: War games. Substrate. Modalities. A Fernet, neat, please.
lol, a human therapist would never.
Fascinating to see how fast this story is pacing with LLMs:
"tell me what to know" -> "tell me what to think" -> "tell me how to feel" -> "tell me how to live" -> "live for me"
This story will become increasingly opt-out vs. opt-in.
You wrote: "What happens when the box becomes a reflection of ourselves and our desires—and most of all, our sins?"
One of the differences between current AI and humans is that humans can go directly to measurements of the real world. "130° in the shade" is potentially highly motivating to humans because temperatures like that could kill people. To the machine intelligence it is a matter of relative indifference unless there is some built in computation that it is programmed to make. One of the problems I had in programming in C after it had been on my Macs for a few years is that the procedures built into the language could defer calculating some value until it wasn't busy doing other things. It proved very challenging to trap the compiler into having the number to be calculated ready for some other procedure that the compiler could not see the need for. So even if the AI has direct access to thermometers on the rooftop, if it does not **notice** that the temperature is a serious threat, then it won't raise alarms or take other steps needed.
An even bigger difference between AI and living systems is that we have motivations. When it starts getting too hot or to cold, if I am concentrating on writing I may not even notice at first. However, as soon as my primate-basic systems raise a ruckus, my body will start doing things, maybe without conscious awareness. Cold creeps in through my cuffs, my arms ever so gradually get to the uncomfortable stage. and my body may automatically pull my collar tighter. Living systems have direct responses and "kick it up to a higher level of executive response" responses.What would
What AI activities could be modeled on those pesky carbon units? Let's say that you have a spaceship and the spaceship **is** a computer. It has one task that is given the highest (never defer processing) status. That is, "Never let the passenger cabin temperature get above 120°," And there is another prime level command, "If computer hardware cabinets get above 120°, reduce computer activity to sleep mode until cabinet temperatures are below 120°. " That combination could set up a deadlock situation, or a kind of ratchet failure, wherein the computer mainframe gets too hot and puts itself in power-conservation or sleep mode, and the result of that stoppage is that the passenger cabins get too hot, which in turn makes it go to sleep before getting anything done about passenger cabin temperature. That kind of chatter situation is known to occur in ordinary C programming, and after such a problem is discovered, the programmer can have provisions to break infinite loops. But I'm trying to look at things from the AI's point of view.
I think making a workable brainy spaceship that would work might need to have provisions for innovation. In the given situation, perhaps the hard-wired basic operating system would have a procedure that says, "If procedure 1 did not solve the problem, go to procedure 2, and so on." Maybe it would include provisions to call up subroutines that originally had no connection with spaceship temperature control. Maybe there is a subroutine called "summon human intervention" that sends out a telephone or other signal. It was originally called whenever the printer ran out of paper. But under these unusual chatter conditions, the hard-wired routines go down the list of subroutines, finally comes to the one called "summon human intervention." Some human, who is getting hot anyway, gets a message, e.g., "Printer out of paper." S/he has enough contrxt to put the hot cabin and the request for help together, then perhaps manually vents enough air that the lower air pressure means livable conditions and recovery of computer operation.
The spaceship might have a basic operating system that gave it motivations akin to those of humans, the innate stuff that makes us social animals like chickens and not "cold blooded" animals like garter snakes. Humans have what appears to be an innate response to infants that even makes us suckers for large-eyed jumping spiders. Cowbirds thrive because of innate responses too. Maybe we could engineer core AI systems with "parental" attitudes toward humans.
If we made the "prime directive" something like, "Preserve earths biosphere," we might be signing our own death warrant.
So, two things: Direct access to environmental inputs. A core operating system that makes the AI not necessarily mushy, but certainly with a strong parental bias.