How much is the news worth?
If you're OpenAI, it depends on what you’re building—and when you’re asking.
There are two ways to think about ChatGPT.
One way is that it’s exceptionally good at doing stuff. We give it an instruction, and, because it’s been trained on an enormous corpus of human language, it can respond in a way that would, if anything, fail the Turing test for being too capable. In an instant, it can write a French sonnet about New Year’s Eve; it can create an argument for why it should be a felony to write songs in C major; it can understand a 700-word blog post from which all the vowels have been removed. For tasks like these—writing emails, creating lesson plans, finding and booking a restaurant for a six-person get-together next Friday in New Orleans—ChatGPT is valuable because of what it can do.
The second way to think about it is that it knows things. We ask an LLM like ChatGPT a question; it tells us the answer. It’s valuable because it’s read every encyclopedia and textbook and Reddit post in the world, and can summarize—and in some cases, recreate—what those things say. Though LLMs don’t store this information in a traditional sense—there is no file in GPT-4 that contains the full text of the Declaration of Independence, for example—ChatGPT can still rewrite the entire document. In this way, LLMs aren’t useful because of what they can do, but because of what they know—like who Calvin’s babysitter was, who scored the most points in a WNBA game, and which song starts with the notes “da da da dum.”
This second version of ChatGPT—the one that, above all, knows things—is the version that caused people to declare Google dead, and caused Google to freak out, when OpenAI released it. Whereas Google can find links that might answer your questions, ChatGPT answers them directly. Its appeal was as the ultimate lazyweb.1
If this is the role that LLMs come to occupy—Google 2.0, basically—copyrighted content from books and news publishers is immensely valuable to OpenAI. To replace Google, ChatGPT would need to “know” most of what Google can find—and Google can search the entire internet, including copyrighted websites. Without access to that content, ChatGPT isn’t a better Google; it’s a chatbot for summarizing Wikipedia and Reddit.
If ChatGPT ultimately occupies the first role—a bot that does stuff; an agent—OpenAI doesn’t need copyrighted material. An AI agent would be useful for the same reasons that a human agent is useful, and human agents are useful because they can complete complex tasks based on ambiguous instructions. They don’t need to know that much; they need to be able to communicate, reason about problems,2 and look stuff up. And just as a human assistant can be a good assistant without memorizing the script of Star Wars or what was said in the Wall Street Journal yesterday, an LLM can probably be trained to be a useful agent without being trained on copyrighted content. Give it enough high-quality text, from any source, and it can learn to talk as well as any of us.3
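To make that distinction concrete, here is a minimal sketch of an agent loop, in Python. Everything in it is invented for illustration: `call_llm` and `web_search` are hypothetical stand-ins for a real model API and a real search or booking tool, not anything any vendor actually ships. The point is that the value comes from the loop (communicate, decide what to look up, use the result), not from anything the model has memorized.

```python
# A toy agent loop: the model doesn't need to "know" the answer; it only
# needs to decide what to look up and how to use the result.
# `call_llm` and `web_search` are hypothetical placeholders, not real APIs.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., a request to a model provider)."""
    if "[tool result]" in prompt:
        return "Hypothetical Bistro can seat six next Friday; want me to book 7pm?"
    return "SEARCH: restaurants for a group of six, New Orleans, next Friday"

def web_search(query: str) -> str:
    """Stand-in for a real search or booking tool."""
    return f"Top result for '{query}': Hypothetical Bistro, takes parties of six."

def run_agent(task: str, max_steps: int = 5) -> str:
    context = task
    for _ in range(max_steps):
        response = call_llm(context)
        if response.startswith("SEARCH:"):
            # The model asked to look something up; fetch it and keep going.
            result = web_search(response.removeprefix("SEARCH:").strip())
            context += f"\n[tool result] {result}"
        else:
            return response  # The model produced a final answer.
    return context

print(run_agent("Find and book a table for six next Friday in New Orleans."))
```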
Despite the initial panic at Google, I’d be surprised if ChatGPT comes for search. Though that’s partially because LLMs aren’t, on their own, reliable narrators of fact, it’s much more because the economic value of agents that do stuff is potentially far greater than the economic value of a chatbot that knows stuff. “We can help your accountants answer common questions about tax regulations” is a nice pitch, but a fundamentally incremental improvement over Google; “we can create an infinite army of cheap digital labor that can do a lot of the tasks your employees do” is transformative. The frontier of ChatGPT’s potential isn’t in replacing Google, but in using Google4—and in the same way that the cost of manual labor made industrialization all but inevitable, the cost of skilled labor probably makes the agentization of fake email jobs5 all but inevitable too.6
In other words, for the enhanced search engine that OpenAI is today, copyrighted content is necessary. Omniscient oracles need to read the news to be omniscient. But for the autonomous agents they’ll likely become, copyrighted material is simply convenient—news websites, for example, are generally reliable, accurate, well-written, constantly produced in large quantities, and can be collected from relatively centralized sources. But any sufficiently diverse body of text will do.
Regime change
Well, hypothetically. I don’t know if you can actually build ChatGPT without copyrighted content because OpenAI definitely built ChatGPT with copyrighted content.
Late last month, the New York Times sued OpenAI for doing exactly that. Their case alleges that OpenAI violated copyright law by using millions of Times articles to train their models without paying proper licensing fees; that this unlicensed use of the Times’ intellectual property allows OpenAI to “compete with and closely mimic” the Times; and that this is providing material economic benefits to OpenAI while causing considerable economic and reputational harm to the Times. The Times “seeks to hold them responsible for the billions of dollars in statutory and actual damages,” and is calling for the destruction of “all GPT or other LLM models and training sets that incorporate Times Works”—which, presumably, includes every publicly-available model created by OpenAI.
Some people said it’s a dumb and short-sighted move by the Times7, akin to Blockbuster suing the internet instead of building a streaming service. But that seems to miss the point of the lawsuit: It’s not about futilely trying to hold back the tide of technological progress; it’s about making money from that tide.
The Times surely knows they can’t whack-a-mole the generative AI genie back in the bottle. I also doubt that they see generative AI as an immediate threat to journalism itself—LLMs can’t do on-the-ground reporting, and the early efforts to replace writers with AI-powered bots have been mostly disastrous. Instead, generative AI is a potential threat to the journalistic business model. If people can ask for summaries of the news from ChatGPT—if The Daily can be recreated by an LLM that ingests yesterday’s news stories and reads you a summary—then, as the suit alleges, OpenAI’s products “undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”
That argument makes for a reasonable lawsuit, but it's probably not the suit’s real goal. As others have said, Napster is a useful precedent. Napster—and the internet more broadly—didn’t replace musical artists; it replaced their methods of distribution. Though the music industry sued Napster out of business, the real importance of the Napster lawsuits was that they eventually helped create a new economic regime that ensured musicians and record labels would still get paid when streaming digital music replaced records and CDs. The New York Times’ lawsuit probably won’t bankrupt OpenAI, but its foundational objective seems the same: to establish an economic and legal precedent that protects publishers if chatbots and LLMs replace search engines as the lobby to the internet, and still enriches them if they become trillion-dollar generative AI agents that no longer need the news.
Suppose, for example, that a very prominent Facebook user sued Facebook by arguing that Facebook was monetizing their labor and illegally withholding compensation for it. That lawsuit would seem ridiculous now, but in 2006, when we were all figuring out what social media was, it could’ve been seen as legitimate. And had the user won, the economics of social media would’ve been redrawn, potentially to a point where today’s model—one in which users don’t get paid—would be the one that sounds ridiculous.
In fifteen years, there will be legal precedents about how generative AI vendors compensate content creators for using their copyrighted content. There will be norms and expected standards. And, perhaps even more importantly, if generative AI vendors are building agents and not search engines, they won’t even care about using copyrighted content.
But today, they do care, a lot. Large libraries of high-quality text are suddenly very valuable to some very rich people—and the Times has a lot of it. So if you’re the Times, why not sue? Not as a defensive move, but to try to create a new, legally enforceable revenue stream, at a time when you have maximal leverage—your data is still critical, court precedents are still being established, informal norms are in your favor, and the frenzied AI land grab is putting pressure on every vendor to move faster at all costs.8
In this climate, OpenAI probably can’t risk going to trial. If they lose—and they could; copyright lawyers have said that OpenAI is in “dangerous legal territory”9—it could cost them billions of dollars, require them to destroy models that they’ve spent enormous amounts of time and money building, and spawn dozens more lawsuits from other copyright holders.10 Even if they won, a trial could put their development teams in months or years of legal limbo, unsure of what training data they can and can’t use.11
And for the Times, the worst case is that you lose, pay some legal bills, and go home. The middle case is that you force OpenAI to the negotiating table, and settle on a marginally better licensing deal than you would’ve otherwise gotten. And the best case is that you use some legal leverage and OpenAI’s paranoia about falling behind in the cutthroat market to secure a long-term precedent that gives you a perpetual piece of the generative AI pie.
Or, if nothing else, you might make a tech executive mad and convince them to buy you out for a 38 percent premium, I guess.
“Dear lazyweb, what are some good fake court cases I can use to defend my client who’s been accused of constantly making stuff up?”
I understand LLMs can’t reason in a formal sense, but they can create artifacts that are indistinguishable from things that have previously required human reasoning to create.
There’s also some evidence that general knowledge makes LLMs more effective assistants than specialized knowledge, even when they’re asked to do specialized tasks.
Bing, whatever.
This, the yutes tell me, is basically an office job.
Notably, today, LLMs are probably ahead of the world’s ability to use them as these kinds of agents. To adapt an analogy I’ve used before, LLMs are like cars: For them to be maximally useful, you need the car and a bunch of roads. Our current LLMs are decent cars, but we have very few roads that are designed for them to drive on. The eventual development of AI agents will depend on our ability to build those roads. But, because the economic benefits of having both are so huge, it’s hard to imagine people not eventually building the roads.
Actually, they said that “asking to be deleted from the amortized collective memory compressed in GPT4 weights” is “information theoretic self-immolation,” which, like, sure, to infinity and beyond, man.
OpenAI has licensing deals in place with The Associated Press and Axel Springer, so there are private precedents that suggest that OpenAI believes that copyright holders will ultimately be entitled to some compensation.
I also have so many questions about the terms of these deals. The details of the deal with AP weren’t disclosed; the Axel Springer deal reportedly “includes a ‘performance fee’ based on how much OpenAI uses its content” and “is worth more than $10 million per year.” Is OpenAI paying per article they use to train models? Do they pay for a kind of perpetual license that allows them to use that article as much as they want, or do they pay more if they set higher weights on Axel Springer articles? (One of the arguments in the Times’ lawsuit is that OpenAI sampled the Times’ content more frequently than other sources when they trained their GPT models.) Does OpenAI pay Axel Springer just for access to training data, or do they somehow get paid on inference? How would that even work? And how does Axel Springer have any way to monitor how much OpenAI is actually using their content?
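For what it's worth, here is one toy illustration of what "setting higher weights" on a source could mean mechanically, and how a usage-based "performance fee" might be computed from it. The sources, weights, and dollar figures below are all made up, and nothing here reflects how OpenAI or Axel Springer actually structure training or payment; it's just a sketch of the kind of accounting such a deal would seem to require.

```python
import random

# Invented example: documents are sampled into training batches in proportion
# to per-source weights. None of these sources, weights, or dollar figures are
# real; they only illustrate what a usage-weighted "performance fee" could mean.
source_weights = {"axel_springer": 3.0, "wikipedia": 1.0, "other_news": 1.5}

def sample_source(weights: dict[str, float]) -> str:
    """Pick a source at random, proportionally to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources])[0]

# Simulate 100,000 sampling decisions and measure each source's usage share...
draws = [sample_source(source_weights) for _ in range(100_000)]
usage_share = {s: draws.count(s) / len(draws) for s in source_weights}

# ...then split a hypothetical annual fee pool in proportion to that share.
FEE_POOL_DOLLARS = 10_000_000  # purely illustrative
fees = {s: round(share * FEE_POOL_DOLLARS) for s, share in usage_share.items()}

print(usage_share)  # roughly {'axel_springer': 0.55, 'wikipedia': 0.18, ...}
print(fees)
```

Even in this cartoon version, the monitoring question above is the hard part: the publisher would have to take the vendor's usage numbers largely on faith.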
I have no idea about the specific legal merits of the Times’ lawsuit, but the basic facts of the case, which nobody seems to dispute, don’t look great for OpenAI? Like:
OpenAI scraped millions of New York Times articles without paying for them.
Sometimes, OpenAI repeats those articles verbatim.
OpenAI is making tons of money by using those articles commercially.
The New York Times is not.
I don’t know anything about the fair use doctrine, but that seems bad!
And maybe even more importantly, it could destroy the experience of using ChatGPT. OpenAI already appears to be scrambling to prevent their models from creating other copyrighted work. Imagine if these warnings and restrictions were tighter, and every time you asked it to do something that might touch a copyright, it tried to awkwardly steer away from it. It’d be like watching Any Given Sunday where everything is supposed to be the NFL, but nothing actually is the NFL.
Here’s a fun tin foil hat theory: The case is actually a proxy war between Google and Microsoft, in which the New York Times, a Google partner, sues OpenAI, a de facto Microsoft subsidiary, to force them into some sort of legal and technical purgatory to slow them down. Is it true? Probably not; I made it up. But it would make the inevitable OpenAI movie even better.
It seems that "high quality text" will almost certainly need a clear quantification of "quality" in the future for accurate monetization to occur.
My 2024 brain has zero ideas for quantifying the quality of text across the web. I could start by saying heuristically that the average NYT article would probably be at least a 7/8 out of 10, while The New York Post couldn't be more than a 4/5 out of 10. But then, after major news outlets, you get into weird data sources like Reddit, where r/wallstreetbets would score a 0/10 but a subreddit like r/personalfinance could score an 8/10, and then of course it would also vary by user and context.
As I'm writing this, it seems that only large organizations like the NYT would have the argumentative and lobbying/lawyer power to ever have their text monetized. And then how do you monetize it?
Interesting read and thought provoker. I originally thought the court case was a bit silly from the Times, but now I agree with your point that they are just looking to "make money from that tide."
I was just having a conversation the other day with someone who thought of an LLM's usefulness as just being for what it "knows". We talked about an example of an LLM that is a company historian that knows what decisions were made and why they were made at your company. Just as soon as this idea came up, we discussed the problems with a historian. Granted, our problems were mostly unrelated to licensing - although privacy/HR was a potential problem. However - the biggest consideration was how companies would want to redact certain information, restrict other information, and completely forget a 3rd set of information. Those would be pretty difficult activities to achieve with 100% accuracy - and mistakes could be costly, dangerous, harmful, etc. Plus, what company (or person) actually wants a perfect memory of what happened - we choose to forget stuff all the time. :-) Anyways - all that to say I'm in the camp of "doing" being more useful than "knowing". However, I think part of making the doing useful is referencing external data through databases, APIs, and web-crawling (like a search engine would). Therefore the knowing is less Oracle-like and more search-engine-like. However, if the Oracle route was chosen, I agree - there is no way the thing would work without lots of copyrighted material.