What’s the Hadoop of today’s data ecosystem?
Shortly after I started working in tech in 2012, I attended my first data conference: Strata + Hadoop World. Held at the San Jose Conference Center, the same venue that hosted Apple’s and Facebook’s developer conventions, it was a confident coronation of the next big thing. Data vendors launched new products to packed auditoriums; customers, in scripted fireside chats, told us how digitally transformed they were; venture capitalists assured us this was just the beginning. The event’s premier sponsors—Cloudera, Hortonworks, and MapR—had raised hundreds of millions of dollars, pitched their software out of huge pavilions in the vendor hall, and were on the fast track to IPOs. Everything was ascendant: The startups, the ecosystem, and, most of all, the revolutionary promise of Big Data.
It didn’t go so well after that.
Hortonworks’ shares fell by more than seventy percent in the two years after its IPO. When Cloudera went public in 2017, it was worth half its private market valuation. MapR blew up, and Hewlett Packard Enterprise bought its “business assets” for an undisclosed sum.1 Strata rebranded to Strata Data & AI, and then got canceled. Hadoop eventually became a one-liner: At Gartner’s data conference this year, our fling with Hadoop was the punchline of a lot of self-deprecating jokes, like an embarrassing high school ex that was never good for us.
Of course, we didn’t discard data entirely; instead, we moved on to a new thing. Hadoop-based or -inspired data systems like Hive, Pig, and Impala got replaced by cloud data warehouses, which looked like ordinary relational databases, but very big, very fast, and relatively cheap. These products became platforms for dozens of new categories of data tools and hundreds of new companies.
Colloquially, we’ve come to call this collection of products the modern data stack. But that term represents more than just a set of tools; it represents the epochal sequel2 to the era of Big Data. MapReduce was hard to write; Hadoop was hard to maintain; the data science initiatives that these tools were supposed to unlock were plagued by unusable data and brittle pipelines. The modern data stack was the reactionary counter-movement to these problems.3 That movement includes tools and philosophical beliefs—SQL-first, cloud-first, decision support over fancy data science, modular over monolithic—that emerged organically, and were eventually canonized by dbt Labs. Their viewpoint became the industry viewpoint, something something ZIRP, and the “modern data stack”—as an ecosystem of tools and as The Way to make data valuable—went vertical.
But every movement, especially one as hyped and frothy as this one, will inevitably get some ideas wrong. Surely, something we’re excited about won’t pan out; surely, something will be our Hadoop.
Over the last eighteen months, I’ve asked a number of people what they think that might be. I’ve gotten a range of answers: ELT, streaming, the centralized warehouse, data catalogs, observability, the ever-impressive, long-contained, often-imitated, but never duplicated data mesh. Worthy candidates, all, but I can’t help but wonder if the answer is the entirety of the modern data stack. Just as the era of Big Data gave way to the modern data stack, it’s starting to feel like the era of the modern data stack is on the verge of being overtaken by the next counter-movement.
99 problems, 15 standards, 1 landscape
Like every startup pitch deck, every data talk has a few mandatory slides in its preamble. For years, one of the mainstays was a chart showing how data volumes double every two years. The talk track was always the same: “Businesses are drowning in digital papers, and we’re building MetaQuery.io to help.”
No longer. We’ve stopped talking about how data volumes are doubling, and started talking about how data tools are doubling.4 Every presentation now opens with a screenshot of Matt Turck’s MAD landscape from 2012—“This is what customers used to have to choose from.” Then, with a dramatic slide change—“Today, it’s this”—they show Matt’s 2023 landscape. “And we’re building MetaQuery.ai to help.”
The punchline is the entire modern data stack. It’s the now-widespread acknowledgement that there are too many tools, that we’ve created too many thin categories, and that what was meant to be a usable rewrite of Hadoop is now an unnavigable labyrinth of tools and intertwined costs. If a Big Data platform was too hard to set up, deploying the modern data stack has been, if anything, too easy.
These frustrations have shown up in a number of ways. People complain about tools being disconnected and hard to manage. Meta-vendors launch products that manage other vendors for you.5 There are constant rumblings about how much metered data warehouses cost. Some consulting firms market themselves around helping companies clean up and organize their dbt projects. Data quality is an ever-present problem. And new concepts like data contracts, active metadata management, and data control planes are direct efforts to control—and make a market from—the chaos that the modern data stack can often create. We had 99 problems, so we used the modern data stack—and now we have 100 problems.
In fairness, none of these issues are necessarily fatal, or even unexpected. Progress isn’t linear. We experimented; we found some things that work and some that don’t; we’re experimenting again. The best problem any new technology can have is that people want to use it too much.
Moreover, most of these new ideas aren’t rejections of the modern data stack’s foundational tenets, but iterations on top of them. Though attention-seeking hecklers criticize the modern data stack’s edges because there are likes and subscribers to be had by starting fights,6 most of us still agree with the modern data stack’s gospel. We complain, but rarely offer an alternative philosophy.7
Six months ago, I thought this was a steady equilibrium—two steps forward, one cynical blog post back, and messy progress for years to come. But in the last few months, I’ve changed my mind. I now believe we’re in the liminal8 space between two eras. In a few years, we’ll see this time as when we faded away from the modern data stack, and moved towards intelligent infrastructures.
The next discontinuity
I have a theory that technological cycles are like the stages of Squid Game: Each one is almost entirely disconnected from the last, and you never know what the next game is going to be until you’re in the arena.
For example, some new technology, like the automobile, the internet, or mobile computing, gets introduced. We first try to fit it into the world as it currently exists: The car is a mechanical horse; the mobile internet is the desktop internet on a smaller screen. But we very quickly figure out that this new technology enables some completely new way of living. The geography of our lives can be completely different; we can design an internet that is exclusively built for our phones. Before the technology arrived, we wanted improvements on what we had, like the proverbial faster horse. After, we invent things that were unimaginable before—how would you explain everything about TikTok to someone from the eighties? Each new breakthrough is a discontinuity, and teleports us to a new world—and, for companies, into a new competitive game—that would’ve been nearly impossible to anticipate from our current world.
Artificial intelligence, it seems, will be the next discontinuity. That means it won’t tack itself onto our lives as they are today, and tweak them around the edges; it will yank us towards something that is entirely different and unfamiliar.
AI will have the same effect on the data ecosystem. We'll initially try to insert LLMs into the game we're currently playing, by using them to help us write SQL, create documentation, find old dashboards, or summarize queries.
But these changes will be short-lived. Over time, we'll find novel things to do with AI, just as we did with the cloud and cloud data warehouses. Our data models won’t be augmented by LLMs; they’ll be built for LLMs. We won't glue natural language inputs on top of our existing interfaces; natural language will become the default way we interact with computers. If a bot can write data documentation on demand for us, what’s the point of writing it down at all? And we're finally going to deliver on the promise of self-serve BI in ways that are profoundly different than what we've tried in the past.9
As these changes—and dozens of others that we can’t anticipate—start to come into focus, a new set of philosophical beliefs will likely coalesce around them. We’ll figure out new ways to structure AI-powered technology stacks and AI-enabled data teams; at some point, as the good ideas separate from the bad, someone will pin a new set of theses to the modern data stack’s door.
If this happens, the tenets of the intelligent infrastructure could quickly outpace those of the modern data stack as the next new thing. AI technologies are advancing at a breakneck pace, and they've already captured the imagination of the enterprise data ecosystem in ways the modern data stack still hasn't. For the Fortune 500 buyers at the recent Gartner conference, the modern data stack is an up-and-coming flirtation. LLMs, by contrast, were already everywhere, with vendors and CIOs scrambling to adopt them.
Evolve, or die
The week after Gartner's conference, I attended Data Council in Austin. At one point, I found myself in a conversation with a newly-minted founder who told me they were excited about their company's potential because it could ride the momentum of generative AI and of the modern data stack.
On AI, yes, of course. ChatGPT proved to be the only thing that could replace Elon Musk as the permanent main character on Twitter;10 weeknight AI meetups in San Francisco are now better attended than major data conferences. For better or for worse, there’s no shortage of momentum in AI.
But the founder’s comment about the modern data stack struck me as something between anachronistic and contradictory. Though I don't believe the tools that make up the modern data stack will fail, the modern data stack as a movement seems incompatible with the rise of AI. It's a philosophy that was designed for a world in which reasoning through a data problem with a robot was a fantasy. That philosophy may be no more suitable for the world run on LLMs than MapReduce is for a world run on Snowflake.11
That’s the reality that data teams and data vendors both have to reckon with. Our world is changing; the future we’ve been fighting to create may soon become the past that we have to fight to escape. For the modern data stack, this is it.
And for a $69.99 negotiation fee, GoDaddy will help you negotiate to buy the domain mapr.com.
Get it???
I’ve joked before that the best definition of the modern data stack is data tools that launched on Product Hunt. It’s not an entirely accurate definition—Snowflake was never on Product Hunt, for example—but it gets at the right idea for me: That the modern data stack isn’t an actual stack, but the era when data tools became cloud-first, bottoms-up, and community-oriented. This definition also excludes tools like Oracle’s Autonomous Data Warehouse and PowerBI, which I suspect most people would say aren’t clear members of the modern data stack.
YC’s Law: The number of data companies in YC doubles every two years.
I’m a personal investor in both 5x and Mozart Data.
You could potentially argue that the data mesh is an alternative ideology. I don’t think I’d quite agree with that; to me, it feels more like a recommendation on how to make the modern data stack scale within very large companies. But some people may see that differently.
I swear, I went thirty-some odd years without knowing that this word existed, and now it’s everywhere.
Given that my day job is to build a BI tool, I’ve thought way too much about this one. A longer post for a different day.
It’s unclear who will destroy humanity first.
Of course, it’s also possible that AI is a flash in the pan; that the modern data stack is the tortoise that’s still going to win the race; and that now is the perfect time to build boring analog products when all the competition is distracted by something else. During a gold rush, sell pickaxes—or stay in Ohio, and be the only carpenter in town.
Love the idea of technological progress as teleportation instead of incremental changes!
LLMs are clearly on the level of SQL for the data industry. There's pre-SQL data and post-SQL data; the same will be true for LLMs.
That being said, it is incredibly exciting to be in data right now. We probably won't be wrangling dashboards anymore, but monitoring, deploying, tuning, feeding AIs. It'll be operationally critical to serve every deal and customer, and data won't be a nice to have, it'll be a must.