Should we be grateful for the modern data stack?
It’s easy to complain. It’s easy to be skeptical. It’s easy to say it’s all overhyped and overvalued. It’s easy to look at the rough edges of an unfamiliar tool, and wonder why such obvious problems haven’t been fixed.
There’s a place for this sort of commentary—this kind of community of inquiry is how we stumble forward together. But in only pointing out those flaws, looking at the uneven ground at our feet, it’s also easy to lose track of just how far we’ve traveled. Even in the relatively short amount of time I’ve been working in the data industry—I took my first tech job as an analyst at Yammer, a B2B SaaS company, in 2012—the amount of collective progress we’ve made has been dizzying.
In the early 2010s, the data teams at LinkedIn, Facebook, and Zynga were seen as revolutionaries, but inimitable ones: They were funded by massively hyped tech companies, flush with technical experts, and worked with some of the biggest and most monetizable datasets in the world. They were blazing trails, but trails that weren’t easily followed.
Yammer’s analytics team—my introduction to the data world—was a pioneer for the everyman. Though Yammer was a “high-growth” startup, it was a pedestrian one, no different than dozens of other tech companies of its size. Unlike the data science team at Google, which was organizing all the world’s information, our objective as analysts was to use product usage logs and CRM data to help a sales team hawk software to IT departments. Rather than trying to solve new problems or invent new uses of data, our explicit goal was to apply techniques stolen from Facebook and social gaming to sell enterprise software. To achieve that end—which is more or less the goal of nearly every data organization now—my bosses built a data organization that looks a lot like those of 2022, but without the glossy sheen from ten years of IT consumerization, YC-backed data startups, and trendy thought leadership. In other words, we were a modern data team without the modern data stack.
It’s a team, then, that offers answers to two questions: How did data teams actually work ten years ago? And how would teams have to work without the last decade of growth and innovation?
Just as our team wouldn’t look entirely out of place in a startup today, the architectural outlines of our data stack would be pretty familiar too. The modern data stack didn’t exist—this was 2012; Redshift, which arguably catalyzed the entire movement, hadn’t been released yet; we were still all enamored with the possibilities of Hadoop—but we ran a proto-version of it: We ingested data from a handful of sources into a centralized warehouse; we transformed it in the warehouse using a DAG of SQL scripts; we pushed that data out through ad hoc analysis, BI reports, and a handful of department-specific tools. We didn’t recognize any of these patterns by their current brands—ELT, analytics engineering, data apps—but the shapes were the same.
The experience of using it, however, was not.
The storage layer
Vertica, a columnar-store analytical database, sat at the center of our data stack. Like most databases at the time, we couldn’t buy a hosted version; instead, we bought a license—for about half a million dollars a year—to run the software in some server rack we leased in a data center in San Jose. Since we were responsible for its uptime, Vertica had to be maintained by a DBA, whose job was to monitor the cluster’s instruments and make sure everything was healthy.
Routinely, it was not. Bad queries had a tendency to send nodes sideways, which would not only clog up the warehouse, but would also require a coordinated set of operations to rehabilitate. What exactly caused these problems and how exactly were they fixed? I never quite knew. But I knew I had to be delicate, or else I'd send the people staring into the matrix on vertical monitors into furious fits of SSH'ing in and out of remote boxes, while everyone else fielded angry emails from execs about their dashboards being out of date.
After Yammer was acquired by Microsoft, we dabbled in “big data” alternatives to Vertica, eventually settling on Cosmos, Microsoft’s then-internal and now-public NoSQL warehouse. I remember being told there was useful data in Cosmos, though I never figured out what it was, because accessing it required writing queries in some Frankenstein language that mixed SQL and C#. So I stayed away, and let other people write the 800 lines of .NET-flavored MapReduce that was needed to parse SharePoint’s usage logs into an estimate of daily active users—and hoped that knowing that number wasn't important to the company.
The ingestion layer
Our ETL architecture was strikingly modern—but, much like our warehouse, plagued by problems that don’t concern today’s data teams. Nearly all of the data in Vertica came from three sources: The application Postgres database that powered Yammer’s product, an event stream from that same application, and Salesforce. In all three cases, we did very little transformation in flight.
To my knowledge, the application replica and its event stream were reliable and required very little babysitting—less, in fact, than Postgres-to-Snowflake-or-Redshift connectors do today. Salesforce, by contrast, was a nightmare. We had to build our own service for extracting data from its APIs into Vertica, and, despite an engineer working on the problem nearly full-time, it was a perpetual wreck. The Salesforce APIs were brittle. The Salesforce admins on the other side of the office would sometimes change Salesforce schemas, which would require updates to our connector. When something broke, extraction jobs would get backed up, and, like Vertica, require careful handholding to revive.
Because of this ongoing fire—and the cost of containing it—we generally avoided sourcing data from other third-party sources. An effort to get data from Exact Target, a marketing automation tool for sending emails, took several months; an effort to get data from Jobvite never made it past a hack day.
The transformation layer
Once loaded into Vertica, raw data was modeled by Integritie, our homemade transformation tool. It was similar to dbt, minus a bunch of developmental niceties: There was no IDE for it; you couldn’t run models in staging environments; jobs were written in raw SQL, without Jinja or macros. Instead, we followed a “works on my machine” model of development. Run queries locally, copy them into the Integritie repo, deploy it, and hope production runs as expected. Sometimes it obviously did, and sometimes it obviously didn’t. But most of the time, the query executed correctly, and—without any kind of testing environment or way to see how tables actually changed—our best gauge of how well something worked was if someone complained about it later.
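A toy version of that workflow—a set of raw SQL models, each depending on upstream tables, executed in dependency order with no templating layer—might look like the sketch below. The table names and the dependency map are hypothetical, chosen only to illustrate the pattern; this is not Integritie’s actual code, and I’m using SQLite as a stand-in for the warehouse:

```python
import sqlite3
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical transformation DAG: each "model" is a raw SQL script
# plus a list of the upstream tables it reads from.
MODELS = {
    "stg_events": {
        "deps": [],
        "sql": "CREATE TABLE stg_events AS SELECT user_id, action FROM raw_events",
    },
    "daily_active_users": {
        "deps": ["stg_events"],
        "sql": "CREATE TABLE daily_active_users AS "
               "SELECT user_id, COUNT(*) AS actions FROM stg_events GROUP BY user_id",
    },
}

def run_dag(conn: sqlite3.Connection) -> list[str]:
    """Execute every model's SQL in topological (dependency) order."""
    graph = {name: set(model["deps"]) for name, model in MODELS.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        conn.execute(MODELS[name]["sql"])
    return order

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, action TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [("a", "login"), ("a", "post"), ("b", "login")])
ran = run_dag(conn)  # stg_events runs before daily_active_users
```

The “works on my machine” part is everything this sketch leaves out: there’s no staging copy of the database to run against, so the first real execution of a new model is the production one.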
Given how easy it was to overrun Vertica with long-running queries, we spent a lot of time optimizing the jobs in Integritie. Models were almost required to load incrementally; rebuilding large tables was costly, and had to be manually orchestrated. This amplified the challenges of resolving upstream issues like Vertica getting backed up or Salesforce syncs stalling. We couldn’t just unplug it and plug it back in; we had to make sure failed jobs didn’t restart all at once, and nurse the pipeline back to health.
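The incremental pattern itself—append only the rows past a high-water mark instead of rebuilding the whole table—is simple to sketch. The schema and the `event_id` watermark column below are my own invention, a minimal reconstruction of the idea rather than anything Integritie actually ran:

```python
import sqlite3

def load_incrementally(conn: sqlite3.Connection) -> int:
    """Append only source rows newer than the target table's watermark."""
    # Find the high-water mark already loaded into the model table.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(event_id), 0) FROM daily_events"
    ).fetchone()
    # Pull just the new rows; a full rebuild would rescan everything.
    cur = conn.execute(
        "INSERT INTO daily_events "
        "SELECT event_id, user_id FROM raw_events WHERE event_id > ?",
        (watermark,),
    )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (event_id INTEGER, user_id TEXT)")
conn.execute("CREATE TABLE daily_events (event_id INTEGER, user_id TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [(1, "a"), (2, "b")])
first = load_incrementally(conn)   # loads both existing rows
conn.execute("INSERT INTO raw_events VALUES (3, 'c')")
second = load_incrementally(conn)  # loads only the one new row
```

The fragility we lived with follows directly from this design: if an upstream sync stalls and then dumps a backlog, every incremental model downstream has to be restarted carefully, in order, rather than simply rebuilt from scratch.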
The consumption layer
All of this effort and expense—which probably totaled a few million dollars a year in software costs and salaries—powered a handful of applications that drove “business value.” The analytics team spent most of their time in a browser-based SQL editor, with some very basic charts on top. We got questions, wrote queries in the tool, and sent URLs with answers back to whoever was asking. These links were our crude audit trails, and internal communications, Excel files, and Powerpoint speaker notes were littered with them.
The data engineering team also built a few proper data apps for more durable use-cases. There was an executive dashboard that showed our most important KPIs, with a few simple filters; there was a site for profiling customers and aggregating how healthy their accounts were; there was a set of pages embedded directly inside the Yammer application so that customers could see a few vanity metrics on how they were using the product. These were applications in the truest sense: They were built in Rails, hosted on their own servers, and ran on top of dedicated databases. As analysts, our only involvement in the development of these tools was to create tables in Integritie that mapped to exactly what each application needed. The app would then retrieve our tables and—assuming we’d formatted everything correctly, which I often hadn’t—update itself.
Most of these tools, however, were feeders into Excel. The email lists we sent to marketing got exported to Excel, and uploaded into Marketo. Product managers, wanting to tinker with A/B test results, exported reports to Excel. The monthly board deck was built in Powerpoint using charts that were generated by copying query results into “raw” data tabs in a giant Excel workbook. None of it was peer reviewed or versioned, and only some of it was centrally organized. Instead, much of our work—the culmination of all the technology that sat underneath it—was scattered across Dropbox folders of SQL queries and Excel files.
The questions we should be asking
The tools we have today—built and supported by thousands of people across dozens of companies—represent a profound leap forward from what we had then. And their effect extends beyond easing the daily frustrations of existing data scientists; they also made the work we did in 2012 accessible to a far greater range of companies and aspiring analysts and analytics engineers. Nearly every part of the industry is breathtakingly easier, faster, more powerful, and more reliable than it was a few short years ago.
But there’s one nagging inconvenience in the comparison between today’s data teams and the one I was on in 2012: Yammer’s data team was as impactful as any that I’ve ever worked with. It was a key part of the product development process; its members were honorary members of the marketing and customer success leadership groups; it was respected, in-demand, and had a voice in the strategic direction of the company. And all this was done on top of technology that was, relative to what’s available today, fragile, narrow, expensive, and powered by now-archaic computing capacity.
That’s the paradox we need to solve. Why has data technology advanced so much further than the value a data team provides? Does all of this new tooling actually hurt, by causing us to lose focus on the most important problems (e.g., the data in Salesforce) in favor of the shiny new things that don’t actually matter (e.g., the data in our twenty-fifth SaaS app)? Has the industry’s talent not caught up with the capacity of its tools, and we just need to be patient? Is the problem more fundamental? I’m not sure. But if our 2032 selves want to be as grateful for the 2020s as we should be for the 2010s, those are the next questions we need to answer.
Shoutout to sycophantic egomaniac George Hotz for taking this hubris to breathtaking new heights, only to immediately move the goalposts and ask someone else to kick the ball through them. (Also, how does this guy not only have a Wikipedia page, but have one with more references than the pages for Denzel Washington, Dan Quayle, Diana Taurasi, and the Battle of Fort Sumter? Ashley Feinberg, we need you.)
Where the everyman is a drug- and alcohol-fueled VC-backed Silicon Valley startup for which the typical laws of economics—e.g., you should make more money than you spend—don’t apply.
The theft was almost literal. Yammer’s product was designed to mirror Facebook, and the head of our data team was hired from Playdom, a Zynga competitor.
Good god, do not attempt to diagram this sentence.
Or at least, I’m pretty sure it cost this. And this is roughly in line with some random person’s answer to a random Quora question about a loosely related topic, so I’m gonna go with it.
The pricing structure was also wonky. Licensing fees were determined by the size of the warehouse (i.e., the number of nodes in your cluster, and the amount of memory in each node) you wanted to run, but you brought your own hardware. There were no fees for using the warehouse, though; once licensed, you could pummel it with as many queries as you wanted.
This was fairly normal for the time. For comparison, a few years earlier, Teradata, another major database vendor, released a “low-cost” and “fast-to-deploy” warehouse, which you could have up and running in ninety days for $350,000.
Based on a true story.
This may have been a feature and not a bug.
Prior to the Microsoft acquisition, Google Sheets was too immature to replace Excel; after the Microsoft acquisition, Microsoft’s ego was too immature to allow anyone to use Google products.
If you ever want to humble yourself, read about semiconductors and how they’re made. One of many fun facts: They have to use precision mirrors to manufacture semiconductors. What’s a precision mirror, you ask? It’s one so smooth that, if it were blown up to the size of Germany, its biggest bump would be a tenth of a millimeter high. Whatever you do, don’t walk on it.