AVG(text)
A data intelligence platform doesn't turn natural language into math. It runs math on natural language.
Imagine that you finally did it. You left your tech job—high-paying, ostensibly prestigious, amenity-maxed but catatonic—for the tactile life, and opened a bar.
However, as both a former enlightened data professional and a hungry overachiever with a deep-rooted need to win, you couldn’t go full analog. If you’re going to run a bar, it’s going to be the best—and the best businesses use data. Your Square account streams every transaction into Databricks. Hundreds of tiny scales constantly measure the weight of every liquor bottle, providing by-the-minute snapshots of your current inventory. You track foot traffic into your bar; you track how loud every corner is; you track all the climate readings from your Nest. You purchase demographic data about your customers. You collect market data to find popular liquors and beers. You build custom pipelines that trawl TikTok and YouTube for viral new cocktails. You keep your data in pristine condition, perfectly modeled, fully documented, constantly monitored and exhaustively tested. Your dashboards are both seasonally adjusted and weather-adjusted, using local temperature and precipitation data bought from the Databricks Marketplace. Your bar is a digital native; a smart bar; a bar built for the internet of things.
Your bar is also failing.
It’s never empty, but rarely full enough to cover your costs. Stressed by the bar’s struggles, you take a week off to clear your head. While you’re gone, a friend decides to help out. They offer a free beer to any customer in exchange for two minutes of stream-of-consciousness feedback, recorded by a video camera set up in a booth in the back. Over seven days, for a few thousand dollars, they record about 750 videos.
You get two texts when you get back from your vacation. One, from your friend, sharing their research, with a link to a Dropbox folder full of customer interviews. And two, from your bar’s board of directors,1 asking for a quick call, with a link to a Google Meet.
They don’t fire you though—not yet. Instead, they give you one more chance. Tomorrow morning, they want to see a comprehensive plan for turning the bar around. Show us that you understand what’s not working and you know how to fix it, they say. Otherwise, we’ll do what we’re here to do.
So here’s the question: If you had a day to save your job, where would you go looking for answers? In your comprehensive database of financial indicators, operational KPIs, customer behaviors, and market trends, where every metric and insight is a query away? Or in 25 straight hours of unedited video interviews?
Two things about this situation seem obvious. First, the actual solution to your bar’s problems is in the customer interviews. Your bar probably isn’t failing because of some nuanced secret that’s buried in your inventory logs; your bar is probably failing because customers don’t like it. And you’re much more likely to find the reasons that they don’t like it by listening to 750 people tell you why they don’t like it than you are by spelunking through a database of quantitative behavioral exhaust.
And second, despite that, you’d probably use the data, not the interviews. You have 24 hours! There are 25 hours of videos! You only have time to watch a few! You would be insane—and your board would be insane not to fire you—if you proposed a bunch of changes based on a handful of randomly selected snippets of feedback.2 Important boardroom presentations should be full of charts and scientific analysis, not haphazardly extrapolated pull quotes.
In countless ways, both explicit and implicit, this is the lesson we’ve been taught about business decision-making: Quantitative rigor beats qualitative estimation. We hear this so much that it’s easy for this axiom to blend into another one: Data is objective, and words are anecdotes and hearsay. Databricks deals in facts; Dropbox videos deal in feelings. “Without data,” the saying goes, “you're just another person with an opinion.”
But I’m not sure that that belief—that data is the capital-T Truth—is actually why we’ve come to rely on quantitative over qualitative information. Suppose, for instance, that instead of interviewing 750 people, your friend surveyed 750 people. Each person answered a couple dozen multiple-choice questions about themselves and their feelings about the bar. They rated your drinks on a Likert scale; they chose, from a list of possible options, the main reasons they wouldn’t recommend it to a friend. This information—this data, just as “untruthful” as the interviews, but quantified and structured—would almost certainly be included in your board presentation.
Why? Not because it’s more accurate than the interviews, but because it can be aggregated. Each survey response is just as anecdotal as an interview; the difference is that data can quickly be combined into averaged and summarized forms.
That, I think, is actually what makes data powerful. We can perform mathematical operations on it. A billion data points can be collapsed into a single measure that blends every individual number into one composite. Statisticians don’t manually study each record, pull out a few representative ones, and present them as an approximation of the world.3 No, they average them, so that the opinion of every data point is incorporated precisely in accordance with its importance.
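And that collapsing is mundane to write down, which is exactly the point. A minimal sketch, using a hypothetical transactions table with a tab_total column:

-- Millions of rows collapse into one composite number.
-- Every transaction is counted; nothing is sampled or cherry-picked.
SELECT AVG(tab_total) AS avg_tab
FROM transactions

One query, one number, and every row contributed to it.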
The only way we can describe an entire corpus of videos is by pulling out a few examples—which is imprecise—or by having a researcher meticulously study and summarize them—which is expensive, slow, and subject to the whims, skills, biases, and memory of the researcher. Though the raw material in that Dropbox folder is probably more valuable than the raw material in your Databricks database, we can’t easily mine or manipulate it; we can only sample it. That’s why we instinctively dismiss this sort of information as untrustworthy or biased: Not because it’s wrong, but because there’s no way to look at all of it at once.
In other words, data hasn’t been put on a corporate pedestal because its contents are more valuable than the contents of customer interviews, or support tickets, or online reviews. It’s been put there because, unlike qualitative observations, which can only be seen one at a time, data can be summed into an elephant.
Breaking: Oil companies oversell oil
In this sense, the data industry is an industry built on convenience. Companies have two collections of information—quantitative versus qualitative; numbers versus words; structured versus unstructured. Both are valuable, but only one of those collections is easy to manipulate and aggregate. To warp the analogy about data being oil, quantitative data is accessible oil. Qualitative data has the same potential energy, but is buried deeper in the earth, and requires more advanced technology to refine. So, we built an industry designed to extract the stuff that we can use. And just as energy companies became oil companies, companies in the business of providing insight and intelligence became data companies.
Initially, the focus made sense. Companies should carefully track what they sell, measure how efficiently they operate, and look for useful patterns in the digital footprints of their customers. If you were trying to fix a bar, foundational financial data about how much money the bar makes and loses probably is more important than a bunch of customer interviews. Without those numbers, you wouldn’t even know it was failing, much less how to fix it.
As we’ve gotten better at this basic stuff, the data industry—which has spent years promoting the importance of data-driven cultures, and commissioning studies to show that data-driven companies are more profitable—has been forced to come up with more creative ways to wring meaning from our raw material. We’ve sold people on the idea that they need more data; that they need it to be faster and updated more often; that they need to summarize and analyze it with increasingly exotic math.
But this has diminishing returns. Could you fix the bar with hourly precipitation records, or a complex causal model that correlates demographic data with purchasing behavior, or a multidimensional real-time pricing engine? Maybe? Probably not?4 But even if you could, it would take a lot of energy, money, and expertise.5 From this blog a year ago, another oil analogy:
The data of a mid-sized B2B SaaS product simply doesn’t have the potential energy of Google’s search histories, or of Amazon’s browsing logs. If the latter examples are the new oil, the former is a new peat bog. No matter how good the tools are that clean and analyze it, how skilled the engineers are who are working on it, or how mature the culture is around it, it’ll never burn as hot or as bright…
[W]e assume that there are diamonds buried in our rough data, if only we clean it properly, analyze it effectively, or stuff it through the right YC startup's new tool.
But what if there aren’t? Or what if they’re buried so deep that they’re impractical to extract? What if some data, no matter how clean and complete it is, just isn’t valuable to the business that owns it?
Put differently, though data has clear value, at some point, we’re either scraping the bottom of the barrel for residue, or filling it up with diluted, low-octane backwash. That’s not to say there isn’t any energy left in there, but the effort-to-extracted-value curve is logarithmic. The work won’t always be worth it.
Last year, I thought the solution to this problem was to do simpler things:
I think we have to be more targeted in our ambitions, as both data teams and as data vendors. Focus on proven tooling and use cases—reporting, dare I say decision support—over moonshots. Focus on identifying which few datasets have real value, rather than assuming it’s all of them.
I’m not sure this is right anymore. Instead, what if we went looking for a better reservoir?
Microsoft Wordcel
This query is fake:
SELECT customer_age_group,
  SUMMARIZE(transcript) AS summary
FROM bar_interviews
GROUP BY 1
But it’s not entirely fake. It’s just not a query that you can run. Instead, it’s a question that an executive might hand to a highly-paid research team, and say, “I want to know what different age groups think of our bar.” Weeks and tens of thousands of dollars later, the research team would come back with a presentation with a few slides that summarized what customers said about the bar, and how those things differed by age.
Then, the executive might say, “Ok but how does it change if we just look at people who like the bar? What about the people who dislike it?” The research team would then go back over the transcripts and their notes, and try to figure out why some people in different age groups liked the bar and why some disliked it. They would, very manually and very slowly, “run” this “query”:
SELECT customer_age_group,
  EXTRACT_SENTIMENT(transcript) AS sentiment,
  SUMMARIZE(transcript) AS summary
FROM bar_interviews
GROUP BY 1, 2
But…what if…we could actually run these queries, quickly, on a computer? What if we could aggregate unstructured, qualitative data in the same way we can aggregate structured, quantitative data? What if there were functions that could summarize text in the same way there are functions that average numbers?
With LLMs, there are? In some rough sense, LLMs train themselves on libraries of text, and then regurgitate an approximate average of what they read. They are an aggregate function for unstructured data, capable of producing a composite view of something that could previously be summarized only through human reasoning and research.
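In fact, something close to this is already expressible on Databricks. A hedged sketch, assuming the hypothetical bar_interviews table from above and Databricks’ ai_summarize AI function (in preview at the time of writing); collect_list and array_join are ordinary Spark SQL:

-- Pile up each age group's transcripts, then summarize the pile.
-- Caveat: a big group's concatenated text can overrun the model's
-- context window; a real pipeline would summarize in batches, and
-- then summarize the summaries.
SELECT customer_age_group,
  ai_summarize(array_join(collect_list(transcript), '\n'), 100) AS summary
FROM bar_interviews
GROUP BY 1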
Could they…go even further, and become a form of statistics, but for text? Summarize two populations of transcripts, and find the main differences between the two groups. Compare the distance between those two groups to the distribution of opinions across the entire population, and figure out if the difference between the two groups’ feedback is “significant.” Do this over time, and see how feedback evolves from month to month.
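Even the month-over-month version reads like ordinary SQL. Another hedged sketch, assuming the same hypothetical table plus an interview_date column, and Databricks’ ai_analyze_sentiment function:

-- A crude statistic on text: the share of positive interviews per month.
SELECT date_trunc('month', interview_date) AS month,
  avg(CASE WHEN ai_analyze_sentiment(transcript) = 'positive'
      THEN 1 ELSE 0 END) AS pct_positive
FROM bar_interviews
GROUP BY 1
ORDER BY 1

Whether a month-to-month gap is “significant” would still take a real test, like comparing the observed gap to what you’d get by shuffling transcripts between months, but the aggregation itself is a one-liner.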
Though we take statistics and the ability to perform near-immediate computations on enormous datasets for granted, this wasn’t always possible. There was a point when our methods were basic, and our work had to be done by hand. Is it so hard to imagine that we may one day be able to quickly aggregate and analyze text in a similar way?
The end of Big Structured Data
Two weeks ago, Databricks presented a new roadmap. In it, they did what data companies always do: Made promises about making data even more valuable. Generative AI, they said, would create a “new wave of unified platforms that deeply understand an organization's data.” Automatically generated metadata and semantic models will keep things organized; text-to-SQL capabilities will make asking questions easy; AI agents will automatically optimize the system’s performance.
Sure, fine. Perhaps generative AI can make us smarter by keeping our data cleaner for us, by layering some sort of semantic understanding on top of it, and by performing complex combinatorial calculations to find hidden patterns we couldn’t find ourselves. But I think this is the wrong emphasis, a step in a tired direction, and a waste of LLMs’—and Databricks’!—real potential.
Databricks can warehouse unstructured data. They can train and fine-tune custom LLMs. They even have an AI-powered summarize function. Databricks could actually revolutionize what we can do with data, but it won’t come from flashy marketing that promises to extract a few more drops of energy from the oil wells we’ve been mining for decades. It’ll come from pointing AI at a new well—the unstructured video interviews in Dropbox6—and letting us do the basic things with it that we’ve never been able to do before.
1. Why is there a board? Because it’s a bar, not a $32 billion multinational financial organization.
2. “Trevor loved White Claw, so we’re installing six hard seltzer taps. And Cheryl was from Ann Arbor, so we’re going to rebrand as a University of Michigan bar.”
3. “Gail spent 43 dollars at the bar, and Pavan spent nine. Based on these and the fifteen other transactions we looked at, these seem like our typical customers, so we’re going to report that our average revenue per customer for this quarter is about 25 dollars.”
There’s nothing in that data, but matter and emptiness.*
*This is, of course, a footnote to tell you that Griff is dropping a new song on Wednesday.
4. I mean, sure, a bar is a simple business; to say a bar gets diminishing returns from real-time inventory data isn’t the same as saying Amazon gets diminishing returns from real-time inventory data. Fair. However, Amazon definitely gets diminishing returns somewhere; it’s just a question of when.
5. Moreover, even in huge quantitative companies, data isn’t as prophetic as we might like to imagine. If any business were both willing and able to spend massive amounts of money to extract every possible penny from data, it would be a hedge fund. And yet, according to recent reports, Bridgewater—the world’s largest hedge fund—has “no grand system, no artificial intelligence of any substance, no holy grail” that powers its trades. Instead, its portfolio is governed by “a series of if-then rules” that “dealt simply with trends.” For example, “one such if-then rule was that if interest rates declined in a country, then the currency of that country would depreciate, so [Bridgewater] would bet against the currencies of countries with falling interest rates.”
6. This is also a very lucrative well, apparently.
Ahaha, a few years ago I was a minor stake owner of an ice cream shop and the people actually doing the hard work asked me to help them w/ my "data skillz". But since the other owners were ex-accountants, they could manage their books and costs already and there was honestly nothing data-wise worth doing.
A lot of my success as an analyst probably comes from my social science background, because out of my colleagues, I'm often the first to roll up my sleeves and sit down to hand-code 200+ open-ended text feedback messages in a single afternoon until my brain rots. And it's the bridging of the qual + quant that people like, and that's where stuff works out. Except a lot of people don't like the tedious work involved =\
For a moment I thought you might touch on the mystical and necessary element of belief to direct inquiry.
On a jog, the metaphor of water came to me: in the Oceans of Data, there is a Sea of Information that contains Observable Truth. We can observe by putting out rain buckets (experiments) to collect water for our Data Lake. The circumstances for collecting data are an exercise in belief. In science, it is the belief in a model from which we generate the hypotheses. That is an exercise in judgement. The bronze water we put into the Data Lake has contaminant artifacts of the experiment, including elements of bias in the observer's willingness to perceive.
In the context of your story, the unstructured video data is the qualitative data that leads us closer to the Truth, or Sea of Information, from which we can also glean quantitative data once we have some basis for a model.
And then there are the red herrings, such as Survivorship Bias (https://en.wikipedia.org/wiki/Survivorship_bias). In the patching and reinforcing of the bullet holes from planes that returned (rather than the ones that didn't), perhaps interviewing the target demographics who aren't there would have been more useful, especially if the pool of people interviewed was small or not representative. This is baked into our beliefs about what can be True and our ability to observe.
Others have said this more completely and eloquently. Thank you for the article.
Cheers,
Joe