65 Comments

Ahaha, a few years ago I owned a minor stake in an ice cream shop, and the people actually doing the hard work asked me to help them w/ my "data skillz". But since the other owners were ex-accountants, they could already manage their books and costs, and there was honestly nothing data-wise worth doing.

A lot of my success as an analyst probably comes from my social science background, because out of my colleagues, I'm often the first to roll up my sleeves and sit down to hand-code 200+ open-ended text feedback messages in a single afternoon until my brain rots. And it's the bridging of the qual + quant that people value, and stuff works out. Except a lot of people don't like the tedious work involved =\


That brings up a really interesting (and maybe damning, for data) thing to me. Like, imagine that someone could actually write queries like "summarize feedback" and just get results. So you've got some KPI dashboards and some summaries from those text feedback messages, and you're trying to figure out what to do next.

I think...80%?...of the decision-makers' questions would be about the text stuff? What are people saying about this? Like, every business person says "you can look at data all you want, but you have to know your customer." Everyone kind of acknowledges that's where most of the decision-making power is; it's just historically been brain-rotting and tedious work to get to.


I think the usual way we juggle this quant vs. qual thing is that qual studies are just micro-samples... but the math still works out that if 5 of the last 7 people you talked to ALL mention the same issue, you have a really big issue. Even more strongly so if everyone mentioned it without any prompting. This technique won't let you optimize at the far margins, where humans suck at articulating complicated relationships.

So maybe the new generative AI stuff that does seem to do reasonably okay-ish at "summarize this" can help knock that tedium down. Which means we quant folks had best find other ways to be useful.
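To put a rough number on the 5-of-7 point above, here's a quick back-of-the-envelope in Python (the ~10% base rate is just a made-up assumption for illustration):

from math import comb

def binom_tail(n, k, p):
    # P(X >= k) for X ~ Binomial(n, p): the chance that k or more of n
    # independent interviewees would mention the issue by coincidence
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If the issue only actually affected ~10% of customers, hearing it
# unprompted from 5 or more of 7 people would be wildly unlikely:
print(binom_tail(7, 5, 0.10))  # ~0.00018

Tiny samples, but the arithmetic behind "everyone keeps mentioning this" is surprisingly strong.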


That seems right to me. Data collection is still a problem, but if that was normalized a bit (instead of an NPS survey, click a button to record a 30-second audio review and get paid $10 immediately), it feels like you could almost start to run those sorts of qual studies at scale. And if you can do that, it seems like a lot of the "reading the tea leaves" elements that are currently inherent in that sort of work would go away.


A hypothetical "summarize these 50k open-ended responses" machine would be really interesting... and I'm pretty sure a number of people are testing out the idea right now... My suspicion is that we're going to find a new boundary where "just ask people" methodology fails and you need the structure of quantitative count data... gonna think on this one.


Fair, I could imagine things getting kinda weird if anything like this actually worked.


We have actually built this machine :)

Flexor.ai


What does "delivering textual data into the hands of every data practitioner" mean though?


Now imagine being able to do this at global scale, across virtually all human domains. For example, imagine having access to TikTok's data warehouse.

That aside, I propose that your "magical query" technique has a lot more utility than you realize... it is possible to think in this form by simply starting off by *consciously* invoking an omniscient Oracle. In a sense, this is almost exactly what human consciousness is composed of, the difference being that with consciousness, the omniscient Oracle is the subconscious mind, which, unlike the conscious mind, one has no control over (including the Oracle within it).

Now imagine building a cluster of networked humans that can think in this *and other* advanced forms. Then, imagine what one could do with this power, which one could focus like a laser on any issue (say: war, economics, democracy, geopolitics, etc).


That omniscient Oracle seems like basically what GPT is becoming? It obviously struggles with some stuff and isn't omniscient in a lot of ways and all of that, but it's basically 1) a giant database of everything ever written for which you can 2) write kind of generic queries like

SELECT summarize(themes) FROM books WHERE author = 'shakespeare'

It's not literally that, and you can't be that precise, and all those sorts of things, but if you squint, that seems roughly how it works?


> That omniscient Oracle seems like basically what GPT is becoming?

*Kiiiinda*....but this is a bit different than what I am thinking.

Regardless: ChatGPT and others will be what they are, and ~everyone will have access to "it" (which overlooks that some people will have access to non-neutered versions, in addition to entirely novel models not available to the public).

And even in the best case, where these things really do turn out to be highly beneficial to humanity, everyone overlooks one problem (the main one): we are still stuck with all of the biological LLMs running loose on the planet, and if one thinks that a (neutered) silicon-based LLM will be able to coordinate these maniacs (especially when some of them secretly have their finger on the scale(s)), I think they are going to be severely disappointed.

> SELECT summarize(themes) FROM books WHERE author = 'shakespeare'

> It's not literally that, and you can't be that precise, and all those sorts of things, but if you squint, that seems roughly how it works?

Very much agree... but you only get what it has to give... and what it has to give is a function of both what it was designed to give and what it is allowed to give (I assume you realize reps from the various three-letter agencies will be well embedded within OpenAI in some manner by this point - it would be dereliction of (mostly undocumented) duty to do otherwise).

Having all of that power, plus something similarly powerful (that also addresses the biological AI problem, and is beyond the control of bad actors) seems like basic prudent gameplay strategy to me. God knows humanity needs someone on their side for a change.


For a moment I thought you might touch on the mystical and necessary element of belief to direct inquiry.

On a jog the metaphor of water came to me: in the Oceans of Data, there is a Sea of Information that contains Observable Truth. We can observe by putting out rain buckets (experiments) to collect water for our Data Lake. The circumstances for collecting data are an exercise in belief. In science it is the belief in a model from which we generate the hypotheses. That is an exercise in judgement. The bronze water we put into the Data Lake has contaminant artifacts of the experiment, including elements of bias in the observer's willingness to perceive.

In the context of your story, the unstructured video data is the qualitative data, which leads us closer to the Truth or Sea of Information, and from which we can also glean quantitative data once we have some basis for a model.

And then there are the red herrings such as Survivorship Bias (https://en.wikipedia.org/wiki/Survivorship_bias). In the patching and reinforcing of the bullet holes on planes that returned (rather than the ones that didn't return), perhaps interviewing the target demographics who aren't there would have been more useful, especially if the pool of people interviewed was small or not representative. This is baked into our belief about what can be True and our ability to observe.

Others have said this more completely and eloquently. Thank you for the article.

Cheers,

Joe


Yeah, so there are two things about this that I think are particularly interesting. First, on things like survivorship bias (and sampling biases, and all that), a lot of qualitative research is particularly vulnerable to them, because the samples are necessarily pretty small. But I do wonder: if people could do this sort of qualitative research on much, much larger samples, could you solve a lot of those problems? You couldn't entirely, obviously, as those biases still very much exist in data problems too. But it's somewhat of a different class of problem.

And second, for that same reason (samples are small), our default assumption tends to be that qualitative data needs to lead to quantitative data, which is where the truth is. But I'm not sure that's not just a really strong association we've created. Qualitative data is typically small samples; quantitative data isn't; so we assume the latter is more true. But if the samples of the first were bigger, I'm not sure we wouldn't say it's just as "true."


Both of your points lead to a sort of chicken-and-the-egg, which-came-first question. Most definitions of Qualitative are vague to the point of uselessness, so let's use Qualitative Chemistry vs. Quantitative Chemistry. Qualitative Chem is the identification of What a compound is. Quantitative is How much do we have. I don't see any reason that we cannot Quantify Qualitative data. It is a different inquiry for Truth. To me this is more along the lines of teasing out cause-and-effect rather than the quantitative mechanics of the observable signal.

The LLM and the web may be a way to do cursory Qualitative research, per your notion of wider Qualitative data. How to normalize against the various self-deceptive cognitive biases would be fun. Again, it requires that initial inquiry, the "hunch," i.e., belief, in order to start the search and to see it. Confirmation Bias manifest. =) Testing the Null Hypothesis becomes a bit more fuzzy, but it seems tractable.

Onward.

Cheers,

Joe


One point I suppose I should make is that that really just applies to, like, making business decisions or "soft" sciences like that. If you're trying to decide, "What is the best thing for this bar?," I think there is probably more truth - i.e., information that reveals the best possible thing to do - in the interviews than in most of the data. But if you're doing real science like chemistry, the quantitative stuff is what matters.


typo here?: "unlike quantitative observations, which can only be seen one at a time" should probably be "qualitative"

Quantitative is just aggregate qualitative. If you have a red apple and an orange you have two different qualities; but if you have different quantities of apples and oranges you can start to compare them--or at least make a nice pie chart ;-)

The encoding and "sample rate" chosen are what determine which categories (or bins) are in the survey. That's why it is important both to have open-ended surveys and to validate/verify hypotheses by encoding qualitative findings into a quantitative survey.


Ah, nice, good catch. Words are hard.

I'm not sure I agree about the qualitative stuff though. I said this in another comment, but if you had two pictures of two different apples, you wouldn't need to classify either picture as anything to blend them together. You could just...smash them together. But it's not literally just mixing up the pixels; it's using a whole bunch of other pictures to figure out how you can sensibly interpolate between the two.

That's basically what humans do. If we're asked to summarize ten support tickets, we compare the tickets to a bunch of other language we're familiar with, figure out what in those tickets is unique, and then kinda jumble it all up, where we highlight those unique things, bridging the language in the tickets with patterns from more general language so that it all makes sense. We can do that without ever using any sort of explicit or implicit qualitative encodings. We just...kinda average the language.


even math is qualitative though... we have rational, irrational, and imaginary numbers just to name a few. Yes, it might make sense to combine these sometimes but deciding whether to combine them or not means that we have some kind of methodology or decision system which helps us decide whether or not things are similar enough to be aggregated or whether they should be grouped into a different category.

"Average" might have different definitions (geometric mean, etc.), but the concept of averaging is a well-defined encoding.

Yes, it is possible to combine things without having a methodology, but whatever is doing the combining is still biased by accessibility and proximity - and a system without constraints is likely to approximate cosmic latte. To be useful, any AI system needs to group things into matrices rather than a single float value. Each matrix index is a bin.


Sure, the various models are all math, and all that. And there are different ways to do that math, whether the math is some LLM black-box magic or just math math. But a bunch of crazy matrix math with tons of parameters is very different than something that tries to categorize a thing by color or some other human-understandable definition. If you add a billion bins, is that really binning?


Yes, it is still binning, but it might not be an interesting type of binning--it might just be an implementation detail of a higher-level system. Similar to how UUIDs are not business keys but they *might* be in a 1:1 relationship with something interesting. If a matrix is just storage, how much noise is being stored depends on the alignment with sample rate and SNR. I say implementation detail because there might be redundancies in storage, similar to how there are many synapses in humans which duplicate information... So there might be 10 bins which equal 1 category of things, or 100,000 bins which point to the same thing. Do these additional bins add value? It depends. A camera might have 100 megapixels, but if the lens isn't focused or uniform across the array then you'll get a lot of blurry or redundant information. But all of this is not really my main point.

You could have 1,000 surveys with many different quantitative prompts to try to understand why people don't like a bar or how you can get people to like it more, but you can't send all of these surveys. Getting feedback has a cost, so how do you know the right survey to send, the right questions to ask? If you don't do a survey with qualitative prompts first, then you are biased towards your ideas about what a "bar survey" should look like, or you presuppose all the things wrong with the business. Maybe it is useful to validate against an existing hypothesis, but maybe it's not.

I'm not disagreeing with the conclusions of your article but I'm curious how you came to those conclusions if you don't believe that quantitative encoding (counting, aggregation, modeling) builds off of qualitative selection.


I think that I would agree that quant analysis (either surveys or analysis of behavioral stuff) is often best built on qual inputs (surveys, feedback, some instincts like, "this bar smells weird to me, personally"). That's the usual way this seems to work - develop a question with qual; analyze with quant; sometimes, confirm and get the "why" with qual.

My question is more, do we actually need the quant analysis? That rough arc has become so ingrained that I think we assume quant is necessary, but I'm not sure we don't just do it because of the limitations of qual: We can't do it at scale, can't do math on it, it's mostly anecdotal, etc. But if you could capture 10-100x more qual feedback and aggregate it in reliable ways, it seems like you could make a lot of decisions without the quant part.


Oh wow, this difference thing is pretty cool. I do wish it was easier to do (and didn't have to be sampled like that), but seems like a really good start.


Yes - I think lots of people would have an appetite for Gong without guardrails - more control with a SQL-like language and functions, etc. Did we just come up with a Warehouse Native Gong competitor? 😁

It would be awesome to say - highlight the most unusual review - or show me a clustering around key topics. Like a k-means cluster style.

It’s also interesting to give the LLM the full context on things like this and see what it could do. Like - I’m trying to make X product better. Here is what it does now - go listen to all these interviews and make it better based on your knowledge and these interviews.


Ooh, that's interesting too - find me more examples of tickets like this one. And then you could mix it in with more traditional analysis, where you take that set and figure out quantitative characteristics, etc.


Your "dropbox" application is a very exciting application of LLMs for me. While maybe they don't use LLMs - yet - my favorite similar example of this is Gong.io. I think they do an excellent job of helping you mine useful info from recorded calls.

However - taking this straight to SQL via Databricks, BigQuery, Snowflake, etc. would be awesome. I could actually see analysts using functions like sentiment(), summarize(), etc. I hope this promotes more customer interviews and a better cycle of getting customer feedback incorporated into products, resulting in better products. However, I wonder what nuances these systems will miss that humans would have picked up on via manual review of interviews. Back to the Gong.io example - searching keywords and manually reviewing is my favorite Gong.io workflow, but maybe LLMs will take this to the next level.


*I realize I mentioned gong a lot... note I don't have any association with gong or make any money as an affiliate of gong. 🙃


So that example makes me think of two things:

1. I could see that actually being somewhat analogous to how data work evolved? It's not a perfect fit, but in the early days of BI, you were somewhat restricted in what you could do; it wasn't really open-ended calculations, and especially not on raw data. But, as the tech evolved and improved, we could do that - at first, slowly, on small datasets, and then on increasingly bigger ones. I could see the same thing here, where tools like Gong do it with guardrails; then, we can do it manually on small stuff; then, bigger stuff.

2. I agree that the individual examples are really useful, though it seems like we could actually do that with LLMs too? This is maybe a wilder idea, but if "summarize" is like an "average," could there be a "median?" Take these 100 tickets, and find me the one that best captures the overall themes in the full dataset. Or, like, MIN and MAX? Of these 100 reviews, find the 10 that are most unusual? I don't know exactly how this might work (or if an LLM could do anything remotely like it), but I'm sure if you gave researchers 1,000 video interviews and said "What can we do to help you make sense of this?," they'd come up with a bunch of interesting "functions."
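To make the "median" and "most unusual" idea in point 2 slightly more concrete, here's a rough sketch (the function name and approach are just illustrative), using TF-IDF vectors as a cheap stand-in for proper LLM embeddings:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def median_and_outliers(docs, n_outliers=10):
    # Vectorize each document (TF-IDF here; a real system might use LLM embeddings)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    centroid = np.asarray(vectors.mean(axis=0))
    # How far each document sits from the "average" document
    dists = cosine_distances(vectors, centroid).ravel()
    median_doc = docs[int(dists.argmin())]  # closest to the centroid: a "MEDIAN(text)"
    unusual = [docs[i] for i in dists.argsort()[::-1][:n_outliers]]  # farthest away: "most unusual"
    return median_doc, unusual

So something like median_and_outliers(tickets) would hand back the single most representative ticket plus the ten strangest ones - not new text, just the customers' own words, picked by where they sit relative to everything else.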


So well said, a typical trap to fall into. Thank you for the story!


This runs into a question of categorization. Let's say you have 1,000 surveys and decide to summarize them; in that case you can imagine "summarize" as something like "apply a topic model to the surveys" (https://en.wikipedia.org/wiki/Topic_model). (I'm using this because it's literally something that exists already, and so it's easier to know the strengths and flaws than if we just say "LLM" and assume it is magic.)

The problem though is that unless these categories are fixed in advance (in which case, this is already getting pretty close to a categorical metric) the results could shift wildly if you add 200 more surveys, and run the model again. The combination structure can just continually veer in a wildly different direction. (and we may expect that with an LLM as well)
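For anyone who wants to see that failure mode concretely, here's a minimal sketch with scikit-learn's LDA standing in for "summarize" (the "surveys" list is hypothetical); fit it before and after 200 extra responses arrive and compare the top words per topic - nothing pins the two runs to the same categories:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_words_per_topic(docs, n_topics=5, n_words=8):
    # "summarize" as a topic model: word counts -> LDA -> top words per topic
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[::-1][:n_words]] for topic in lda.components_]

# topics_before = top_words_per_topic(surveys[:1000])
# topics_after = top_words_per_topic(surveys[:1200])  # 200 more responses
# Nothing guarantees these line up: topics can merge, split, or reshuffle between runs.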

And then, you may start to have natural questions if there are cohort differences. So, you might say "summarize (first 1000)" & "summarize (last 200)", and then that will provide a 3rd type of summary. And it becomes a judgment call whether you need 1, 2, or 3 summaries to really make sense of the problem.

And hopefully this clarifies where the problem goes. The problem with "Data" has never really been tools or numbers, but complexity management. The analyst is SUPPOSED to help the organization manage cognitive load, by hierarchicalizing information, and highlighting it in the right contexts. (Whether this happens appropriately or not is a separate question)

And if you've ever presented a data-backed story to an executive, one of the pain points that really comes up is that this person really needs you to pre-process the complexity for them, to act as a tour guide for how to approach the world. And I don't think this is new; even the forerunners of analytics, like the consultants and industrial engineers, were just digging into a finer grain of problem to help optimize a situation.


I think I'd have questions about that:

1. It makes sense to me that doing "math" on text (be it hand-wavy LLM magic or topic models or older sentiment analysis stuff or whatever) would have some weird nuances like what you're describing - but as you said, the same seems like it's true of math on numbers? I don't think either number math or text "math" makes the conclusions obvious, and there's still a lot of human interpretation to be done in both cases. But it seems to me that number math has made working with lots of numbers a lot easier, so that we're at least able to transform the numbers into something more digestible. Not conclusive, not on its own, but digestible. My question is if we can do the same with lots of text.

2. The even more hand-wavy thought is that, while the text "math" might have these weird properties (e.g., summarize(1000) + summarize(200) makes a whole new thing), that wouldn't be the case for human researchers. If you gave 20 interviews to people and said, "tell me what is important," they'd probably say pretty much the same thing if you gave them that 20 plus 5 more. Unless those 5 are really different, or make some trend obvious, in which case they might say something new - but then, that seems like the right decision? Ultimately, I think that's my thought here though: Can LLMs approximate the same results as what people would do? And if they can, that seems kinda "right?"


1) We're in the same space. Where I get concerned is that when I think about most of the "math on numbers" problems, a very large % of them are descriptive, and they work in obvious ways. So, add a high number, and the average changes.

The text-math examples are more complex modeling problems where the inflection points of change are less obvious. I'm in agreement that digestibility is likely possible. I disagree with the idea of turning it into a metric.

2) "If you gave 20 interviews to people and said, "tell me what is important," they'd probably say pretty much the same thing if you gave them that 20 plus 5 more." And there may be ways of doing this with types of training for these models, as in "retain X clusters but add Y new variables".

I just know that with people, they're doing a thoughtful trade-off evaluation on their clustering approach. Maybe a ChatGPT will just have the "good enough clustering," I don't know? My understanding of topic modeling is that there are several different types of approaches, and that it's still domain-specific (as in, the type of solution will need to match the type of problem). If reality strictly works a certain way, the same topics will always show up. However, I think this is more model-like, and less like standard descriptive analysis, at this point in time. As in, a bit more "fiddly" & "hand-wavy" than the comparison set of objective numerical metrics.

Maybe I'm off-base, or I'm being thrown off by cursory research into topic models a few months back. I can (in theory) see a company getting used to this approach, but it's not obvious.


1) Sure, that's fair. It's definitely not precise. You couldn't do anything properly scientific in this way I don't think. It'd really have to be more like humanities research, where there's not only variability, but sometimes outright disagreement. (Though as I say that, I do wonder if there'd be some sort of rough "central limit theorem" with this, where if you have large enough samples, every model built in broadly similar ways would converge-ish. But who knows.)

2) I could see "different models do different things" also being related to the topic modeling stuff you're describing. Even if LLMs (and AI generally) weren't fundamentally probabilistic, you can always get different results by asking questions in slightly different ways, training the models differently, using slightly different models, and so on. So even if one company had a standard approach for how they do it, it's almost more cultural. The research analogy might still work there: Give 20 interviews to one research team; they'll give you X back. Give 20+5 to the same team, you'll probably get X-ish. Give 20 to a different team, who knows? You could get something entirely different.


This is a lovely shift of perspective. Technically, I think we can solve it with just a few lines of code, up to reasonably sized text input. Let's just assume you found a way to dump your raw text into a data warehouse; then you can define a user-defined function that calls an LLM for a summary. (This is supported by most big data warehouses.)

Then write a query like this:

select d1, d2, user_defined_summarize_function(array_agg(text_column)) as summary_column
from table
group by d1, d2

If this needed to scale, summarize the summaries by gradually increasing the aggregation level, e.g., via a few CTEs. This could look as follows:

with summaries_3d as (
  select d1, d2, d3, user_defined_summarize_function(array_agg(text_column)) as summary_column
  from table
  group by d1, d2, d3
)
select d1, user_defined_summarize_function(array_agg(summary_column)) as summary_column
from summaries_3d
group by d1

Of course this does not feel as native as LLM-based aggregate functions, but I would assume and hope it's just a matter of time until that becomes available.


Yeah, it occurred to me that you could do something like that, which seems like it's mostly work? But it seems like there'd be a bunch of other interesting things you could do (in theory, anyway) if it were real LLM-type stuff. For example, is there a MEDIAN(text) function, where you could say, look at these 100 pieces of feedback and find the most representative one? I don't want something new, I want the exact words a customer said. But I want the one that best captures the overall themes of all 100.

Is that useful? I don't know. But I think if people tried stuff like this for a bit, we'd definitely find some useful functions like that.


Yes in retrospect, I've learned Halloween is for kids


“Michigan University”? As an MSU grad, I like it.


If you beat an undefeated Ohio State team, I guess you can get away with it.


They are doing this against saved verbal conversations, at scale, and developing their own algorithms to interpret them, then feeding suggestions for improvement back to clients. I loved the comment their CMO made: "Surveys only capture those who love you or hate you, not much in between."


Ah interesting, that seems kinda cool. And yeah, I was talking to someone else about this - if you can more easily use this sort of information, it seems like companies would find lots of ways to collect a lot more of it, and in a way that avoids that "love you or hate you" problem.


PS. Running a bar is starting to sound a lot more satisfying than doing data work.


Yes, so long as your bar doesn't actually have a board of directors.


My bar will be as un-corporate as it can possibly get.


You do you; my favorite bar is Alcohol & Co. Ventures, LLC.


You'll still be welcome at mine, Corp Bro.


Thanks, Salt of the Earth Man.


Really enjoyed this one. If AI can eventually do that, it'd be really valuable.


This is where I think LLMs could (could?) go a lot further though. There has been NLP-ish tech that can do sentiment analysis or classification for a while, but to me that's more akin to turning unstructured data into structured data. Can we bypass that step entirely, and run operations directly on top of the unstructured stuff?

It's like the AI things that generate composite images. They don't quantify various features from two pictures and say, now make a picture that has the average of these measures. They just...smash the two together. Which...seems like a wild thing to do with text, but also seems like it basically describes how LLMs work?


🔥 the Daniel Plainview reference 🔥

I dressed as him once for Halloween, caught a few fans with the bloody bowling pin


"Mom there's a crazy guy out here bludgeoning kids with a bowling pin so that he take all the candy in the neighborhood"
