Will we ever have clean data?

Jul 28, 2023

Probably not, but maybe we can work with messy data.

6 Comments

Jul 29, 2023

"Well, it’s sorta like that. dbt models reference each in a much cruder and more brittle way than proper ...functions. They can't take arbitrary inputs, and you can only reuse outputs, not the functions themselves. For example, you can’t create a function for adding sessions to an event stream, and then reuse that same transformational logic on several different tables"

You may be alluding to this (or not) but your article seems to be in the same direction as my own thoughts in that there is an unnatural bifurcation for evaluating data at runtime (in apps, orchestrators etc.) and data at rest ( in sql engines etc) and the resulting various ways we model it. I think if we lost that distinction we'd be better off addressing the messiness you describe.

Expand full comment

Reply (1)

Benn Stancil

Jul 30, 2023

I was not thinking of that. And I'm not sure I understand why that creates weird effects?

Expand full comment

Alex

Jul 28, 2023Edited

"A high-quality dataset is one that is consistent with the business concept it represents... those business concepts are often fluid."

Great point.

Data correctness is extraordinarily difficult, and most of the time, not worth it. Do you need 100% accurate data to make decisions? What sort of decisions are you making that hinge on a 2.4% conversion rate versus 2.6%?

Of course, you don't want to mistake a 2.4% conversion rate for 240%. We want data to be "mostly correct", but not "perfectly correct" because this is an unrealistic goal that wastes tremendous time and effort. A single unaccounted-for record will jeopardize this utopia.

In practice, people want correctness for the most important figures, just like they want clear definitions for the most important business concepts. For the other 90%, people want things to be "mostly right" and "mostly clear", but some ambiguity is inevitable. As we all know, there is a velocity-quality tradeoff, and if we vigorously pursue the elusive quality at all costs, we will correspondingly sacrifice velocity.

In my experience, business teams generally prefer fast and approximate over slow and perfect. (It is doubly painful when the "slow and perfect" team ends up producing mistakes as well, just like everyone does.)

Expand full comment

Reply (1)

Benn Stancil

Jul 30, 2023

I think that's mostly true, that directional data is usually fine. But I think that saves us less than it might appear, for a couple reasons:

- When data "breaks," it doesn't break randomly. Instead, we misreport renewals of a specific type, or we don't log things from an Android, or whatever. That's a lot worse than a few random things being off, because it's exactly the sort of thing that could create some belief - that type of renewal is great, android users aren't valuable - that's wrong.

- If people find something that's off, they (reasonably) stop trusting the whole thing. "yes I know that the customer you were looking for in this dataset is missing, but trust me, all the rest are right" may be correct, but it's not a very confidence-inspiring thing for us to say.

Expand full comment

Amit

Jul 28, 2023

Check out sqlmesh and sqlglot. Similar concept and already have fairly large community including usage at some large companies: https://sqlmesh.com/ (I'm not affiliated)

Expand full comment

Reply (1)

Benn Stancil

Jul 28, 2023

Yeah, I've seen them, though I see it as that's similar, with one important difference: All of the lineage type things they do are built into sqlmesh, rather than being derived from the query directly. That's a little less interesting to me, because I suspect it tempts sqlmesh to start pushing more on their language, rather than trying to offer a way to understand what pure sql does. Maybe that doesn't happen, but if it does, a proprietary data modeling language isn't terribly novel.

Expand full comment

benn.substack

Will we ever have clean data?