
The problem is that only checking the outputs gives you no useful information other than that something changed. In order to take any sort of action, you still need to know WHY it changed. Did it change because of a schema update in a database, or because someone input the wrong value in Salesforce, or because there was a business logic change somewhere in the Warehouse? Did it change because of a mistake or a purposeful decision we made as a business? It's sort of like investigating a crime and getting really good at understanding exactly when a victim died, but not how they died or what was happening at the time of death. That could be the difference between murder, an accident, or old age!

In order to report accurate numbers (or even attempt to), an analyst would still always have to go through method #1. It's not a question of whether or not #1 is required - only of how difficult it is going to be and how much information is available.

author

I'd argue that the analogy is a little different, and that we currently don't do a good job of knowing if someone's dead. It's like saying we keep track of people's ages, and whether they're smokers or free climbing or running drugs for a cartel. And then if they aren't, we say, yes, I think that they're alive. And then it turns out they're dead and everyone's upset, and we say, "well, they _shouldn't_ be dead," and our boss says, "yes, but they rather obviously are dead, and it doesn't seem like your actuarial tables are much good, are they?"

Maybe a better way of putting it is, keeping track of inputs - creating tests and PRs and contracts and all of that - is useful for _us_, as people who need to create systems to produce reliable outputs and want that job to be easier. But they're pretty meaningless to the people who use what we make, because they couldn't care less if our tests pass or PRs are thoughtfully approved; they just want the numbers not to change. And so long as we focus on the former at the expense of the latter - or just assume, a la Bill Walsh, that the latter will take care of itself - we're going to keep breaking stuff and people are going to keep being skeptical of it.


I agree that customers only care about things not changing. However, the system you're describing does nothing to address that problem - it only tells you AFTER something has broken. Garbage In / Garbage Out.

Imagine Robinhood was providing embedded dashboards for customers that showed how much money was in their account. Every time a data quality issue occurred they would issue a report with the previous balance and the new balance, letting those customers know what had changed, but never actually resolving the problem or providing any insight into why the change had occurred.

This is the OPPOSITE of what a customer wants. We would be placing the onus on them to deal with our data quality issues. They would have no idea what to even do with that information besides stop trusting the numbers.

Apply this line of thinking to Ford's manufacturing process. While it is really important to have engine checks that let a driver know if something is wrong on the road, it is 100x more important to ensure that the really dangerous things (like brakes not working or a steering wheel jamming) are caught during the production process. Does the customer really care how Ford ensures each vehicle won't explode on the highway? Nope. Is it still critical those checks are there? Absolutely.

This type of quality enforcement is table stakes in software engineering. When I tell engineers about contracts or anomaly detection, they give me weird looks, like, "Wait, you guys aren't doing this already? How do you function?"

author

I'd say three things to that:

1. Again, I disagree with the analogy. What we're doing currently is changing people's Robinhood balance, *and not telling them it changed.* That's my point - we break stuff, and unless we detect it in the manufacturing process, we just ship it out and assume it's good to go.

I actually had a big section in this post that I cut that made a very similar point. It said if a bank kept messing up their customer balances, they'd probably start checking that the balances were right before showing them to customers. Of course the customer doesn't want this to happen, but I'd sure rather my bank say, "hey, our bad, we think this is wrong" than it just show me random numbers and shrug because all their unit tests passed.

2. On the Ford thing, sure, but production problems that absolutely can't break or you die (unless you're in a Tesla, and then we seem cool with it) are different from dashboards and analytical use cases. In the latter case, I'd argue that the more important thing to optimize for is longer-term trust and, as Tristan put it, faith in the institution. I think we'd get a lot further with that by telling people when something was broken - even if we couldn't say exactly why - than we would by having them find it themselves, even if, when they do, we had marginally better explanations of what happened.

3. The other thing is that all of this still ignores the reality that tests on inputs will always be incomplete. That doesn't mean they're not a good thing to have, but you can have lots of tests and still take in a lot of garbage (ie, what's the test that prevents the offset example?).


I think you're misunderstanding my point. Robinhood's goal is to prevent the balance from ever breaking in the first place, so they have many systems in place to protect against it, which is why it almost never happens. If a Robinhood customer's balance were always wrong, their PMs wouldn't say, "Man, we need to prioritize sending out emails every week to let people know how their data changed." They'd say, "Crap. We have a bug and our engineers need to wake up at 3am to go fix it."

As for point #2, that's true. When we talk about things like data contracts, they are absolutely not meant for analytical use cases in 90% of situations. This point really bears repeating. Dashboards generally don't need to be more than directionally correct. Contracts are for data products: accounting pipelines, AI/ML models, customer-facing datasets, and other operational use cases that directly make (or can lose) the business lots of money. In fact, data quality in general is really ONLY for data products, which is where accuracy has a tight correlation with revenue.

Operational use cases of data used to be the main reason companies invested millions of dollars in data architecture. Very clear ROI. Then, because analytics tools were cheap and easy to implement in the cloud, our infrastructure mapped to supporting them as the lowest common denominator, and when companies want to leverage their data for operational purposes later, they aren't able to. The data models that support dashboards and the ones that support ML end up being the same, and that doesn't work.

author

Ah, so this comment put something together for me, which is, for analytical use cases, it doesn't matter what Robinhood does for customer account balances. That problem has very different needs - i.e., it can't be wrong - and it and an exec dashboard are similar enough that the analogy is tempting, but different enough that trying to copy it is actually bad.

This isn't just about data quality, but about how a lot of data teams operate more broadly. There's a reflexive desire to mirror software engineering, even though the problems are very different. And what that ends up doing is encouraging data teams to try to solve broken dashboards like Robinhood solves broken account balances. But those are the wrong solutions, because you don't need it to be right all the time, because you can't guarantee it'll be right all the time, and because it distracts you from better ways to do the job you're actually supposed to be doing.

(On data contracts, I agree, it's production, hence the about-face for me. However, I don't think most people pay that much attention to the distinction, so in practice, I think data contracts contribute to this problem. Dashboard maintainers see broken dashboards, they hear data contracts are for data quality, and they assume they can and should use them to fix their data quality problem.)


Exactly. So what you really need is one environment that maximizes speed, flexibility, and rapid iteration. Then you need a separate environment for 'production' with contracts, quality, and ownership. Every new project starts in the prototyping environment, and once it's ready to 'graduate', there is some way to easily do that. Quality is only enforced in production-world, which makes life way easier for the data engineering team, which today is trying to fix broken data everywhere instead of only where it matters. If you don't really need quality, great! You can still get "hey, your data changed" alerts and everyone is happy.

Terms like Data Warehouse and Data Mart are 40 years old and were literally designed to support operations, but they are slow and require data architects and governance, whereas Data Lakes and the Modern Data Stack are built more for moving really fast and iterating. We're at this awkward transition stage where no one has quite figured out how to do both effectively in their own way.
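
To make that concrete, here's a minimal sketch of the kind of graduation gate I'm describing. Everything in it - the tier names, the checks, the `promote` helper - is hypothetical, not any particular tool's API:

```python
# Hypothetical sketch: quality checks gate promotion from the prototyping
# environment to production, and are only hard-enforced for production datasets.

from dataclasses import dataclass, field
from typing import Callable

Check = Callable[["Dataset"], bool]

@dataclass
class Dataset:
    name: str
    tier: str = "prototype"              # "prototype" or "production"
    checks: list[Check] = field(default_factory=list)

def promote(ds: Dataset) -> None:
    """Graduate a dataset to production only if every contract/quality check passes."""
    failures = [c.__name__ for c in ds.checks if not c(ds)]
    if failures:
        raise ValueError(f"{ds.name} cannot graduate; failing checks: {failures}")
    ds.tier = "production"

def enforce(ds: Dataset) -> None:
    """Prototype datasets just emit 'hey, your data changed' alerts; production ones hard-fail."""
    for check in ds.checks:
        if not check(ds):
            if ds.tier == "production":
                raise ValueError(f"{ds.name}: contract violated by {check.__name__}")
            print(f"FYI: {ds.name} changed ({check.__name__} no longer passes)")
```

The point of the split is in `enforce`: the same checks run everywhere, but only production datasets can block a pipeline.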


Stakeholder trust is rooted in consistency, but is that a bug or feature of the current epistemology? To the other Chad's point, making "what we said yesterday" the benchmark for "what we said today" implies that "the truth," "accuracy" and other attempts at objectivity don't really matter. At that point, why do we even go through heroic / expensive efforts to provide clean and accurate data?

Clearly, dumping raw data to CSV and building dashboards on it wasn't "good enough"; it's how we ended up with the MDS. Maybe the problem is that trust has (at least) two stages:

1) The initial validation of a dataset, where a consumer takes what they know about the business and the source systems and makes a judgment call whether or not the data "looks right." This is time-consuming and painful.

2) The ongoing (frequently daily) re-evaluation, which is almost always in relation to what came before. This stage is a necessary shortcut to trust since it's not feasible to put in the level of effort of step 1.

To the title of your piece, maybe the solution is dataset differentials, contextualized with events: upstream schema changes, upstream semantic changes, changes in operations practices, bugs, pipeline failures, etc. It's very hard to get stakeholders to do step 1 in its entirety more than once; however, evaluating the diffs is a much more manageable task.
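
A bare-bones version of those diffs, before layering on any event context, could be as simple as the sketch below; the metric names, columns, and tolerance are placeholders:

```python
# Hypothetical sketch: diff today's metric values against yesterday's snapshot
# and surface only what moved, so stakeholders re-evaluate the changes, not the world.

import pandas as pd

def metric_diff(yesterday: pd.DataFrame, today: pd.DataFrame,
                key: str = "metric", value: str = "value",
                tolerance: float = 0.01) -> pd.DataFrame:
    """Return metrics whose values changed by more than `tolerance` (relative), or appeared/disappeared."""
    merged = yesterday.merge(today, on=key, suffixes=("_prev", "_curr"), how="outer")
    prev, curr = merged[f"{value}_prev"], merged[f"{value}_curr"]
    merged["rel_change"] = (curr - prev) / prev.abs()
    changed = merged[
        merged["rel_change"].abs().gt(tolerance) | prev.isna() | curr.isna()
    ]
    return changed[[key, f"{value}_prev", f"{value}_curr", "rel_change"]]

# Example: flag anything that moved more than 1% since yesterday's run.
yesterday = pd.DataFrame({"metric": ["arr", "new_customers"], "value": [12_000_000, 340]})
today = pd.DataFrame({"metric": ["arr", "new_customers"], "value": [11_400_000, 340]})
print(metric_diff(yesterday, today))
```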

author

Your proposed solution is more or less what I want, though I don't think all of the contextualized events are necessary. Helpful, for sure, but a secondary problem to knowing if something has changed. (And it's more or less what Datafold does, I believe, though it's framed around PRs and code edits rather than just general changes.)

As for the philosophical question about "truth" not mattering, I'd say sure, it does (save footnote 5), but that's a really hard thing to measure, and something not getting revised is probably as good of a measure as anything.


I agree you should test for consistency before correctness on the outputs (much easier to tell if something is "off"), and agree that always going through the steps in #1 is flawed to a certain extent, as few companies can afford to chase down every DQ issue.

I'd ideally want to test for correctness mid-pipeline with basic circuit breakers as well, though, as few engineers want to wait hours to find out that no source data exists for today's date or that it's horribly duplicated.
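
A circuit breaker like that doesn't have to be fancy; run right after ingestion, something like this sketch (the table, column names, and connection are all hypothetical) would catch both of those cases before hours of downstream work kick off:

```python
# Hypothetical sketch: fail the pipeline early if today's source partition is
# missing or heavily duplicated, instead of discovering it hours later downstream.

import datetime as dt
import sqlite3  # stand-in for whatever warehouse connection you actually use

def circuit_breaker(conn, table: str, max_dup_ratio: float = 0.01) -> None:
    today = dt.date.today().isoformat()
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT id) FROM {table} WHERE load_date = ?",
        (today,),
    ).fetchone()

    if total == 0:
        raise RuntimeError(f"{table}: no source rows for {today}; halting pipeline")
    dup_ratio = 1 - distinct / total
    if dup_ratio > max_dup_ratio:
        raise RuntimeError(f"{table}: {dup_ratio:.1%} duplicate rows; halting pipeline")

# e.g. circuit_breaker(sqlite3.connect("warehouse.db"), "raw_orders")
```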

Also, testing consistency by automated means can be expensive and/or painful: you want to test consistency before production, but security doesn't want live data outside production. You can anonymize the data, but that might affect the tests, and it can still be a lot of effort to get signed off by security. So you end up building a staging environment (if allowed to!) at extra cost.

I feel it all comes back to the classic tradeoff of availability vs. correctness: how much of our effort (and money) do we spend on availability (making new metrics and dashboards) vs. on being correct?

author

On your last question, I think I'd tilt very heavily towards something like being correct, though where correct is defined as being trusted. To the question Tristan was asking in his post, if people don't trust it, not much else matters. But the good news - and mostly what I wanted to say in this post - is that you don't have to be exactly correct to be trusted; you just have to be consistent. And that's probably an easier bar.


First, I think confidence in what we believe to be true is a very difficult epistemological problem. Achieving a consensual view of our systems which everyone agrees is the true state of affairs is not easy. It is not merely a technical problem. Consider the political arena today: have we achieved consensus on what the true state of affairs is? Far from it. Everyone sees different data, uses different definitions, applies different logic.

The best we can do is reconcile to as many sources as possible (which means, identifying divergences in data, definitions and logic). Just as the Internet behind the scenes is "constantly falling apart", and we solve it through error-checking (checksums, TCP retries), we should do the same with data. Its validity is constantly being tested, and we are accordingly constantly building systems to shore up our confidence in these metrics. It is not a simple problem; it is one of the main goals of a data infrastructure.

We should reconcile to historical outputs, where an error would imply some sort of regression. But we should also reconcile with other systems (such as third-party, procured datasets), as well as with practitioner intuition ("does this figure look right to you?"). We will never escape reconciliation, not in data, nor in human affairs more generally. We are constantly inquiring of others: "Why do you believe that? What data or logic have you applied that I am missing?" What we do in data infrastructure is no different - it is simply encoded as logical tests.
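
Encoded as a logical test, that kind of reconciliation might look roughly like this sketch (the sources and the tolerance are placeholders, not any particular system's figures):

```python
# Hypothetical sketch: reconcile one figure across several sources - yesterday's
# reported value, the warehouse, and a third-party system - within a tolerance.

def reconcile(figures: dict[str, float], tolerance: float = 0.005) -> list[str]:
    """Return a list of disagreements between any pair of sources."""
    problems = []
    sources = list(figures.items())
    for i, (name_a, val_a) in enumerate(sources):
        for name_b, val_b in sources[i + 1:]:
            baseline = max(abs(val_a), abs(val_b)) or 1.0
            if abs(val_a - val_b) / baseline > tolerance:
                problems.append(f"{name_a} ({val_a:,.0f}) != {name_b} ({val_b:,.0f})")
    return problems

# Example: ARR as reported yesterday, in the warehouse today, and in Salesforce.
print(reconcile({
    "yesterday": 12_000_000,
    "warehouse": 12_010_000,
    "salesforce": 11_400_000,
}))
```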

author

Sure; philosophically, I don't really disagree with any of that. (And clearly, I have some kind of fun talking about this stuff, given that I seem to devote some portion of my life to this sort of thing.)

However, as a practical matter, we need some sort of black-and-white logical tests. As a data team, I can't very well tell a CEO that I don't know what our ARR is because I'm in a perpetual state of inquiry and reconciliation, and that we can never know what our ARR is because there is no consensus on the true state of affairs. True as that may be, it'd probably get me fired to say. Which is why, if we have to apply a logical test, my vote would be, "Is it the same as it was?" It's not perfect, but it seems better than any other such rule.


True, I would call that a single reconciliation test (does the present reconcile with the past?), and the more the better. If the CEO then checks Salesforce or some on-platform total sales figure or what is reported by finance and just one doesn't reconcile, they will instantly have doubts. Basically, every possible system they check (and person they ask) must reconcile.

author

That's roughly the idea to me. If people are going to do this anyway, and make decisions on what they trust that way, shouldn't we try to do more of that preemptively? I don't think that necessarily means everything has to exactly reconcile and all of that - people understand that things can change, that systems aren't perfect, etc - but we have to at least think the way customers think. And that's not by double-checking inputs.


Benn, your approach is the warrior’s approach.

Screw this notion of waiting for consensus. Lead the consensus in the gladiators’ ring. As you are.

#SpiceTradeAsia_Prompts
