Discussion about this post

Chad Sanderson:

The problem is that only checking the outputs gives you no useful information other than that something changed. In order to take any sort of action, you still need to know WHY it changed. Did it change because of a schema update in a database, or because someone input the wrong value in Salesforce, or because there was a business logic change somewhere in the Warehouse? Did it change because of a mistake or a purposeful decision we made as a business? It's sort of like investigating a crime and getting really good at pinpointing exactly when the victim died, but not how they died or what was happening at the time of death. That could be the difference between murder, an accident, and old age!

In order to report accurate numbers (or even attempt to), an analyst would still always have to go through method #1. It's not a question of whether #1 is required - only how difficult it's going to be and how much information is available.

Chad Isenberg:

Stakeholder trust is rooted in consistency, but is that a bug or a feature of the current epistemology? To the other Chad's point, making "what we said yesterday" the benchmark for "what we said today" implies that "the truth," "accuracy," and other attempts at objectivity don't really matter. At that point, why do we even go through heroic / expensive efforts to provide clean and accurate data?

Clearly, dumping raw data to CSV and building dashboards on it wasn't "good enough"; it's how we ended up with the MDS. Maybe the problem is that trust has (at least) two stages:

1) The initial validation of a dataset, where a consumer takes what they know about the business and the source systems and makes a judgment call about whether the data "looks right." This is time-consuming and painful.

2) The ongoing (frequently daily) re-evaluation, which is almost always in relation to what came before. This stage is a necessary shortcut to trust, since it's not feasible to repeat the level of effort from step 1.

To the title of your piece, maybe the solution is dataset differentials, contextualized with events: upstream schema changes, upstream semantic changes, changes in operations practices, bugs, pipeline failures, etc. It's very hard to get stakeholders to do step 1 in its entirety more than once; evaluating the diffs, however, is a much more manageable task.
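
A minimal sketch of what such a contextualized diff could look like, in Python. Everything here is an assumption invented for illustration: the metric snapshots, the event log, the field names, and the 5% tolerance are not from the post, just one way the idea might be wired up.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical upstream-event record; the kinds mirror the examples in the
# comment (schema changes, semantic changes, pipeline failures, etc.).
@dataclass
class UpstreamEvent:
    occurred_on: date
    kind: str          # e.g. "schema_change", "semantic_change", "pipeline_failure"
    description: str

def diff_metrics(yesterday: dict[str, float],
                 today: dict[str, float],
                 events: list[UpstreamEvent],
                 as_of: date,
                 tolerance: float = 0.05) -> list[str]:
    """Flag metrics that moved more than `tolerance` (relative change) and
    attach any same-day upstream events as candidate explanations."""
    findings = []
    same_day = [e for e in events if e.occurred_on == as_of]
    for name, new_value in today.items():
        old_value = yesterday.get(name)
        if old_value is None:
            findings.append(f"{name}: new metric, no baseline to diff against")
            continue
        change = abs(new_value - old_value) / max(abs(old_value), 1e-9)
        if change > tolerance:
            context = "; ".join(
                f"{e.kind}: {e.description}" for e in same_day
            ) or "no known upstream events -- investigate"
            findings.append(
                f"{name}: {old_value} -> {new_value} ({change:.1%}) [{context}]"
            )
    return findings

# Example: revenue moved 20% on the same day a schema change shipped.
events = [UpstreamEvent(date(2023, 5, 1), "schema_change",
                        "orders.amount renamed to orders.gross_amount")]
print("\n".join(diff_metrics({"revenue": 100.0}, {"revenue": 120.0},
                             events, date(2023, 5, 1))))
```

The point is the join between "what moved" and "what we know changed upstream": a flagged diff with no matching event is exactly the case that warrants a step-1-style investigation.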
