23 Comments
Sep 23, 2022 · Liked by Benn Stancil

So are data contracts basically what we'd have handled in "the old days" with CHECK constraints and referential integrity (RI) on the destination table? Define what is logically acceptable in the DDL, let the RDBMS keep bad data out at load time, and make sure the tool loading the data from raw knows how to handle constraint violations?
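For instance, a minimal sketch in generic SQL (table and column names invented for illustration):

-- Declare validity in the DDL and let the RDBMS reject bad rows at load time.
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers (customer_id),  -- RI
    quantity     INTEGER NOT NULL CHECK (quantity > 0),                -- logical validity
    order_date   DATE    NOT NULL
);

-- The loading tool then has to catch constraint violations and route the
-- rejected rows somewhere (e.g., an error table) rather than failing the batch.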

Sep 30, 2022 · edited Oct 1, 2022 · Liked by Benn Stancil

Two thoughts.

First, there is a category of issues around source data that cannot be detected or verified with automated testing. The recipient of the data can't know about them unless the provider of the data discloses them. And this relates to a category of statements in data contracts that, by definition, must be made at the level of business meaning rather than at the technical level.

Let me give an example: an inter-company interface we built recently that illustrates the business-meaning level of data contracts.

We have a client for whom we built an analytics platform that has, among its data sources, payroll information. The payroll information we receive concerns employees working on contracts. (I know, employees working on contracts sounds strange, but bear with me; it's an unusual payroll situation, and I can't give more business detail without disclosing information I shouldn't.)

So, the incoming data looks a bit like this:

EMPLOYEE_ID | CONTRACT_ID | BUNCH_OF_OTHER_FIELDS

It is delivered into our Snowflake environment via a private Snowflake data share, and the vendor does a batch recalculation of the payroll data every night.

The payroll data is supposed to contain one record each time an employee starts on a new contract, so the grain is one row per EMPLOYEE_ID per CONTRACT_ID. The business meaning is that an employee can be working on multiple contracts at once, and any new contract is supposed to be identified by a new combination of EMPLOYEE_ID and CONTRACT_ID. The other fields associated with a contract (start date, end date, payment rate, location, etc.) can change, and when a contract is updated, the fields associated with the existing EMPLOYEE_ID and CONTRACT_ID should, in theory, simply be updated in place.
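That declared grain, at least, is mechanically testable with something like this (a sketch only; the feed table name is invented):

-- Zero rows returned means the one-row-per-employee-per-contract grain holds.
SELECT employee_id, contract_id, COUNT(*) AS row_count
FROM payroll_feed
GROUP BY employee_id, contract_id
HAVING COUNT(*) > 1;

But as you'll see below, this test keeps passing even when the spirit of the contract is broken.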

The payroll vendor does have a fairly normalized upstream schema, a number of different tables that form the back end of their payroll processing system, but they will not provide it, for various legal reasons (which don't make sense; we've tried to get it, and they refuse). They provide only this one denormalized table.

Well, we recently found out that the data entry team at the payroll vendor, instead of editing existing contract data, sometimes just creates a new record: same EMPLOYEE_ID, new CONTRACT_ID. And we continue to receive the old record, too; records don't age out of the feed until a couple of years after the contract date. What makes this even more challenging is that it is very possible, and fairly common, for an employee to legitimately be working on two or more different contracts. So we can't reliably tell the difference between a true new contract and a change to an existing one. Yes, we could try all sorts of fancy comparisons on other fields (sketched below), but that is error-prone and unreliable, and any rules we create around it could break as the habits of the upstream data entry folks change. We also don't get actual database insert or update timestamps on the data, and because it is denormalized down from quite a number of source tables in a way that is not transparent to us, timestamps wouldn't really help anyway.
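To make the brittleness concrete, here is roughly the kind of comparison I mean (a sketch only; the table name and compared fields are illustrative):

-- Flag "new" contracts that look like re-entries of an existing one:
-- same employee, same rate, overlapping date ranges. Every condition
-- here is a guess about data entry habits, including the assumption
-- that CONTRACT_IDs increase over time, and each guess can produce
-- false positives or quietly stop matching.
SELECT curr.employee_id,
       curr.contract_id AS suspected_reentry,
       prev.contract_id AS possible_original
FROM payroll_feed AS curr
JOIN payroll_feed AS prev
  ON prev.employee_id = curr.employee_id
 AND prev.contract_id < curr.contract_id
 AND prev.pay_rate    = curr.pay_rate
 AND prev.start_date <= COALESCE(curr.end_date, CURRENT_DATE)
 AND curr.start_date <= COALESCE(prev.end_date, CURRENT_DATE);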

Thus, there's no reliable way for us to write a rule that detects when the upstream payroll vendor erroneously enters contract changes as new contracts instead of editing existing contracts as they should. Technically, everything is correct: there are no duplicates on the compound primary key of EMPLOYEE_ID and CONTRACT_ID. But the business rule, the spirit of the thing, is being violated. I don't see a way to automate or even detect this on our end, short of getting the data provider to agree to the spirit of the data contract, to put effort into working with their data entry team, and to monitor what the team enters on an ongoing basis to make sure it is correct.

Second thought.

I also think that discussing how data contracts work WITHIN an organization, where there can be informal agreements and negotiation, is a very different conversation from how data contracts can/need to/must work BETWEEN entirely different organizations. You can do a lot more legwork and relationship building within a given organization, where team members at least theoretically share some sense of mission, allegiance, or goal. When you go outside your organization entirely and need to build an ingest or feed with a whole different firm, one that likely has an entirely different set of incentives, timelines, goals, and personnel, it gets a lot tougher. My thoughts on building data interfaces between organizations, which are, unfortunately, a lot more formal and perhaps even litigious than yours, are here: https://jrandrews.net/risks-of-interfaces-with-partners/

TL;DR - IMHO, between organizations you really need an actual *legal* contract, with the technical folks involved in the negotiations and not just the attorneys, and with specific financial penalties for each enumerated breach of contract, to really push large organizations to work together.


The proposed data contract architecture feels like a combination of introducing version control and schema validation into general data management. I'm actually building something similar.

With this staging and schema validation approach, it sounds like the data producer and the data consumer each author their own schema, both write to a staging store, and run tests/validation before publishing/promoting to production. However, this seems to make the transformation process (dbt) responsible for maintaining compatibility between the data producer and consumer, since a schema change from either side could break the transformation. Fixing such issues could become a headache, because ownership of the transformation may be split between the producer and the consumer.
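For what it's worth, the promotion gate could look something like this on Snowflake (a sketch only; the table names are invented): compare the staging table's columns against production before swapping, so a schema change from either side fails the promotion step rather than silently breaking downstream transformations.

-- Columns production expects that staging is missing or has retyped:
SELECT prod.column_name, prod.data_type
FROM information_schema.columns AS prod
LEFT JOIN information_schema.columns AS stg
       ON stg.table_name  = 'EVENTS_STAGING'
      AND stg.column_name = prod.column_name
      AND stg.data_type   = prod.data_type
WHERE prod.table_name = 'EVENTS'
  AND stg.column_name IS NULL;

-- If that returns zero rows, promote atomically:
ALTER TABLE events SWAP WITH events_staging;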

Jan 9, 2023 · Liked by Benn Stancil

Hi Benn - I thought I'd add to the data contract discussion, and you might be interested in this: https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c

It also references your 'better calendar' post.

Sep 23, 2022 · Liked by Benn Stancil

This essay provoked me so deeply my intended "comment" evolved into a full-fledged blog post:

https://ihack.us/2022/09/23/beyond-data-contracts-a-response-to-benn-stancil/

Would love your feedback!
