So are data contracts basically what we'd have handled in "the old days" with CHECK constraints and RI on the destination table? Define what is logically acceptable in the DDL and let the RDBMS keep bad data out at load time, and make sure that the tool loading the data from raw understands how to handle constraint violations?
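(For readers who haven't worked that way, a minimal sketch of what that looks like - the table, columns, and allowed values below are purely illustrative, not from any real system.)

```sql
-- Illustrative destination table: the DDL itself declares what's logically
-- acceptable, and the RDBMS rejects violating rows at load time.
CREATE TABLE warehouse.orders (
    order_id      INTEGER       NOT NULL PRIMARY KEY,
    customer_id   INTEGER       NOT NULL
                  REFERENCES warehouse.customers (customer_id),             -- referential integrity
    order_status  VARCHAR(20)   NOT NULL
                  CHECK (order_status IN ('open', 'shipped', 'cancelled')), -- CHECK constraint
    order_total   NUMERIC(12,2) CHECK (order_total >= 0)
);
-- The loading tool then has to decide what to do with rows the database
-- rejects (quarantine them, fail the batch, etc.).
```

(Worth noting that some modern cloud warehouses don't enforce most of these constraints, which is partly why the question resurfaces.)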
Basically? Most old architectures are probably good, actually. When we reinvented a lot of data tools, we ended up throwing out a lot of babies with the bathwater (e.g., semantic layers). That doesn't mean we should reincarnate the old stuff exactly as it was. But the fact that it was done similarly in the past is probably a good sign rather than a bad one.
But what about when source systems change semantics without changing schema? A field that formerly represented stock in individual units now represents stock in hundreds. A field that used to contain a 10-character alphanumeric identifier has been expanded to 15 characters. These kinds of semantic shifts can be virtually impossible to detect early without sophisticated, column-level anomaly detection. I hate the idea that the "solution" is to throw engineering hours and cloud spend at what's fundamentally a change management problem.
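(As a rough illustration of what column-level detection means here, a minimal sketch for the identifier-length case - table and column names are made up; the unit-of-measure shift would need something more statistical, like watching the value distribution.)

```sql
-- Hypothetical monitoring query: flag any load date on which the identifier
-- column starts carrying values longer than the documented 10 characters.
SELECT
    load_date,
    MAX(LENGTH(item_identifier)) AS max_identifier_length
FROM raw.inventory_feed                  -- made-up source table
GROUP BY load_date
HAVING MAX(LENGTH(item_identifier)) > 10 -- documented limit; anything longer signals a change
ORDER BY load_date;
```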
Absolutely nobody would be fine if IT changed everyone's Salesforce logins over the weekend with no notification. So why can sales ops add or remove fields or completely change field behavior without notifying other stakeholders in the business? I understand this is how things have worked in the data space basically forever, but can't we strive to be better?
Absolutely, they *should* tell people; I'm not arguing against that. But I think that should be the extent of their responsibility. If it's better for sales to represent stock in 100s, they should do that without needing to consult anybody.
More broadly, I fully agree that this kind of "semantic observability" is the really hard problem. And I don't really have any idea how we deal with it: https://benn.substack.com/p/semantic-observability
But doing so impacts the reports that purchasing and warehousing rely on to do their jobs, or it impacts financial reports that ultimately surface to C-level or the street. A small change can cause dozens, hundreds, or thousands of hours of work downstream. How do we handle this very real burden data source owners are putting on the rest of the business?
But I think your point is that this ripple effect is in itself dysfunctional. Does it really make sense that Bob in sales ops can cause 50 different reports to fail by deprecating a custom field? Why are we even in the position where teams who are just using software (albeit complex enterprise systems) can cause widespread damage through routine changes?
As you said, there's no easy fix. Change management is burdensome, but the fallout from allowing teams to eschew it is huge. On the other hand, the virtually exclusive model for populating a data warehouse / lake / lakehouse is to dump a bunch of operational data into it. I'm sure that Bob would be more than happy if we all got out of his business and let his team operate from his ERP's reporting system.
To me, that's the exciting thought behind data contracts: that we can start to build software that decouples operational data stores from analytical outputs, and we can have the people best equipped to define and manage semantics (product owners, application developers, and systems operators) help establish standards for what information their systems will emit.
The million dollar question is how we ride out the wave until we have practices and technologies that actually get us there. I'll anxiously be awaiting the answer in next week's post!
That seems like it's mostly just damning of the modern data stack as a whole. Which might be the real question in all of this. If something can fall apart so easily, or we need this entire system of rules and governance to prevent one person from breaking everything, the problem might not be the rules or the person who broke it; it might be the thing itself.
I don't know what we do about that either, though. It seems like our current approach is to keep throwing more supports in various places (more tests, observability, data contracts, etc.) and hope we hold it all together. Though maybe that's the whole pitch behind data contracts (and, to some extent, the data mesh): what you actually need is an architectural change, not another testing framework.
I really like the way SchemaVer approaches change management: MODEL-REVISION-ADDITION. https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/common-architecture/schemaver/
It may not be perfect, but I hope (fear?) this properly characterizes the problem that CAN be solved.
Two thoughts.
First, there is a category of issues around source data that cannot be detected or verified with automated testing. It can't be known by the recipient of the data unless the provider of the data informs them of it. And this relates to a category of statements in data contracts that, by definition, must be made at the business-meaning level and not the technical level.
Let me give an example of an inter-company interface that we've built recently which relates to the business-meaning level of data contracts.
We have a client for whom we have built an analytics platform which has, among its data sources, payroll information. The payroll information we receive has to do with employees working on contracts. (I know, employees working on contracts seems strange, but bear with me - it's an unusual kind of payroll situation, and I can't give more business detail without providing inappropriate information.)
So, the incoming data looks a bit like this:
EMPLOYEE_ID | CONTRACT_ID | BUNCH_OF_OTHER_FIELDS
It is provided in our Snowflake environment via a private Snowflake data share, and the vendor does a batch recalculation of the payroll data every night.
The payroll data is supposed to provide one record every time an employee starts on a new contract. So the grain is one row per EMPLOYEE_ID per CONTRACT_ID. The business meaning of this is that an employee can be working on multiple different contracts. Any new contract is supposed to be defined by a new combination of EMPLOYEE_ID and CONTRACT_ID. There can be updates to the various other fields associated with a contract such as the contract start date, end date, payment rate, location, etc. When a contract is updated, theoretically, the fields associated with the same EMPLOYEE_ID and CONTRACT_ID should just be updated.
Also, the payroll vendor does have a fairly normalized schema upstream of a number of different tables which are the back-end of their payroll processing system, but they will not provide that for various legal reasons (which don't make sense, and we've tried to get it, but they refuse.) They only provide this one denormalized table.
Well, we recently found out that the data entry team at the payroll vendor, instead of editing existing contract data, sometimes just creates a new record - same EMPLOYEE_ID but new CONTRACT_ID. And we also continue to get the old record. Records don't age out of the feed until a couple of years after the contract date. What makes this even more challenging is that it is very possible, and fairly common, for an employee to legitimately be working on two or more different contracts. So we can't reliably tell the difference between what is actually a true, new contract versus what is actually a change to an existing contract. Yes, we could try all sorts of fancy comparisons on other fields, but that is error-prone and unreliable, and any rules we create around that could also break as the habits of upstream data entry folks change. We also don't get actual database insert or update timestamps on the data - and because it is denormalized down from quite a number of source tables in a way that is not transparent to us, timestamps wouldn't really help anyway.
Thus, there's no reliable way for us to write a rule to detect when the upstream payroll vendor is erroneously entering contract changes as new contracts instead of editing existing contracts as they should. Technically, everything is correct - there are no duplicates when looking at the compound primary key of EMPLOYEE_ID and CONTRACT_ID. But - the business rule, the spirit of the thing, is being violated. I don't see a way to automate this or even detect it on our end except for actually getting the data provider to agree to the spirit of the data contract, and to put effort into working with their data entry team, and also monitoring what they are entering on an ongoing basis to make sure that it is correct.
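(To make that concrete: the kind of test a contract would typically encode - uniqueness on the compound key, roughly as sketched below with a hypothetical table name - keeps passing even while the business rule is being broken, because the mis-entered edit simply arrives with a new CONTRACT_ID.)

```sql
-- Typical automated check: assert the stated grain of one row per
-- EMPLOYEE_ID / CONTRACT_ID. It returns zero rows (i.e., "passes") even when
-- a contract edit has been wrongly keyed in as a brand-new contract.
SELECT
    employee_id,
    contract_id,
    COUNT(*) AS duplicate_rows
FROM payroll_share.contract_feed   -- hypothetical name for the shared table
GROUP BY employee_id, contract_id
HAVING COUNT(*) > 1;
```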
Second thought.
I also think that discussing how data contracts work WITHIN an organization - where there can be better informal agreements and negotiation - is a very different conversation from how data contracts can/need to/must work BETWEEN entirely different organizations. You can do a lot more legwork and relationship building within a given organization, where team members at least theoretically share some sense of mission, allegiance, or goal. When you go entirely outside of your organization and need to build an ingest or feed with a whole different firm that likely has an entirely different set of incentives, timelines, goals, and personnel, it gets a lot tougher. My thoughts on building data interfaces between organizations, which are, unfortunately, a lot more formal and perhaps even litigious than yours, are here: https://jrandrews.net/risks-of-interfaces-with-partners/
TL;DR - IMHO, between organizations you really need an actual *legal* contract, with the technical folks involved in the negotiations and not just the attorneys, and there need to be specific financial penalties for each enumerated breach of contract, to really push large organizations to work together.
On the second point, I agree - it's a very different game between companies than within them. Which is part of why I'm somewhat skeptical of the whole contract architecture in the first place, because it applies something that's meant to govern two somewhat adversarial parties to a relationship that should be more harmonious. It won't always be, sure, but the better solution seems to just be something like, "work together and get along." Just as we don't have to use parliamentary procedure to talk over the dinner table, teams shouldn't need legal-like contracts to work together.
On the first point, I'm kinda mixed on this. There's an even simpler version of this that nearly every team deals with: Sales people have to enter contract data into Salesforce and they type it manually. They can just get it wrong. No contract or test will catch when a rep inputs that they sold a deal for $10,000 that was actually worth $15,000. The rule is basically, do it right.
On one hand, that problem *has* to live with the data producer. On the other hand, if we want to claim that "get it right" is a contract, that really blurs the line for me for what a contract actually is. Moreover, while we (as data people) can't really test for that problem, I still think it's on us to make what we have as robust as we can. There are plenty of things that we can catch that aren't like this, or like the example you described, and I think we should worry about that a lot more than defining these legal interfaces with other teams.
What does it mean to "work together and get along?" Maybe this is a mental model problem. There are two models that I can think of for corporations of any real significant size to receive services from an external entity:
1. A transactional negotiation - essentially a utility: power, water, mail/shipping services, etc. Terms are relatively concrete, definable, granular, and discrete: kilowatt-hours, gallons, packages delivered within a given timeframe, etc.
2. A "relational" negotiation. Where the terms of the agreement are difficult to entirely provide numerically, there is some level of understanding required, and the whole is greater than the sum of its parts. A few examples of this - hiring outside legal counsel, hiring an outside accounting firm, anything regarding data...
When we sell things to others, we always want a relational sale because it is stickier and it is harder for the buyer to disconnect. But, as buyers, we usually want transactional relationships because we want simplicity, greater objectivity, and more optionality to switch to another provider of the service if the seller is not providing what we want.
As buyers, do executives and decision-makers (who often are not technical) want #1 or #2? They will always gravitate to #1.
"Relational relationships" take a lot more effort than transactional ones do, and all executives become exhausted with negotiation and relationship-building. Over time, the draw, in all spheres, is to move towards saying "can't it just be simpler" (regardless of whether it can or not), and also "I don't understand and because I don't understand you should make it simpler so I can." We will never get to a world where all executives are truly data-savvy, and I'm not sure we should either, because understanding other areas of a business are just as important.
So by saying "we should work together and get along", if I understand correctly, we are saying "you should commit over time to spending more effort and energy and be more entangled with external 3rd parties." No decision-maker, particularly no non-technical decision-maker, is going to choose that unless they absolutely have to, and even if they "have to" from an objective perspective, if they don't understand why they have to, they'll still try to make alternative choices. I understand why it is hard to boil down data interfaces to transactional relationships, for the reasons we have discussed above, but it was also hard to boil down many other now-transactional services to become transactional when they started in the early days. Thus my blog post was an attempt to start to boil down interfaces to transactional relationships, depressing though the contractual language may be.
That feels overly pessimistic to me. I agree that, at certain scales, we need the transactional relationships. But at smaller scales, I think it's actually far more expensive to build that transactional relationship than it is to build the relational one.
Take the extreme example of a data person and a business person working together to solve a specific problem. The collaboration model there is, "get in a room and figure it out." That's much faster and much easier than trying to build some formal transactional relationship where they only communicate in very concrete ways. At that scale, I'd argue the opposite point from what you're saying: No decision maker (or reasonable person) is going to choose the transactional relationship unless they absolutely have to.
My broader point, then, is that things like data contracts (and the data mesh) are often over-engineered solutions that are probably necessary at very large scales, or between somewhat adversarial parties (like 3rd parties), but aren't necessary for working teams. I do think you're on to something about a lot of people wanting to avoid relational relationships, but I don't think that makes that behavior right. I think it's because people are mostly lazy and don't want to do the hard work of building the relationship, and think they can shortcut it with rules. In practice, though, that's not how collaborative organizations work. Sometimes, the thing people don't want to do is the thing they should do.
I think we are actually more agreeing than disagreeing. I was more specifically talking about the larger-scale inter-company (not intra-company) kinds of things that need to persist past individuals and for long periods of time - like at least a few years.
That's a good point too - how long it needs to last matters. It's not a thing we talk about often.
The proposed data contract architecture feels like it introduces version control and schema validation into general data management. I'm actually building something similar.
With this staging and schema validation approach, it feels like the data producer and data consumer will each author their own schema, write to a staging store, and run tests/validation before publishing/promoting to production. However, this seems to make the transformation process (dbt) responsible for maintaining compatibility between producer and consumer, since any schema change from either side could break the transformation. Fixing such issues could become a headache, as ownership of the transformation could be split between producer and consumer.
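(Concretely, the kind of pre-promotion check I have in mind looks roughly like the sketch below - compare the staging table against the schema both sides agreed to before anything is promoted. All object names are made up.)

```sql
-- Hypothetical gate before promoting staging data to production:
-- any column that exists in staging but isn't in the agreed contract
-- (or has a different type) shows up here, and the promotion is skipped.
SELECT column_name, data_type
FROM staging_db.information_schema.columns   -- made-up database
WHERE table_schema = 'STAGING'
  AND table_name   = 'ORDERS_STAGE'
EXCEPT
SELECT column_name, data_type
FROM contracts.agreed_orders_schema;         -- made-up table holding the agreed schema
```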
Yeah, agree that it could be a headache, but what's the alternative? Plus, having a kind of demilitarized zone where both sides have to agree feels like 1) the only solution that isn't effectively one side saying "just do it my way or else" and 2) something that gives the consumer (who, rightly or wrongly, is the one who gets blamed when things break) a way to keep bad data out without support from the producer. The producer may make the consumer's life harder, sure, but better that than the producer being able to break things outright.
Hi Benn - I thought I'd add to the Data Contract discussion and I thought you might be interested :) https://medium.com/@maxillis/on-data-contracts-data-products-and-muesli-84fe2d143e2c
Also references your 'better calendar' post
Nice, thanks for sharing!
So, this all makes sense, except it still leaves open what seems like the really big question I still have about data contracts: What's in it for the creator?
There's a contract between muesli producers and oatmeal enthusiasts because muesli producers *want to sell muesli.* If they do a good job of producing it - and the oatmeal people trust it - they'll make money. That's not really true for data providers. They're software developers building software, or sales ops leaders building a CRM. They want to do those things first, and if they can provide good data as exhaust, great. But it's a secondary obligation, and one that we don't reward them for. (And sure, there are teams who exist to provide data as inputs to models and all of that, and in those cases, sure, data contracts are useful formalizations. But that seems to be the less common case.)
You allude to this at the end of the post when you ask where the value is. And I agree that those things could be built on top of better data, but that strikes me as an unconvincing sales pitch to the data producers: If you give us good data, imagine all of these cool things we could hypothetically do. It seems like we'd need to give people more concrete reasons before they'd agree to the contract.
It is if the team's job is collect and emit data (they
Thank you Benn!
Unfortunately it looks like the end of your comment got cut off just as it got interesting!
Yes, I definitely concede that's still a concern: Funnily enough, when I ran a draft of this by a data engineer friend of mine, he also commented that this all makes sense but seemed a bit one-sided - what's in it for the data engineer? I think that's quite hard to define - indeed, defining the value of data teams is - imho - inherently quite hard. I'm leaning towards the goal of data teams being to make data more valuable: https://twitter.com/imightbemary/status/1601274970272780289
But I think there are a couple of things we can consider:
- Humans are a funny bunch, we do quite a lot of things altruistically (writing blog posts for example) - what's in it for the author? Some kudos perhaps? Does the accumulated time spent reading a post outweigh the time spent writing it? Where's the return? I think there's quite a lot of 'doing the right thing' in all building of Products.
- Everyone likes good documentation with their Data - although I definitely concede that's not often the case - is it so much of a stretch to ask Data Engineers to record some of that documentation in a Data Contract? And when we come to building 'Products' we should design by 'affordance' - they should be easy and obvious to use - and that should include documentation - and I believe _Contracts_.
But I'm not sure altruism will cut it! :)
- I mentioned in the post that the typical flow is something like Raw --> Landed --> Data Product and I talk a bit about 'ingredients', but of course it's more complicated than that. More like: Stream raw data from App --> on-line database --> CDC to Kafka --> Land raw data in Data Lake --> ETL/ ELT within Data Lake --> Transform/ Combine --> Data Warehouse --> Data Product --> Combine Data Products --> BI Tool/ Model/ Analysis/ API - there are a lot of 'ingredient' products.
If every 'hop' has a Data Contract which is as simple as necessary for the purpose of serving data to its users, you can build/compose those into a final Data Contract that should be relatively simple/low-cost to create. And I think those are pretty lightweight:
- App --> Database is probably just guarantee at least one message
- Database --> Kafka is probably that ^ plus schema
- Kafka --> Data Lake is that ^ plus tags
etc, etc. (I concede that any transformation work becomes more complex)
And of course every engineer here (making the assumption of different teams) benefits from the upstream team's contract, so by encouraging everyone to act as the proverbial good citizen, we might nudge each team to produce a contract for the customers they serve.
A lot of the information needed is well known and understood at build-time - some clever deployment script work could probably go a long way. Some tooling would definitely help.
And of course there is also a precedent - the OpenAPI Specification for APIs is really quite similar in many respects.
So yes, that 'why' is still something to be explored, but maybe the cost isn't prohibitive...
Ahh, I think what I was going to say was that if a team's job is to provide data, then I'm assuming they'd be much more open to "signing" a data contract, and I'm all for them. But if we limit contracts to those cases, the whole data contract thing seems like a much smaller deal, because 1) it's just a programmatic formulation of agreements that are already in place, which is good, but kinda just an alert, and 2) it doesn't address the bulk of the data quality problems. Neither of which means it's not a good innovation; just a limited one.
So on the bigger question of getting data providers on board, I think altruism is probably mostly good enough actually; the problem, though, seems like converting that altruism into a willingness to commit to something.
Take the App --> Database contract, where the guarantee is to send at least one message. In most cases, I'd assume that's what app builders would want anyway. So the contract is, "don't break it." Sure, they'd agree to trying to do that - they don't want to break things either. And they'd probably agree to it altruistically, because they want to help out.
But say the contract is something that causes more interference, like a billing plan type only being one of two values. They may want to change that. So would they agree to the contract? Yes, but like, not really? They'd say sure, we can try to commit to that now too, because we don't want to break stuff either. But we might also change it later, and the best we can do is tell you that it's changing.
But in both of these cases, the contract is essentially just documentation and alerting. And that's ultimately what I think we should be focused on instead. Instead of trying to get commitments from other people, we should start with our own needs, document and alert ourselves when they change, and then use those headaches to get more buy in from other people to be more careful about what they do. That's not structurally *that* different, but it's organizationally very different, because it doesn't seem to offload the problem on someone else. As contracts are proposed now, it feels a lot like what we're trying to do is tell eng teams we're frustrated by their bugs and we would like them to sign an agreement not to create any more.
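(Concretely, by "document and alert ourselves" I mean something as simple as the sketch below - made-up table and values, riffing on the billing plan example: we write down what we currently depend on, and a query tells us when reality drifts from it, no signature from the app team required.)

```sql
-- Hypothetical self-serve alert: we've documented that billing_plan_type
-- should only be 'monthly' or 'annual'. If the app team ships a third value,
-- this starts returning rows and we get notified - their only obligation is
-- to care when we come asking about it.
SELECT
    billing_plan_type,
    COUNT(*) AS unexpected_rows
FROM raw.app_subscriptions                      -- made-up landed table
WHERE billing_plan_type NOT IN ('monthly', 'annual')
GROUP BY billing_plan_type;
```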
Yep, I think I'd agree with that. And organisationally speaking, I definitely agree.
Perhaps the approach then is more along the lines of: "you're doing this thing to improve DQ/documentation/reduce outages/whatever anyway - but if you did that thing in *this* manner - which we call a 'contract' - and it's kind of standardised, then we'd enable all these use cases too. And please tell your friends."
Interesting chat Benn - thank you :)
Ah, I like that version of it, where the trade is something like, we'll write the documentation for you if you agree to maintain it. Rather than us saying "here are our demands, comply," we're saying, "hey, we documented and built alerts around what you're trying to produce; if you can agree to keep to this documentation, we'll keep telling you when something unexpected happens." In that way, we're not the cops; we're more of a QA team. We'll do QA for you if you agree to fix stuff that breaks.
And same, thanks for all the thoughts!
This essay provoked me so deeply my intended "comment" evolved into a full-fledged blog post:
https://ihack.us/2022/09/23/beyond-data-contracts-a-response-to-benn-stancil/
Would love your feedback!
I like this a lot. I very much agree on the first point that we should worry more about failing gracefully ourselves, before pushing hard on others to save us from that failure. We can and should ask for help, but I don't think it's reasonable for us to ask other people to care about our problems as much as we need to.
I also like the bit about transparency mattering more than compliance. I think that's the root of this whole data contract thing, to be honest: It feels like trying to put rules in place when collaboration is what we really need. To your point, it's going to be messy; we can't really control that. Governing the mess probably doesn't work.