Isn't this, in essence, Denodo?
It's probably similar. The difference, I think, is that you can adopt dbt with less ambition. It can be a narrow solution to a clear problem (building data models) and expand from there. Tools like Denodo could potentially fill the same space, but they 1) require a lot of effort to get going, and 2) are often hard to understand. While dbt could become a buzzy thing like a "data OS," it doesn't have to start there.
I like the analogy with an OS, but I doubt we will have something like this in the next couple of years, and I'm even more skeptical of dbt's ability to pull this off. dbt is a transformation layer with some cool things on the side (it builds lineage and processing dependencies from ref(), has some minimal testing capabilities, and a couple of minor features around deployment and snapshots), but it's a long shot to convert everything out there to dbt. dbt still isn't suitable for real-time processing, native transactions or procedural support, mutations, and much more. Oh, and by the way, dbt does integrate today with lots of vendors (Airbyte, Fivetran, Hightouch, Paradime, etc.) and acts as the main hub. It's just a small, limited hub :)
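For what it's worth, here's a minimal sketch of what that ref()-based wiring looks like in a dbt model (the model and column names are made up for illustration): dbt parses the ref() calls, builds the dependency graph and lineage from them, and runs models in the right order.

-- models/orders_enriched.sql (hypothetical model name)
select
    o.order_id,
    o.ordered_at,
    c.customer_region
from {{ ref('stg_orders') }} as o          -- ref() points at an upstream model;
left join {{ ref('stg_customers') }} as c  -- dbt builds the DAG from these calls
    on o.customer_id = c.customer_id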
I think we should have an open protocol and enforce it on the big players (Snowflake, BigQuery, Databricks). The protocol would define how each vendor integrates and exchanges data with other vendors, along with a suite of tests to pass when you integrate. The underlying layer (the OS) should be the cloud data warehouse platform, and every such platform should adhere to those standards.
You'll find that in the meantime Databricks is working on this open standard: https://databricks.com/blog/2021/05/26/introducing-delta-sharing-an-open-protocol-for-secure-data-sharing.html
I don't disagree that an open standard is ideal; the question for me is how do we get there? This isn't just hypothetical either - SQL itself is a long-standing open standard with a ton of community traction, and databases can't even stick to one version of that.
To me, dbt is in a good position here largely because it can become the standard without people agreeing to it. If people use it, we can move in that direction; there doesn't need to be any central coordination or consortium to decide this is now the way to do things.
Great article!
To me, SQL is the "common utility".
dbt does "in batch" what Trino/Presto does "live". Trino/Presto is able to join across disparate databases, e.g. more than a switchboard (or at least the same level of switchboard as dbt). So either (or both) are viable and great options because they are both SQL-based abstractions.
I also agree the formal definition of the mesh is lacking on this point: "What happens on the other side of the mesh, or even what the mesh actually is, isn’t discussed"
I agree that both could be; the question for me is will either inject even more into their abstractions. It's one thing to fairly quietly pass one flavor of SQL into another database; it's something quite different (and potentially better, though who knows) to abstract away much more complex operations and functions.
Thanks for the reply. Yes, I agree with that, but it starts to turn into a slippery slope because you are again centralizing rather than decentralizing. To me, centralization should happen at the metadata layer; that should be the glue that allows all the magic to happen on top. Magic = a common language to query data (which, imo, should be SQL), used either interactively (Presto) or in batch (dbt), with metadata as the mechanism for creating the meaningful abstractions so it all "works".
Here's a shameless plug on something I wrote about this from an Immuta perspective (my company): https://www.immuta.com/articles/sql-is-your-data-mesh-api/
Yeah, this mostly makes sense to me. The one gap to me is how you handle cases that blend what Presto + dbt do. Presto is the interactive query layer; dbt is the batch modeling layer. But sometimes (e.g., metrics) you need to model interactive things. To me, the mesh/OS/whatever layer would ideally help out here.
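Just to make that concrete (the table and column names here are made up): the batch layer can build the underlying table, but a metric like revenue still has to be aggregated at whatever grain the asker chooses, and that only happens at query time.

-- analytics.orders is a batch-built (dbt) model;
-- the metric itself is computed interactively, at a grain chosen at query time.
select
    date_trunc('week', ordered_at) as week,
    sum(revenue) as weekly_revenue
from analytics.orders
group by 1;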
I think that's the one place where I disagree with your post. Governance isn't just access control. It's also governance of business logic, which includes both batch modeling and interactive metric calculation. The Presto + dbt combination doesn't yet help with the latter.
Isn't this mess part of the process of technology evolution? Think about how we consumed music in the '90s. Grandparents had record players, because their collection was big and they sounded good. Parents had some records, maybe a few 8-tracks, and a decent tape collection (even if some of those tapes were mixed from the record collection). Poor kids had tapes, wealthy kids had CDs, and rich kids had MiniDiscs. Then the 2000s trashed all that with MP3s, MP4s, iTunes, and Napster. Now the cool kids are buying records again.
Also, as I am applying for my first analyst job, I see this same type of tool-overload dysfunction every day. Each job board wants to process my resume slightly differently. Then the application might happen through the job board itself, or redirect to the company. Sometimes the company processes it internally; sometimes it is processed by a 3rd party where I have to set up an account and password. Sometimes the company I am applying to is a staffing company, a recruiting company, or a contract company, and they use a 3rd-party processor to vet for the 4th? party. Sometimes the posting is a cross-posting from another job board: a recruiter posting for a 3rd party that uses yet another 3rd party to process the application. Maybe applications would be a good application for "The Mesh"?
Yeah, I don't think the mess is necessarily a sign of things going wrong. Analytics work goes through a similar process (https://benn.substack.com/p/analytics-is-a-mess), as does any sort of product design. (https://medium.com/design-leadership-notebook/the-new-double-diamond-design-process-7c8f12d7945e). In some ways, the mess is a good thing. Provided, at least, that at some point it starts to get cleaned up.
The job application point is an interesting one. That one feels like the xkcd comic about standards (https://xkcd.com/927/), where anytime anyone tries to unify it, they end up fragmenting it further. (I've always felt something similar about messaging services actually, where I want to have one place to get texts, iMessages, Messenger messages, WhatsApp messages, etc, rather than having five places to remember to text people back)