The most brilliant article ever. Consider me a disciple.
I do have a quibble with the title though. You are (rightly!) calling for pull-based DAG execution instead of push execution (as in dbt run). The DAG itself is still invaluable; the problem is the push (vs. pull) orchestration model.
Fair: https://twitter.com/bennstancil/status/1558125437909352448
As Ernest said, the title is a bit of an artistic liberty. The DAG still needs to exist, but as an under-the-hood thing, not as the primary interface.
I had a bit that I cut from the final piece about how we tend to ship technical designs to users. If some problem needs a particular skeleton to solve it (e.g., a DAG), we make an interface for that skeleton. In some cases, that works; in other cases, it's just confusing. DAGs are the latter for me. (I'd argue we also do this with OLAP cubes: https://benn.substack.com/p/ghosts-in-the-data-stack)
I think the point is that we shouldn’t treat the DAG as the primary abstraction. Rather, I specify what I care about for this particular model, and the system is empowered to create ad hoc DAGs as needed to generate that data. Very much the way Makefiles determine which source files to recompile.
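To make the Makefile analogy concrete, here's a toy Python version of make's rule (the file names are made up, and this isn't how make is actually implemented, just the shape of the check):

```python
import os

# A toy version of make's pull-based rule (illustrative file names): rebuild a
# target only if it is missing or older than any of its dependencies.
def needs_rebuild(target, dependencies):
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

if needs_rebuild("app", ["main.c", "utils.c"]):
    print("recompile: a source file changed since the last build")
else:
    print("up to date: nothing to do")
```

Nothing runs on a schedule; the work is pulled into existence only when someone asks for the target and it's out of date.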
We are all saying the same thing. The DAG represents dependencies, not executions. We are just using a different decision / timing system to decide when to execute each arc.
I agree, but I think the big shift Benn is pitching is for users NOT to need to be aware of (or work around) the DAG, but for the DAG to dynamically reconfigure to support user expectations.
I am not sure I follow. The dependencies are static, are they not? This means the DAG is also static. The execution of individual arcs in the DAG is the dynamically determined part, not the DAG itself, no? The system does not create dependencies on the fly. The way I see it, the centrality of the DAG as a representation of dependency, lineage, and data flow is not diminished. What am I missing?
Ah, you hit the nail on the head. Yes, the DAG in theory is just a list of dependencies. However, in practice it does NOT contain information about the desired latency/freshness/deadline for each component, and thus the whole thing ends up being rerun en masse.
The call is to annotate the DAG with this additional metadata, so the system can intelligently determine the “minimal subset” that must be run to satisfy the user's desires.
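As a minimal sketch of that idea, assuming hypothetical per-table staleness annotations (none of this is any real tool's API):

```python
from datetime import datetime, timedelta

# A hypothetical freshness-annotated DAG: each node declares its upstream
# dependencies and a maximum acceptable staleness.
DAG = {
    "raw_orders":  {"deps": [],             "max_staleness": timedelta(hours=1)},
    "stg_orders":  {"deps": ["raw_orders"], "max_staleness": timedelta(hours=4)},
    "fct_revenue": {"deps": ["stg_orders"], "max_staleness": timedelta(hours=24)},
}

# When each table was last materialized (illustrative timestamps).
last_built = {
    "raw_orders":  datetime(2022, 8, 15, 12, 30),
    "stg_orders":  datetime(2022, 8, 15, 6, 0),
    "fct_revenue": datetime(2022, 8, 14, 12, 0),
}

def nodes_to_refresh(requested, now):
    """Walk upstream from the requested table and return only the nodes whose
    staleness exceeds their declared guarantee: the 'minimal subset'."""
    stale, stack = set(), [requested]
    while stack:
        node = stack.pop()
        if now - last_built[node] > DAG[node]["max_staleness"]:
            stale.add(node)
        stack.extend(DAG[node]["deps"])
    return stale

# raw_orders is only 30 minutes stale, so just the two downstream tables rerun.
print(sorted(nodes_to_refresh("fct_revenue", datetime(2022, 8, 15, 13, 0))))
# ['fct_revenue', 'stg_orders']
```

The point is that only the tables whose guarantees are violated get rebuilt, instead of the whole DAG being rerun en masse.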
Sounds like you are proposing a declarative approach to data orchestration, which would be similar to how Kubernetes approaches container orchestration: you declare a desired state and the system figures out how to get to that desired state.
Basically, yeah (though Kubernetes is the most confusing thing in the world to me, so I can't say if this and that are the same).
Dagster does this (and describes it with the same wording, Software-Defined Assets). K8s is also the most confusing thing in the world to me.
I've heard they have something similar. I need to look into it, clearly.
I am definitely on board with this. Intuitively I’ve felt this way for quite some time, but had to read your brilliance to articulate it.
I wish more people in data engineering had bash and C skill sets, or at least more Java. Plenty of patterns there for some transfer learning.
I still haven’t used dbt, but I’m pretty sure the previous engineer who built the Oracle pattern (a code base I now own) wrote his own dbt with just SQL and bash. With a lot more steps… but still. Impressive.
There are definitely lots of internal dbts (and internal Fivetrans, and internal Modes, and internal Segments, and...) floating around out there.
I'm not sure how you'd do it, but I bet it'd be interesting to see what people came up with for those internal tools that vendors haven't.
I have some good news. A declarative pipeline approach like the one you’re describing already exists. It’s very popular and has a lot of features.
https://youtu.be/pRGNxIz6GzU
Yeah, Dagster is the closest thing that I know of, though it's not quite exactly this (and layers on a bunch of other orchestration functionality/complexity that wouldn't strictly be needed for just a scheduler).
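For anyone curious what that looks like in practice, here's a rough sketch of Dagster's software-defined assets, based on my read of its documented `@asset` API (details vary by version, and the freshness/scheduling layer that sits on top is omitted here):

```python
from dagster import asset

# Each function declares the asset it produces; Dagster infers the dependency
# graph from the parameter names rather than from an explicit task DAG.
@asset
def raw_orders():
    # e.g., pull rows from a production replica (stubbed out here)
    return [{"id": 1, "amount": 42}]

@asset
def stg_orders(raw_orders):
    # cleaned/standardized version of the upstream asset
    return [row for row in raw_orders if row["amount"] > 0]

@asset
def fct_revenue(stg_orders):
    # aggregate the staged data into the table analysts actually query
    return sum(row["amount"] for row in stg_orders)
```

You declare the assets and their inputs; deciding when and what to materialize is the orchestrator's job.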
I like the idea of working backwards, and the analogy of how a passenger only cares about the departure/arrival time makes a lot of sense! I think this framework works great for scheduled, regular jobs.
But out of curiosity Benn, how did/would your reverse orchestration system deal with ad-hoc data dumps from production databases?
I'm not sure I follow... what's an ad hoc data dump from a production database?
Maybe I can share an example here - we have a table (that's built incrementally with daily data) that has a column with weird responses because of a buggy web scraper. If I would like to run a one-time job to rebuild that day's data to fix that problem so that I can work on that now, rather than waiting till tomorrow, how would I do that in your system?
Ah, gotcha. In both Integritie and Easybake, we had ways to "update this table now," which basically just ran the job and upstream jobs (if necessary). It'd operate the same as if you set the table's latency guarantee to some value less than however stale it was currently.
Removing the DAG-oriented schedule didn't mean you couldn't do things manually as needed; it just meant the automatic updates happened in a different way.
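A toy illustration of that equivalence (hypothetical names, not either tool's actual code):

```python
from datetime import datetime, timedelta

def needs_refresh(last_built, max_staleness, now):
    """A table is due for a rebuild once its staleness exceeds its guarantee."""
    return now - last_built > max_staleness

now = datetime(2022, 8, 15, 13, 0)
last_built = datetime(2022, 8, 15, 9, 0)                    # 4 hours stale

print(needs_refresh(last_built, timedelta(hours=24), now))  # False: within guarantee
# A manual "update this table now" behaves like temporarily shrinking the
# guarantee below the current staleness, so the same refresh logic kicks in:
print(needs_refresh(last_built, timedelta(0), now))         # True: rebuild it
```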
Excellent article, as always, telling the perfect story. I fully agree and wrote about the same but called it "The Shift From Data Pipelines to Data Products". If anyone is interested, I believe it goes in-depth into what you wish for: https://airbyte.com/blog/data-orchestration-trends.
Thanks! One thing I have to admit, though, is I still haven't fully gotten my head around what all of this orchestration / data mesh / data as a product / software defined asset stuff actually means. I get the abstract, high level idea of everything being more code-oriented, and the system being aware of more things, and automatically doing things, but I can never quite figure out how something like this actually works. As best I can tell, it seems like a DAG, but where each node is a function.
To me, it's the key to declarative pipelines, as you can declare a data asset/product without running anything. The SW-defined function is like a microservice, or as you said, just a function on a single asset (that can live independently). With more declarative metadata, the orchestrator will figure out the lineage, how to run, etc. The DAGs of jobs/tasks/ops don't go away; there will always be a need for scheduling something (with jobs/tasks). But if you have an ML model that produces a BigQuery table, you can define the upstream data sets it depends on, which might be created outside of your orchestrator by another team and don't need any DAG of their own. That will be a single function (or, as you call it, a SW-Defined Asset). Not sure if that makes more sense, but that's how I see it, and that's quite revolutionary.
The best thing: at the end, you get the actual data lineage of your physical assets, not an arbitrary lineage of tasks (which is interesting for engineers but not for data consumers).
That makes sense, but seems...fairly basic? Instead of saying "run these five scripts sequentially," isn't that just saying "run these five scripts, and be more explicit about what the script returns"? I can see how that has potential, but in and of itself it doesn't do much.
There is an interesting parallel with Integritie / Easybake here, though. In both of those tools, models had two parts: the query *and* the table DDL that defined the table's schema (with the CREATE TABLE and whatnot stripped out). I actually preferred this to dbt as well, because it made explicit what you were trying to create.
Is this more or less a "software defined asset" for dbt-type lineage? Where the asset each model creates is defined, as well as the "script" for how to create it?
(FWIW, I preferred that, because it kept the system more stable. But it wasn't a revolutionary improvement over what dbt does)
If you will, you can bring together different teams or tools from the modern data stack. How it's created is abstracted and figured out by Dagster; you only say what you want. I see it as similar to Kubernetes, where you define everything declaratively, and Kubernetes figures out how many pods to kill, etc., which you won't have in other tools.
I don't like to use "revolutionary" or "the best" or whatnot, but I think it's solving a significant pain. It's a bit hard to understand, but I believe it's the future. I also wrote about the shift to data products and a declarative approach in my article.
From what I can tell, that seems like the step 2 that doesn't yet exist, though. Like,
Step 1: Define a DAG in the normal way with a bunch of procedural tasks. But for each task, specify the structure of the output. That's cool (and what Integritie did).
Step 2: Now that the system knows the desired output, it can figure out the procedure. That's not possible without step 1, but you can definitely have step 1 without step 2. It seems like Dagster/software-defined assets does step 1 now, but step 2 is still aspirational.
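A sketch of that distinction, with made-up names rather than any real tool's format, showing what step 1 declares and where step 2 would have to slot in:

```python
# Hypothetical "step 1" model definition: the procedural part (the query) plus
# an explicit declaration of the table it should produce, Integritie-style.
model = {
    "name": "fct_revenue",
    "query": """
        select order_date, sum(amount) as revenue
        from stg_orders
        group by 1
    """,
    "schema": {                  # the declared shape of the output table
        "order_date": "date",
        "revenue": "numeric",
    },
}

# "Step 2" would be the aspirational part: given only the declared output,
# the system derives the procedure itself. This stub only marks the gap.
def derive_query_from_schema(schema):
    raise NotImplementedError("deriving the procedure from the desired output")
```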
Absolutely fantastic read! Your insights on rethinking data orchestration are spot on. For those who find the intricacies of orchestrators daunting, I highly recommend checking out our guide on orchestrators in data warehouses here:
https://dlthub.com/blog/first-data-warehouse
In our exploration, we delve into various orchestrators that aim to simplify the complexities you’ve highlighted. It's interesting to see the parallels between your suggested pull-based approach and our focus on managed solutions to alleviate the burdens of orchestration. The conversation around evolving these systems to be more efficient and user-friendly is incredibly timely. Kudos to you for shining a spotlight on this crucial topic!
Best,
Aman Gupta,
DLT Team.
This is a great read, thank you for sharing the info. We also write about metrics and tech; you can check out one of our articles here: https://www.metridev.com/metrics/quality-gates-everything-you-need-to-know/