27 Comments

The most brilliant article ever. Consider me a disciple.

I do have a quibble with the title, though. You are (rightly!) calling for pull-based DAG execution instead of push-based execution (as in dbt run). The DAG itself is still invaluable; the problem is the push (vs. pull) orchestration model.


Fair: https://twitter.com/bennstancil/status/1558125437909352448

As Ernest said, the title is a bit of artistic liberty. The DAG still needs to exist, but as an under-the-hood thing, not as the primary interface.

I had a bit that I cut from the final piece about how we tend to ship technical designs to users. If some problem needs a particular skeleton to solve it (e.g., a DAG), we make an interface for that skeleton. In some cases, that works; in other cases, it's just confusing. DAGs are the latter for me. (I'd argue we also do this with OLAP cubes: https://benn.substack.com/p/ghosts-in-the-data-stack)


I think the point is that we shouldn't treat the DAG as the primary abstraction. Rather, I specify what I care about for this particular model, and the system is empowered to create ad hoc DAGs as needed to generate that data. Very much the way make decides which C source files to recompile.
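For illustration, here's a toy Python sketch of that make-style pull model (the table names and the rebuild step are all invented):

import time

deps = {                       # hypothetical dependency graph: target -> upstreams
    "raw_orders": [],
    "orders_daily": ["raw_orders"],
    "revenue_report": ["orders_daily"],
}
last_built = {}                # target -> timestamp of last successful build

def run_query_for(target):
    print(f"rebuilding {target}")      # stand-in for running the model's SQL

def build(target):
    # Pull: bring upstreams up to date first, then rebuild this target only if
    # it is older than its newest upstream (make's timestamp comparison).
    for upstream in deps[target]:
        build(upstream)
    newest_upstream = max((last_built.get(u, 0) for u in deps[target]), default=0)
    if last_built.get(target, 0) <= newest_upstream:
        run_query_for(target)
        last_built[target] = time.time()

build("revenue_report")        # rebuilds only what's stale on the path to the report

Nothing here is scheduled; asking for the report is what triggers the (minimal) work.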


We are all saying the same thing. The DAG represents dependencies, not executions. We are just using a different decision / timing system to decide when to execute each arc.


I agree, but I think the big shift Benn is pitching is for users to NOT need to be aware of (or work around) the DAG, but for the DAG to dynamically reconfigure to support user expectations.


I am not sure I follow. The dependencies are static, are they not? This means the DAG is also static. The execution of individual arcs in the DAG is the dynamically determined part, not the DAG itself, no? The system does not create dependencies on the fly. The way I see it, the centrality of the DAG as a representation of dependency and lineage and data flow is not diminished. What am I missing?


Ah, you hit the nail on the head. Yes, the DAG in theory is just a list of dependencies. However, in practice it does NOT contain information about the desired latency/freshness/deadline for each component, and thus the whole thing ends up being rerun en masse.

The call is to annotate the DAG with this additional metadata, so the system can intelligently determine the “minimal subset” that must be run to satisfy the user's desires.
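For what it's worth, here's a toy Python sketch of what I mean; every table name, threshold, and function is invented:

import time

deps = {"orders_daily": [], "exec_dashboard": ["orders_daily"]}
max_staleness = {"orders_daily": 24 * 3600, "exec_dashboard": 1 * 3600}   # guarantees, in seconds
last_refreshed = {"orders_daily": time.time() - 2 * 3600,
                  "exec_dashboard": time.time() - 2 * 3600}

def needs_refresh(table, now):
    return now - last_refreshed.get(table, 0) > max_staleness[table]

def refresh(table, now):
    # Only pull in upstreams that are themselves stale; everything else is reused as-is.
    for upstream in deps[table]:
        if needs_refresh(upstream, now):
            refresh(upstream, now)
    print(f"refreshing {table}")       # stand-in for rerunning the model
    last_refreshed[table] = now

now = time.time()
stale = [t for t in max_staleness if needs_refresh(t, now)]   # the "minimal subset"
for table in stale:
    refresh(table, now)                # here, only exec_dashboard runs; orders_daily is fresh enough

With the guarantees annotated, only the dashboard that has missed its one-hour deadline gets rebuilt; the daily table is left alone.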


Sounds like you are proposing a declarative approach to data orchestration, which would be similar to how Kubernetes approaches container orchestration: you declare a desired state and the system figures out how to get to that desired state.


Basically, yeah (though Kubernetes is the most confusing thing in the world to me, so I can't say if this and that are the same).


Dagster does this (and describes it with the same wording: Software-Defined Assets). K8s is also the most confusing thing in the world to me.


I've heard they have something similar. I need to look into it, clearly.


I am definitely on board with this. Intuitively I’ve felt this way for quite some time, but had to read your brilliance to articulate it.

I wish more people in data engineering had bash and C skill sets, or at least more Java. Plenty of patterns there for some transfer learning.

I still haven’t used dbt, but I’m pretty sure the previous engineer, in the Oracle pattern he built (a codebase I now own), wrote his own dbt with just SQL and bash. With a lot more steps… but still. Impressive.


There are definitely lots of internal dbts (and internal Fivetrans, and internal Modes, and internal Segments, and...) floating around out there.

I'm not sure how you'd do it, but I bet it would be interesting to see what people came up with for those internal tools that vendors haven't.


I have some good news. A declarative pipeline approach like the one you’re describing already exists. It’s very popular and has a lot of features.

https://youtu.be/pRGNxIz6GzU


Yeah, Dagster is the closest thing that I know of, though it's not quite exactly this (and layers on a bunch of other orchestration functionality/complexity that wouldn't strictly be needed for just a scheduler).


I like the idea of working backwards, and the analogy of how a passenger only cares about the departure/arrival time makes a lot of sense! I think this framework works great for scheduled, regular jobs.

But out of curiosity Benn, how did/would your reverse orchestration system deal with ad-hoc data dumps from production databases?


I'm not sure I follow... what's an ad hoc data dump from a production database?


Maybe I can share an example here - we have a table (that's built incrementally with daily data) that has a column with weird responses because of a buggy web scraper. If I would like to run a one-time job to rebuild that day's data to fix that problem so that I can work on that now, rather than waiting till tomorrow, how would I do that in your system?


Ah, gotcha. In both Integritie and Easybake, we had ways to "update this table now," which basically just ran the job and upstream jobs (if necessary). It'd operate the same as if you set the table's latency guarantee to some value less than however stale it was currently.

Removing the DAG oriented schedule didn't mean you couldn't do things manually as needed; it just meant the automatic updates happened in a different way.
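To put it in toy code (all names invented), the manual path is just a demand for zero staleness on that one table:

import time

last_refreshed = {"orders_daily": time.time() - 6 * 3600}   # invented example state

def update_now(table):
    # Equivalent to momentarily setting the table's guarantee below its current
    # staleness: the table (and, in the real thing, any stale upstreams) rebuilds now.
    if time.time() - last_refreshed[table] > 0:
        print(f"rebuilding {table}")                         # stand-in for running the job
        last_refreshed[table] = time.time()

update_now("orders_daily")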


Excellent article, as always, telling the perfect story. I fully agree and wrote about the same but called it "The Shift From Data Pipelines to Data Products". If anyone is interested, I believe it goes in-depth into what you wish for: https://airbyte.com/blog/data-orchestration-trends.


Thanks! One thing I have to admit, though, is I still haven't fully gotten my head around what all of this orchestration / data mesh / data as a product / software-defined asset stuff actually means. I get the abstract, high-level idea of everything being more code-oriented, and the system being aware of more things, and automatically doing things, but I can never quite figure out how something like this actually works. As best I can tell, it seems like a DAG, but each node is a function.


To me, it's the key to declarative pipelines, as you can declare a data asset/product without running anything. The software-defined function is like a microservice, or as you said, just the function on a single asset (that can live independently). With more declarative metadata, the orchestrator will figure out the lineage, how to run, etc. The DAGs such as jobs/tasks/ops don't go away; there will always be a need for scheduling something (with jobs/tasks). But if you have an ML model that produces a BigQuery table, you can declare its upstream data sets, which might be created outside of your orchestrator by another team, without needing any DAG. That will be a single function (as you call it), or a software-defined asset. Not sure if that makes more sense, but that's how I see it, and that's quite revolutionary.

Best of all, at the end you get the actual data lineage of your physical assets, not an arbitrary lineage of tasks (which is interesting for engineers but not for data consumers).
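For a concrete picture, a software-defined asset in Dagster looks roughly like this (a minimal sketch from memory; the table names and logic are made up):

from dagster import asset

@asset
def raw_orders():
    # stand-in for an extract step
    return [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 10.0}]

@asset
def orders_cleaned(raw_orders):        # upstream dependency inferred from the parameter name
    seen, cleaned = set(), []
    for row in raw_orders:             # trivial dedupe as a stand-in transform
        if row["id"] not in seen:
            seen.add(row["id"])
            cleaned.append(row)
    return cleaned

@asset
def revenue_report(orders_cleaned):    # lineage comes from these declarations, not a task list
    return sum(row["amount"] for row in orders_cleaned)

You declare the assets and what they depend on; the orchestrator derives the graph and decides what to materialize.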


That makes sense, but seems...fairly basic? Instead of saying "run these five scripts sequentially," isn't that just saying "run these five scripts, and be more explicit about what each script returns"? I can see how that has potential, but in and of itself it doesn't do much.

There is an interesting parallel with Integritie / Easybake here, though. In both of those tools, models had two parts: the query *and* the table DDL that defined the table's schema (with the CREATE TABLE and whatnot stripped out). I actually preferred this to dbt as well, because it made explicit what you were trying to create.

Is this more or less a "software defined asset" for dbt-type lineage? Where the asset each model creates is defined, as well as the "script" for how to create it?

(FWIW, I preferred that, because it kept the system more stable. But it wasn't a revolutionary improvement over what dbt does)
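For readers who never saw those tools, here's a loose, invented sketch of that two-part shape (not the actual Integritie/Easybake format):

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    schema_ddl: str    # column definitions only; the CREATE TABLE wrapper is implied
    query: str

orders_daily = Model(
    name="orders_daily",
    schema_ddl="order_date date, order_count bigint, revenue numeric",
    query="select order_date, count(*), sum(amount) from raw_orders group by 1",
)

def run_sql(sql):
    print(sql)         # stand-in for executing against the warehouse

def build(model):
    # The declared schema, not just the query, defines what the model is supposed to produce.
    run_sql(f"create table {model.name} ({model.schema_ddl})")
    run_sql(f"insert into {model.name} {model.query}")

build(orders_daily)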


If you will, you can bring together different teams or tools from the modern data stack. How it's created is abstracted away and figured out by Dagster; you only say what you want. I see it as similar to Kubernetes, where you define everything declaratively and Kubernetes figures out how many pods to kill, etc., which you won't get in other tools.

I don't like to say revolutionary, or the best, or whatnot, but I think it's solving a significant pain. It's a bit hard to understand, but I believe it's the future. I also wrote about the shift to data products and a declarative approach in my article.


From what I can tell, it seems like it's step 2 that doesn't yet exist, though. Like,

Step 1: Define a DAG in the normal way with a bunch of procedural tasks. But for each task, specify the structure of the output. That's cool (and what Integritie did).

Step 2: Now the system knows the desired output and can figure out the procedure itself. That's not possible without step 1, but you can definitely have step 1 without step 2. It seems like Dagster/software-defined assets do step 1 now, but step 2 is still aspirational.


Absolutely fantastic read! Your insights on rethinking data orchestration are spot on. For those who find the intricacies of orchestrators daunting, I highly recommend checking out our guide on orchestrators in data warehouses here:

https://dlthub.com/blog/first-data-warehouse

In our exploration, we delve into various orchestrators that aim to simplify the complexities you’ve highlighted. It's interesting to see the parallels between your suggested pull-based approach and our focus on managed solutions to alleviate the burdens of orchestration. The conversation around evolving these systems to be more efficient and user-friendly is incredibly timely. Kudos to you for shining a spotlight on this crucial topic!

Best,

Aman Gupta,

DLT Team.


This is a great read; thank you for sharing the info. We also write about metrics and tech; you can check out one of our articles here: https://www.metridev.com/metrics/quality-gates-everything-you-need-to-know/
