The most brilliant article ever. Consider me a disciple.
I do have a quibble with the title though. You are (rightly!) calling for pull-based DAG execution instead of push execution (as in dbt run). The DAG itself is still invaluable; the problem is the push (vs. pull) orchestration model.
Fair: https://twitter.com/bennstancil/status/1558125437909352448
As Ernest said, the title is a bit of an artistic liberty. The DAG still needs to exist, but as an under-the-hood thing, not as the primary interface.
I had a bit that I cut from the final piece about how we tend to ship technical designs to users. If some problem needs a particular skeleton to solve it (e.g., a DAG), we make an interface for that skeleton. In some cases, that works; in other cases, it's just confusing. DAGs are the latter for me. (I'd argue we also do this with OLAP cubes: https://benn.substack.com/p/ghosts-in-the-data-stack)
I think the point is that we shouldn’t treat the DAG as the primary abstraction. Rather, I specify what I care about for this particular model, and the system is empowered to create ad hoc DAGs as needed to generate that data. Very much the way Makefiles determine which source files to recompile.
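To make the Makefile analogy concrete, here's a toy Python version of make's rule (the file names are made up, and this isn't how make is actually implemented, just the shape of the check):

```python
import os

# A toy version of make's pull-based rule (illustrative file names): rebuild a
# target only if it is missing or older than any of its dependencies.
def needs_rebuild(target, dependencies):
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

if needs_rebuild("app", ["main.c", "utils.c"]):
    print("recompile: a source file changed since the last build")
else:
    print("up to date: nothing to do")
```

Nothing runs on a schedule; the work is pulled into existence only when someone asks for the target and it's out of date.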
We are all saying the same thing. The DAG represents dependencies, not executions. We are just using a different decision / timing system to decide when to execute each arc.
I agree, but I think the big shift Benn is pitching is for users NOT to need to be aware of (or work around) the DAG, but for the DAG to dynamically reconfigure to support user expectations.
I am not sure I follow. The dependencies are static, are they not? This means the DAG is also static. The execution of individual arcs in the DAG is the dynamically determined part, not the DAG itself, no? The system does not create dependencies on the fly. The way I see it, the centrality of the DAG as a representation of dependency, lineage, and data flow is not diminished. What am I missing?
Ah, you hit the nail on the head. Yes, the DAG in theory is just a list of dependencies. However, in practice it does NOT contain information about the desired latency/freshness/deadline for each component, and thus the whole thing ends up being rerun en masse.
The call is to annotate the DAG with this additional metadata, so the system can intelligently determine the “minimal subset” that must be run to satisfy the user's desires.
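As a minimal sketch of that idea, assuming hypothetical per-table staleness annotations (none of this is any real tool's API):

```python
from datetime import datetime, timedelta

# A hypothetical freshness-annotated DAG: each node declares its upstream
# dependencies and a maximum acceptable staleness.
DAG = {
    "raw_orders":  {"deps": [],             "max_staleness": timedelta(hours=1)},
    "stg_orders":  {"deps": ["raw_orders"], "max_staleness": timedelta(hours=4)},
    "fct_revenue": {"deps": ["stg_orders"], "max_staleness": timedelta(hours=24)},
}

# When each table was last materialized (illustrative timestamps).
last_built = {
    "raw_orders":  datetime(2022, 8, 15, 12, 30),
    "stg_orders":  datetime(2022, 8, 15, 6, 0),
    "fct_revenue": datetime(2022, 8, 14, 12, 0),
}

def nodes_to_refresh(requested, now):
    """Walk upstream from the requested table and return only the nodes whose
    staleness exceeds their declared guarantee: the 'minimal subset'."""
    stale, stack = set(), [requested]
    while stack:
        node = stack.pop()
        if now - last_built[node] > DAG[node]["max_staleness"]:
            stale.add(node)
        stack.extend(DAG[node]["deps"])
    return stale

# raw_orders is only 30 minutes stale, so just the two downstream tables rerun.
print(sorted(nodes_to_refresh("fct_revenue", datetime(2022, 8, 15, 13, 0))))
# ['fct_revenue', 'stg_orders']
```

The point is that only the tables whose guarantees are violated get rebuilt, instead of the whole DAG being rerun en masse.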
Sounds like you are proposing a declarative approach to data orchestration, which would be similar to how Kubernetes approaches container orchestration: you declare a desired state and the system figures out how to get to that desired state.
Basically, yeah (though Kubernetes is the most confusing thing in the world to me, so I can't say if this and that are the same).
Dagster does this (and describes it with the same wording, Software-Defined Assets). K8s is also the most confusing thing in the world to me.
I've heard they have something similar. I need to look into it, clearly.
I am definitely on board with this. Intuitively I’ve felt this way for quite some time, but had to read your brilliance to articulate it.
I wish more people in data engineering had bash and C skill sets, or at least more Java. Plenty of patterns there for some transfer learning.
I still haven’t used dbt, but I’m pretty sure the previous engineer who built the Oracle pattern (a code base I now own) wrote his own dbt with just SQL and bash. With a lot more steps… but still. Impressive.
There are definitely lots of internal dbts (and internal Fivetrans, and internal Modes, and internal Segments, and...) floating around out there.
I'm not sure how you'd do it, but I bet it'd be interesting to see what people came up with for those internal tools that vendors haven't.
I have some good news. A declarative pipeline approach like the one you’re describing already exists. It’s very popular and has a lot of features.
https://youtu.be/pRGNxIz6GzU
Yeah, Dagster is the closest thing that I know of, though it's not quite exactly this (and layers on a bunch of other orchestration functionality/complexity that wouldn't strictly be needed for just a scheduler).
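For anyone curious what that looks like in practice, here's a rough sketch of Dagster's software-defined assets, based on my read of its documented `@asset` API (details vary by version, and the freshness/scheduling layer that sits on top is omitted here):

```python
from dagster import asset

# Each function declares the asset it produces; Dagster infers the dependency
# graph from the parameter names rather than from an explicit task DAG.
@asset
def raw_orders():
    # e.g., pull rows from a production replica (stubbed out here)
    return [{"id": 1, "amount": 42}]

@asset
def stg_orders(raw_orders):
    # cleaned/standardized version of the upstream asset
    return [row for row in raw_orders if row["amount"] > 0]

@asset
def fct_revenue(stg_orders):
    # aggregate the staged data into the table analysts actually query
    return sum(row["amount"] for row in stg_orders)
```

You declare the assets and their inputs; deciding when and what to materialize is the orchestrator's job.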
I like the idea of working backwards, and the analogy of how a passenger only cares about the departure/arrival time makes a lot of sense! I think this framework works great for scheduled, regular jobs.
But out of curiosity Benn, how did/would your reverse orchestration system deal with ad-hoc data dumps from production databases?
I'm not sure I follow... what's an ad hoc data dump from a production database?
Maybe I can share an example here - we have a table (that's built incrementally with daily data) that has a column with weird responses because of a buggy web scraper. If I would like to run a one-time job to rebuild that day's data to fix that problem so that I can work on that now, rather than waiting till tomorrow, how would I do that in your system?
Ah, gotcha. In both Integritie and Easybake, we had ways to "update this table now," which basically just ran the job and upstream jobs (if necessary). It'd operate the same as if you set the table's latency guarantee to some value less than however stale it was currently.
Removing the DAG-oriented schedule didn't mean you couldn't do things manually as needed; it just meant the automatic updates happened in a different way.
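A toy illustration of that equivalence (hypothetical names, not either tool's actual code):

```python
from datetime import datetime, timedelta

def needs_refresh(last_built, max_staleness, now):
    """A table is due for a rebuild once its staleness exceeds its guarantee."""
    return now - last_built > max_staleness

now = datetime(2022, 8, 15, 13, 0)
last_built = datetime(2022, 8, 15, 9, 0)                    # 4 hours stale

print(needs_refresh(last_built, timedelta(hours=24), now))  # False: within guarantee
# A manual "update this table now" behaves like temporarily shrinking the
# guarantee below the current staleness, so the same refresh logic kicks in:
print(needs_refresh(last_built, timedelta(0), now))         # True: rebuild it
```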
Excellent article, as always, telling the perfect story. I fully agree and wrote about the same but called it "The Shift From Data Pipelines to Data Products". If anyone is interested, I believe it goes in-depth into what you wish for: https://airbyte.com/blog/data-orchestration-trends.
Thanks! One thing I have to admit, though, is I still haven't fully gotten my head around what all of this orchestration / data mesh / data as a product / software defined asset stuff actually means. I get the abstract, high level idea of everything being more code-oriented, and the system being aware of more things, and automatically doing things, but I can never quite figure out how something like this actually works. As best I can tell, it seems like a DAG, but where each node is a function.
To me, it's the key to declarative pipelines, as you can declare a data asset/product without running anything. The SW-defined function is like a microservice, or as you said, just a function on a single asset (that can live independently). With more declarative metadata, the orchestrator will figure out the lineage, how to run, etc. The DAGs of jobs/tasks/ops don't go away; there will always be a need for scheduling something (with jobs/tasks). But if you have an ML model that produces a BigQuery table, you can define the upstream data sets it depends on, which might be created outside of your orchestrator by another team and don't need any DAG of their own. That will be a single function (or, as you call it, a SW-Defined Asset). Not sure if that makes more sense, but that's how I see it, and that's quite revolutionary.
The best thing: at the end, you get the actual data lineage of your physical assets, not an arbitrary lineage of tasks (which is interesting for engineers but not for data consumers).
That makes sense, but seems...fairly basic? Instead of saying "run these five scripts sequentially," isn't that just saying "run these five scripts, and be more explicit about what the script returns"? I can see how that has potential, but in and of itself it doesn't do much.
There is an interesting parallel with Integritie / Easybake here, though. In both of those tools, models had two parts: the query *and* the table DDL that defined the table's schema (with the CREATE TABLE and whatnot stripped out). I actually preferred this to dbt as well, because it made explicit what you were trying to create.
Is this more or less a "software defined asset" for dbt-type lineage? Where the asset each model creates is defined, as well as the "script" for how to create it?
(FWIW, I preferred that, because it kept the system more stable. But it wasn't a revolutionary improvement over what dbt does)
If you will, you can bring together different teams or tools from the modern data stack. How it's created is abstracted and figured out by Dagster; you only say what you want. I see it as similar to Kubernetes, where you define everything declaratively, and Kubernetes figures out how many pods to kill, etc., which you won't have in other tools.
I don't like to use "revolutionary" or "the best" or whatnot, but I think it's solving a significant pain. It's a bit hard to understand, but I believe it's the future. I also wrote about the shift to data products and a declarative approach in my article.
From what I can tell, that seems like the step 2 that doesn't yet exist, though. Like,
Step 1: Define a DAG in the normal way with a bunch of procedural tasks. But for each task, specify the structure of the output. That's cool (and what Integritie did).
Step 2: Now that the system knows the desired output, it can figure out the procedure. That's not possible without step 1, but you can definitely have step 1 without step 2. It seems like Dagster/software-defined assets does step 1 now, but step 2 is still aspirational.
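A sketch of that distinction, with made-up names rather than any real tool's format, showing what step 1 declares and where step 2 would have to slot in:

```python
# Hypothetical "step 1" model definition: the procedural part (the query) plus
# an explicit declaration of the table it should produce, Integritie-style.
model = {
    "name": "fct_revenue",
    "query": """
        select order_date, sum(amount) as revenue
        from stg_orders
        group by 1
    """,
    "schema": {                  # the declared shape of the output table
        "order_date": "date",
        "revenue": "numeric",
    },
}

# "Step 2" would be the aspirational part: given only the declared output,
# the system derives the procedure itself. This stub only marks the gap.
def derive_query_from_schema(schema):
    raise NotImplementedError("deriving the procedure from the desired output")
```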
Absolutely fantastic read! Your insights on rethinking data orchestration are spot on. For those who find the intricacies of orchestrators daunting, I highly recommend checking out our guide on orchestrators in data warehouses here:
https://dlthub.com/blog/first-data-warehouse
In our exploration, we delve into various orchestrators that aim to simplify the complexities you’ve highlighted. It's interesting to see the parallels between your suggested pull-based approach and our focus on managed solutions to alleviate the burdens of orchestration. The conversation around evolving these systems to be more efficient and user-friendly is incredibly timely. Kudos to you for shining a spotlight on this crucial topic!
Best,
Aman Gupta,
DLT Team.
This is a great read, thank you for sharing the info. We also write about metrics and tech; you can check out one of our articles here: https://www.metridev.com/metrics/quality-gates-everything-you-need-to-know/