40 Comments

I am a pretendgineer (although I like to think I do a good job lol).

The way we work at my company is that a pretendgineer will build something useful (but not "optimized" or "efficient" or whatever) and then once data engineering sees that other people find it valuable, they will formally build a model themselves.

That seems to be a decent way to get over data engineering's hesitancy about taking on projects they don't totally understand and/or see the demand for, while not making the business beholden to convincing a couple people that something is a good idea.

Definitely not a perfect system, but I think it's good to have a bias toward building.

imo it's better to be a pretendgineer than a data analyst begging for support

What about a data analyst solving the problem in a really inelegant manner? Many of us get the job done, however unpretty. Then again, perhaps we have simply been pretendgineers forever. Remember, back in the day, it was quite rare for guys in finance or insurance to have a title of "engineer," even if they had PhDs in engineering.

I think this is what makes this tough: I agree that the bias towards building is probably good, but most dbt models don't really get rebuilt in a meaningful way.

Building crude apps and ETL-type pipelines (like the Slackbot I mentioned in the post) is probably fine, because they can serve as prototypes. If they become important, then yeah, they can get rebuilt properly.

For dbt models though, the "production engineers" are often the analysts or analytics engineers themselves. There's nobody coming back to rearchitect that stuff. So if things don't get built well on day 1 - partly because they're sometimes prototypes, but more because the people building them don't really know how to build it well - they stay that way. (That's not a criticism of those people, or really of dbt; it's just the general state of maturity of the space, I'd say.)

maybe there should be a dedicated "optimization engineer" role focused on revisiting existing models

Definitely agree on the maturity of the space. We are currently working on building "best practices" for dbt and there isn't much material to reference

I think we get there though. Though I'm generally skeptical that people get better at it through trainings and stuff; my bet on these things is almost always a pattern like this:

1. Something new comes out (dbt, Slack, cars, cell phones, anything)

2. We do a lousy job of using it and lots of people struggle.

3. We say "you need to learn how to use this better!" A few people do, but most people keep struggling.

4. Someone makes a better version of the thing that is easier to use based on what people learned.

5. Now, everyone is good at it.

That would be nice!

What immediately came to mind is Excel and I don't think it’s followed this pattern, but maybe it’s as good as such a sandbox application can be or there is an adoption issue with substitutable tech.

I do think even in other examples that do follow that pattern there is going to be a gap between the "better version" and the "best version" based on what the market demands. Our phones are great, but the best version of a phone has not been built. Same with Slack. dbt could be a case where incremental efficiency isn't valued enough to overcome the cost of switching to something better...we will see!

Great post as always Benn!

I think you’re exactly right. We don’t have a Django or Ruby on Rails equivalent. We only have a … HTTPServer?

Modeling techniques like DataVault, Kimball, ActivitySchema aren’t it. They only describe the structure of the tables at the start or end of the pipeline. We need something that describes how to organize the code in the middle. And then a library that takes all the boilerplate out of doing it that way.

I hope that a DHH kind of white knight comes along and figures this all out. I’m honestly not holding my breath though - I think data pipelines are just inherently messy and imposing too much structure just doesn’t work. With a web app you’re always going to load your model before rendering a view; with a data pipeline there are no guarantees about the “best” order of operations.

I would love someone to prove me wrong though. The lack of a shared design framework is the #1 reason that large dbt projects invariably regress into shanty towns IMHO.

“The problem is not one of technical knowledge but of organization. You know how to write the code but not where to put it.”

https://www.poodr.com/

Thanks! And yeah, I don’t think anyone will come along and “solve” this or anything; I think you’re right that there is no “best” way to do it. But, I’d guess that some people have figured out some useful guardrails that they apply to their internal projects, and that’s what we could all benefit from. It’s not necessarily ideal, but it’s just another layer of best practices, with policies that enforce those best practices baked in. That may not solve things in some mathematical way, but it should hopefully make it a lot easier to organize the stuff in the middle relatively well.

Benn great post - enjoyed this. Rise of the data engineer, analytics engineer, data platform engineer, data product manager - this is the best title yet so kudos

This blog is nothing if not attempted clickbait.

This is so true. I once built a Monte Carlo simulation inputting all this stuff from actuaries. I pretty much abstracted off their actuarial macros, so I called their variables and made it work, and the file, before adding data, was six megabytes, with maybe 10-15% of it mine. This thing took five to thirty minutes to run depending on the data inputs.

I went to another firm and this guy with a PhD re-wrote it. His was like half a megabyte, ran in less than a second and he had all of these super cool transforms that were way out of my league. My history degree did not prepare me for my career as a data analyst. That guy was awesome, but I cannot imagine a world where every analyst has a PhD in math.

Over the last 25 years I have had multiple real programmers assure me that my stuff "should not work," but I got the right answers. In my early career I had access to hard-core hardware that saved me. Today, every computer is hard-core. And, no matter how bad my SQL queries are, they are still infinitely better than some beastly spreadsheet no one understands. At least with some kind of code one knows what the writer did. I frequently find spreadsheets with a Word document describing the process of how to make the spreadsheet work. My favorite had data with column headings that referred to the name of the officer instead of his role, except that the names were all twenty years out of date. It took a long time to figure that one out, so I replaced it with a well-notated SQL script, which will obviously piss off the next guy who actually knows what he is doing.

Yeah, I think that's exactly the magic - and the problem - with dbt: It's sort of like Excel for data transformations, where you don't need anything close to a math PhD to use it. Which is mostly good, but sometimes means stuff that could run in less than a second runs in thirty minutes.

Which, is that better than not having something run at all? I'd say the answer is probably yes, but not *obviously* yes.

I have had similar thoughts over the years around data modeling. One interesting thought came to me recently after working on some old XML data. The XML came with schema info from the origin system - and it had about 800 data points per parent object all crammed together with various nestings.

What if data sharing was a first party concern from apps? What if Snowflake “sharing” was expected? You still have a problem modeling the next layer but at least you have a better starting place.

Meaning, what if you had a CRM or marketing tool or whatever, and the tool was built with exporting data out of it into a queryable place natively, as opposed to it needing to be pulled through a bunch of API backdoors?

Yes - that.

Word, yeah, we did a version of this at Mode (https://mode.com/developer/discovery-database/introduction/), and I wrote a bit about something like this a couple years ago (https://benn.substack.com/i/73615268/the-problem-is-better-solved-by-someone-else).

It doesn't seem to have taken off, and my guess is that it's because there's actually a mismatch between what customers want and what vendors want. Like, with Mode's version of this, we wanted to provide what we thought was useful, which often ended up being a fairly narrow dataset. But people want (or think they want) everything, raw. Basically, people want the modeled data, but they seem to want to model it themselves.

Ya - would be interesting to give customers some options… like raw data + some opinionated modeled layer that was easily modifiable by the customer.

As far as BI tools go, I have always wanted more insight into user interaction and usage of the tool - wonder why that's always a secondary or tertiary concern. For the technical team it should be a primary concern, just like product analytics matters to product teams.

Yeah, I get that. Having been someone who could've given customers that data, it's honestly hard to justify doing it though. Giving granular data is always somewhat of a risk, because it's not going to always be perfectly accurate or will be confusing (eg, what counts as a report view), and people will nitpick it to death. You end up fielding so many tickets that are things like "I'm sure I looked at this dashboard but this data says I didn't so I don't trust anything you do anymore." It's a lot of headache for not that much upside.

More cynically, giving people that sort of data also means you lose some control of the narrative. If people don't have it, you can put together really positive presentations about how much they use the product and how valuable it is and all of that (people will say they aren't spinning it but are just interpreting it in the way that they've found most useful, but they're spinning it). Give people data, and they can start to come to their own conclusions.

Ya - and at this point nobody doesn’t buy or doesn’t renew because of lack of good metadata.

Sounds much like prompt engineering! The more tech gets abstracted, the more noise gets added relative to substance, and then we "reinvent" things to do it the first-principles way! I guess that's how this industry stays relevant!! Great read as always.

Thanks! And yeah, there definitely seem like cycles to all of this. Make it good, put layers on top, make the layers complicated, break it down again, round and round we go.

I started on web apps back in the day when we wrote our own database calls and Ruby on Rails was a revelation....and I definitely thought dbt was an exact correlation for data work (it does all the hard parts). We definitely saw the community step into the web app space and build some great scaffolding options so hoping the dbt community does the same! Perhaps dbt Cloud can have a toggle to run in "opinionated" mode and it will build out the project structure according to a set of best practices?

Yeah, I think that's what I would want, be it in dbt directly or something that someone builds/open sources on top of it. And to your point, you could say that dbt itself is already that opinionated layer (ie, dbt is to SQL as Rails is to Ruby). Which, I think is kinda right - but if it is, we might need another layer on top.

This is to me the real essential thing: in the end it is not 'opinionated layer or not'; to some extent everything is an opinionated layer. From a user perspective you have the 'raw' / default options, like SQL, or Python or R, but you can already argue that these are opinionated layers. Then you have a set of layers (dbt, pandas, base R). These in turn can be layered (or mutated to become more opinionated). And when the 'opinion becomes too controversial' they can be switched (polars, tidyverse).

A funny example in this regard is plotnine. It is a Python datavis package (layer) that has the same syntax as ggplot2, a famous R datavis package. Of course the Python / R code beneath is often also a layer over C++, so in theory the same syntax could run via different interpreters and end up as the same zeros and ones.

Fair, I don’t disagree that everything here has some form of opinion in it. But those opinions address different things, and I think that’s where we haven’t found the balance in this space yet.

As in, dbt has fairly strong opinions about SQL, about Python over R, about data models primarily being defined as a latticework of tables vs one big conceptual map, and so on. But I’d argue that it *doesn’t* have opinions about how you should architect that latticework. But doing that well is hard, and most of us do it pretty poorly.

That’s not dbt’s fault per se, though dbt bears the consequences of it, since dbt’s customers are going to blame the problems with their Rube Goldberg dbt project on dbt, even if they built the project. And that’s the opportunity, I think: to see if there’s some way to inject some opinion somewhere that helps us build that latticework better.

Yeah, now I understand it better. It is not just abstractions in layers on top of each other, but you can have layers in different directions and dbt should add an extra layer. So more boxes of abstractions than layers.

Something like that. And opinions about how you should use the abstractions.

Typo

"When I was stumbling my way through building a Django app, I constantly had questions about how to add new things to it, and I had no idea how to fit them into Django’s conventions. Beasue it’s 2024, "

Ah, good catch, thanks!

Thanks for the insights on non-Microsoft Power BI tools. I was an early Ruby on Rails dev and loved the framework. The hard things about programming are naming things and organizing, and RoR did that. I was an early fan of Rich Hickey, who is a genius programmer and a fun speaker. His Clojure language remains the best in my opinion.

I’m sure there’s some blog post out there about how half of all engineering problems are naming problems, and I’m sure it’s amazing.

So we gonna build it then? Best Practice Analyzer for dbt. In my head it already starts looking like Clippy.

Not at all. I don’t think it’s some sort of linter type thing that says “check this and tell me what I’ve done wrong.” It’s more something that imposes constraints on how you can use dbt. I’d imagine that there are people out there who’ve done this, and use dbt in what seems like a bit of an unorthodox way (they use macros all the time for stuff, or they’ve arranged their model folders in a very particular way, or they use some sort of model naming convention that automatically prevents intertwining logical layers of their project.)

For example, back in the day, Facebook had this internal dbt-like thing that people could use to create tables. They did, and people created a huge hot mess with it. It was really useful, but very quickly became unwieldy, because anyone could tack on another layer - “I’m gonna take random table A and random table B and make new table C.” And then people might use C, and so on.

dbt lets you do this, and people do. They start with a well-designed architecture, but over time, a lot of projects (at least from what I’ve seen) tend to grow a lot of this kind of hair. But you could imagine a framework that sits on top of dbt that makes this really hard to do, even with something as simple as giving tables types. If you want to make a table of type “terminus,” you cannot reference it in another model. Or if you want to make a new table, you have to explicitly mark it as “production” vs “scratch;” and scratch tables cannot be used in anything that’s upstream of production.
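To make those two made-up rules concrete, here is a minimal Python sketch. None of this is dbt's API: the `Model` class, the `stage` and `terminus` fields, and `check_project` are hypothetical names, and a real version would presumably sit inside the build itself and refuse to run rather than report after the fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A hypothetical stand-in for a model in the project (not a dbt class)."""
    name: str
    stage: str              # assumed to be either "production" or "scratch"
    refs: tuple = ()        # names of the models this model selects from
    terminus: bool = False  # assumed flag: a terminus model may not be referenced


def check_project(models):
    """Return a list of rule violations; an empty list means the project is allowed."""
    by_name = {m.name: m for m in models}
    errors = []
    for model in models:
        for ref in model.refs:
            parent = by_name.get(ref)
            if parent is None:
                errors.append(f"{model.name}: unknown model '{ref}'")
                continue
            # Rule 1: nothing may be built on top of a terminus model.
            if parent.terminus:
                errors.append(f"{model.name}: '{ref}' is a terminus model")
            # Rule 2: production may not read from scratch. Checking direct
            # edges is enough: any path from a scratch model to a production
            # model has to cross a scratch -> production edge somewhere.
            if parent.stage == "scratch" and model.stage == "production":
                errors.append(f"{model.name}: production cannot use scratch '{ref}'")
    return errors


project = [
    Model("stg_orders", "production"),
    Model("tmp_experiment", "scratch", refs=("stg_orders",)),
    Model("revenue_report", "production", refs=("stg_orders",), terminus=True),
    Model("bad_rollup", "production", refs=("revenue_report", "tmp_experiment")),
]

for problem in check_project(project):
    print(problem)
```

Here only `bad_rollup` breaks the rules, once for building on a terminus model and once for pulling a scratch table into production.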

I’m making all this up, but this is the sort of thing I think we need. It’s not a lineage checker where we have to find the problems and then re-arrange something according to best practices; it’s roads we’re not really allowed to drive off of.

Yeah, I was thinking along those lines. Like checking if your new table already exists in the project with a different name, or if it has like a 90% match (how you figure that match, I don't know) ...I haven't played with dbt mesh yet, but giving that visibility over multiple projects sounds like it could make it easier. Also sounds like an education piece and also a (dare I say the G word) governance piece. I feel as though some of this could/should sit outside of dbt too. Like if you were using dbt with Databricks, then a lot of that observability comes via Unity Catalog. Anyways, it sounds like a great idea and I have opinions on what sort of things such a framework would monitor, but zero idea how best to begin implementing it.
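One rough way to put a number on that "90% match," purely as an assumption: compare the sets of column names and flag any existing table whose overlap (Jaccard similarity) with the new one clears a threshold. The function names and the sample tables below are hypothetical, and column overlap is only one possible proxy; it says nothing about whether the rows or the logic actually match.

```python
def column_overlap(a, b):
    """Jaccard similarity of two collections of column names, ignoring case."""
    a = {col.lower() for col in a}
    b = {col.lower() for col in b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def likely_duplicates(new_columns, existing, threshold=0.9):
    """Return (table_name, score) pairs at or above the threshold, best match first.

    `existing` maps table names to their column names; the 0.9 default
    mirrors the "90% match" idea and is otherwise arbitrary.
    """
    scored = [(name, column_overlap(new_columns, cols)) for name, cols in existing.items()]
    matches = [(name, score) for name, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical tables already in the project.
existing_tables = {
    "orders_daily": ["order_id", "order_date", "amount"],
    "users": ["user_id", "email"],
}

print(likely_duplicates(["ORDER_ID", "order_date", "amount"], existing_tables))
```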

I'd go a good bit further than that, actually. Finding things like duplicates and naming collisions is good, but that's about helping people avoid problems they already know they need to avoid. It's lane assist - don't drift out of the lane! Which is useful, sure, but it doesn't really make me a better driver; it just helps me do something I want to do but sometimes can't.

I think we need something that instructs people on how to do things they don't know how to do. Stuff like requiring table types and not allowing you to intermingle them (or whatever) forces a structure that I may not have ever thought of on my own. It's putting lanes on the road where there aren't any. That's what would make me a better driver.

(But, fair, I don't know exactly what that would look like. Which is why I'd imagine something like this would come from someone who figured out a bunch of these things and applies them to some package-like thing on top of dbt.)
