26 Comments

I have had similar thoughts over the years around data modeling. One interesting thought came to me after working on some old XML data recently. The XML came with schema info from the origin system - and it had about 800 data points per parent object, all crammed together with various nestings.

What if data sharing was a first-party concern from apps? What if Snowflake “sharing” was expected? You still have a problem modeling the next layer, but at least you have a better starting place.

author

Meaning, what if you had a CRM or marketing tool or whatever, and the tool was built with exporting data out of it into a queryable place natively, as opposed to it needing to be pulled through a bunch of API backdoors?


Yes - that.

author
Jun 12 · edited Jun 14

Word, yeah, we did a version of this at Mode (https://mode.com/developer/discovery-database/introduction/), and I wrote a bit about something like this a couple years ago (https://benn.substack.com/i/73615268/the-problem-is-better-solved-by-someone-else).

It doesn't seem to have taken off, and my guess is that it's because there's actually a mismatch between what customers want and what vendors want. Like, with Mode's version of this, we wanted to provide what we thought was useful, which often ended up being a fairly narrow dataset. But people want (or think they want) everything, raw. Basically, people want the modeled data, but they seem to want to model it themselves.


Ya - would be interesting to give customers some options… like raw data + some opinionated modeled layer that was easily modifiable by the customer.

As far as BI tools I have always wanted more insight into user interaction and usage of the tool - wonder why that's always a secondary or tertiary concern. From the technical team it should be a primary concern just like product analytics matters to product teams.

author

Yeah, I get that. Having been someone who could've given customers that data, it's honestly hard to justify doing it though. Giving granular data is always somewhat of a risk, because it's not going to always be perfectly accurate or will be confusing (eg, what counts as a report view), and people will nitpick it to death. You end up fielding so many tickets that are things like "I'm sure I looked at this dashboard but this data says I didn't so I don't trust anything you do anymore." It's a lot of headache for not that much upside.

More cynically, giving people that sort of data also means you lose some control of the narrative. If people don't have it, you can put together really positive presentations about how much they use the product and how valuable it is and all of that (people will say they aren't spinning it but are just interpreting it in the way that they've found most useful, but they're spinning it). Give people data, and they can start to come to their own conclusions.

Jun 15 · edited Jun 15 · Liked by Benn Stancil

Ya - and at this point nobody doesn’t buy or doesn’t renew because of lack of good metadata.


Great post as always Benn!

I think you’re exactly right. We don’t have a Django or Ruby on Rails equivalent. We only have a … HTTPServer?

Modeling techniques like DataVault, Kimball, ActivitySchema aren’t it. They only describe the structure of the tables at the start or end of the pipeline. We need something that describes how to organize the code in the middle. And then a library that takes all the boilerplate out of doing it that way.

I hope that a DHH kind of white knight comes along and figures this all out. I’m honestly not holding my breath though - I think data pipelines are just inherently messy and imposing too much structure just doesn’t work. With a web app you’re always going to load your model before rendering a view; with a data pipeline there are no guarantees about the “best” order of operations.

I would love someone to prove me wrong though. The lack of a shared design framework is the #1 reason that large dbt projects invariably regress into shanty towns IMHO.


“The problem is not one of technical knowledge but of organization. You know how to write the code but not where to put it.”

https://www.poodr.com/

author

Thanks! And yeah, I don’t think anyone will come along and “solve” this or anything; I think you’re right that there is no “best” way to do it. But, I’d guess that some people have figured out some useful guardrails that they apply to their internal projects, and that’s what we could all benefit from. It’s not necessarily ideal, but it’s just another layer of best practices, with policies that enforce those best practices baked in. That may not solve things in some mathematical way, but it should hopefully make it a lot easier to organize the stuff in the middle relatively well.


Sounds much like prompt engineering! The more tech gets abstracted, the more noise gets added than substance, and then we "reinvent" things to make it the first principles way! I guess that's how this industry stays relevant!! Great read as always.

author

Thanks! And yeah, there definitely seem like cycles to all of this. Make it good, put layers on top, make the layers complicated, break it down again, round and round we go.

Jun 3 · edited Jun 3 · Liked by Benn Stancil

I started on web apps back in the day when we wrote our own database calls and Ruby on Rails was a revelation....and I definitely thought dbt was an exact correlation for data work (it does all the hard parts). We definitely saw the community step into the web app space and build some great scaffolding options so hoping the dbt community does the same! Perhaps dbt Cloud can have a toggle to run in "opinionated" mode and it will build out the project structure according to a set of best practices?

author

Yeah, I think that's what I would want, be it in dbt directly or something that someone builds/open sources on top of it. And to your point, you could say that dbt itself is already that opinionated layer (ie, dbt is to SQL as Rails is to Ruby). Which, I think is kinda right - but if it is, we might need another layer on top.

Jul 9 · Liked by Benn Stancil

This is to me the real essential thing: in the end it is not 'opinionated layer or not'; to some extent everything is an opinionated layer. From a user perspective you have the 'raw' / default options, like SQL, or Python or R, but you can already argue that these are opinionated layers. Then you have a set of layers (dbt, pandas, base R). These in turn can be layered (or mutated to become more opinionated). And when the 'opinion becomes too controversial' they can be switched (polars, tidyverse).

A funny example in this regard is plotnine. It is a Python data-vis package (layer) that has the same syntax as ggplot2, a famous R data-vis package. Of course the Python / R code beneath is also a layer, often for C++, so in theory the same syntax might be run via different interpreters and end up as the same zeros and ones.

author

Fair, I don’t disagree that everything here has some form of opinion in it. But those opinions address different things, and I think that’s where we haven’t found the balance in this space yet.

As in, dbt has fairly strong opinions about SQL, about Python over R, about data models primarily being defined as a latticework of tables vs one big conceptual map, and so on. But I’d argue that it *doesn’t* have opinions about how you should architect that latticework. But doing that well is hard, and most of us do it pretty poorly.

That’s not dbt’s fault per se, though dbt bears the consequences of it, since dbt’s customers are going to blame the problems with their Rube Goldberg dbt project on dbt, even if they built the project. And that’s the opportunity, I think: to see if there’s some way to inject some opinion somewhere that helps us build that latticework better.


Typo

"When I was stumbling my way through building a Django app, I constantly had questions about how to add new things to it, and I had no idea how to fit them into Django’s conventions. Beasue it’s 2024, "

author

Ah, good catch, thanks!

Jun 1 · Liked by Benn Stancil

Thanks for the insights on non-Microsoft Power BI tools. I was an early Ruby on Rails dev and loved the framework. The hard thing about programming is naming things and organizing, and RoR did that. I was an early fan of Rich Hickey, who is a genius programmer and fun speaker. His Clojure language remains the best in my opinion.

author

I’m sure there’s some blog post out there about how half of all engineering problems are naming problems, and I’m sure it’s amazing.


So we gonna build it then? Best Practice Analyzer for dbt. In my head it already starts looking like Clippy.

author

Not at all. I don’t think it’s some sort of linter type thing that says “check this and tell me what I’ve done wrong.” It’s more something that imposes constraints on how you can use dbt. I’d imagine that there are people out there who’ve done this, and use dbt in what seems like a bit of an unorthodox way (they use macros all the time for stuff, or they’ve arranged their model folders in a very particular way, or they use some sort of model naming convention that automatically prevents intertwining logical layers of their project.)

For example, back in the day, Facebook had this internal dbt-like thing that people could use to create tables. They did, and people created a huge hot mess with it. It was really useful, but very quickly became unwieldy, because anyone could tack on another layer - “I’m gonna take random table A and random table B and make new table C.” And then people might use C, and so on.

dbt lets you do this, and people do. They start with a well-designed architecture, but over time, a lot of projects (at least from what I’ve seen) tend to grow a lot of this kind of hair. But you could imagine a framework that sits on top of dbt that makes this really hard to do, even with something as simple as giving tables types. If you want to make a table of type “terminus,” you cannot reference it in another model. Or if you want to make a new table, you have to explicitly mark it as “production” vs “scratch;” and scratch tables cannot be used in anything that’s upstream of production.

I’m making all this up, but this is the sort of thing I think we need. It’s not a lineage checker where we have to find the problems and then re-arrange something according to best practices; it’s roads we’re not really allowed to drive off of.
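To make the table-type idea a bit more concrete, here is a minimal sketch in plain Python of the two rules described above: a "terminus" table cannot be referenced by another model, and a "production" table cannot select from a "scratch" table. None of this is dbt's API. `Model`, `check_project`, and the idea of folding all three types into a single `kind` field are assumptions made up for illustration; a real framework would presumably refuse to build the project rather than just report messages.

```python
from dataclasses import dataclass, field

# Hypothetical table types, borrowed from the examples in the comment above.
KINDS = {"production", "scratch", "terminus"}


@dataclass
class Model:
    name: str
    kind: str
    # Names of the models this one selects from (its direct parents).
    refs: list[str] = field(default_factory=list)


def check_project(models: list[Model]) -> list[str]:
    """Return one message per rule violation; an empty list means the project passes."""
    by_name = {m.name: m for m in models}
    errors = []
    for m in models:
        if m.kind not in KINDS:
            errors.append(f"{m.name}: unknown type '{m.kind}'")
        for ref in m.refs:
            parent = by_name.get(ref)
            if parent is None:
                errors.append(f"{m.name}: references unknown model '{ref}'")
            elif parent.kind == "terminus":
                # Rule 1: terminus tables are dead ends.
                errors.append(f"{m.name}: cannot reference terminus model '{ref}'")
            elif parent.kind == "scratch" and m.kind == "production":
                # Rule 2: scratch tables never feed production. Checking direct
                # references is enough here, because any chain from scratch to
                # production must cross a scratch -> production edge somewhere.
                errors.append(
                    f"{m.name}: production model cannot reference scratch model '{ref}'"
                )
    return errors


# A made-up project with two violations, both in the last model.
project = [
    Model("orders", "production"),
    Model("orders_experiment", "scratch", refs=["orders"]),
    Model("revenue_dashboard", "terminus", refs=["orders"]),
    Model(
        "revenue_by_region",
        "production",
        refs=["revenue_dashboard", "orders_experiment"],
    ),
]

for problem in check_project(project):
    print(problem)
```

The point of the sketch is that the rules are cheap to state once every table has to declare a type up front; the hard part is deciding which types and rules are actually worth enforcing.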


Yeah, I was thinking along those lines. Like checking if your new table already exists in the project with a different name, or if it has like a 90% match (how you figure that match, I don't know) ...I haven't played with dbt mesh yet, but giving that visibility over multiple projects sounds like it could make it easier. Also sounds like an education piece and also a (dare I say the G word) governance piece. I feel as though some of this could/should sit outside of dbt too. Like if you were using dbt with Databricks, then a lot of that observability comes via Unity Catalog. Anyways, it sounds like a great idea and I have opinions on what sort of things such a framework would monitor, but zero idea how best to begin implementing it.

author

I'd go a good bit further than that, actually. Finding things like duplicates and naming collisions is good, but that's about helping people avoid problems they already know they need to avoid. It's lane assist - don't drift out of the lane! Which is useful, sure, but it doesn't really make me a better driver; it just helps me do something I want to do but sometimes can't.

I think we need something that instructs people on how to do things they don't know how to do. Stuff like requiring table types and not allowing you to intermingle them (or whatever) forces a structure that I may not have ever thought of on my own. It's putting lanes on the road where there aren't any. That's what would make me a better driver.

(But, fair, I don't know exactly what that would look like. Which is why I'd imagine something like this would come from someone who figured out a bunch of these things and applies them to some package-like thing on top of dbt.)
