The data config

Jul 15, 2022

A humble YAML file, with ambitions for more.

26 Comments

Jul 15, 2022

This is basically a mash-up of MDM and the semantic layer, and I like it. It would be great to have all the business rules/logic defined once (mostly by the business in a tool they can easily use) - and then accessible by every tool and process that needs it. I'd hate to see this buried in dbt, because its useful scope is far beyond the analytics/DW teams.

Benn Stancil

Jul 15, 2022

As long as it’s exposed in some way, it seems like it could work. And that’s actually one of the benefits of doing this through something like a config file - it’s just a file. Better to live in that than in some application that you can only access through various proprietary APIs.

David Andersen

Jul 15, 2022

Agreed that the underlying implementation would be ideal as open, accessible files (and APIs!). But the interface for the business will need to be very good or it will not be adopted. If this just becomes another config layer for data/analytical engineers that they have to maintain (I know you're not arguing for this), I think it will fail. Or fall way short of what is possible. Though I suppose we're getting to the point where someone could stand up something in Retool/Bubble/etc. to be a decent UI.

Benn Stancil

Jul 15, 2022

Maybe, though running it through a service like that feels like it introduces a bunch of other weird dependencies. As complex as data stacks are becoming, I’d prefer to avoid as many of those as we can.

David Andersen

Jul 15, 2022

Yeah, for sure. I just think the 'do this in files' side of the implementation is the easy part. The hard part is 'make this easy for the business.'

When are you building this? :)

Benn Stancil

Jul 15, 2022

i’m just out here yelling at people on the street corner. nobody wants that guy building anything.

Ernest Prabhakar

Jul 15, 2022

Okay, you forced my hand: https://ihack.us/2022/06/30/pipebook-yml-reimagining-notebooks-as-resilient-data-pipelines/

Email me at ernest.prabhakar@gmail.com and we’ll record a demo.

Dave Mariani

Jul 15, 2022

This is a great product spec for a semantic (or metrics) layer. There's a ton of value in separating the business logic layer from the physical layer.

Benn Stancil

Jul 15, 2022

if nothing else, we seem to be moving pretty quickly in that direction, in one form or another.

Ravi Dawar

Jul 16, 2022

Benn, the solution you are proposing via a config file already exists in the form of business reference data, and there are tools that make it seamless and business-friendly, such as Informatica Reference 360 and Ataccama ONE.

Benn Stancil

Jul 18, 2022

That doesn't surprise me - it seems like nearly everything in the new tool stack has some antecedent in prior generations of tools. Which I don't think is a bad thing. Most technology "inventions" are just updated versions of old stuff, but now, in the cloud, or using language X, or built with collaborative features, or whatever.

RowanC

Jul 16, 2022

I was talking about something similar with a colleague recently. He described them like make-files for SQL. Though I think that’s not quite what we’re aiming for. Discovery has advantages in that it copes with different levels of skill, approaches to writing pipelines, and general human fallibility.

Benn Stancil

Jul 18, 2022

That feels a bit different. I'd assume make files are aimed at the tool builders, whereas this is more about discovery and accessibility. It's conceivable that the make file could be readable by anyone, though that seems tough to pull off.

M van der Heijden

Aug 9, 2022 (edited)

Things indeed go really wrong if you hard-code the reference data. To me, such a YAML file is a complex way of not putting tables in Excel and using those as groupings. One of the central reasons for the popularity of Excel is the simple creation of tables with reference data for the fast and furious. The fact that YAML is less accessible than Excel does not make it a better solution, I think.

Benn Stancil

Aug 9, 2022

I can see that, though I'd argue that accessibility isn't always a good thing. If something needs to be consistent, I'd rather people have to go through some hoops to change it. It doesn't mean it should be walled off or gated; just that some friction can be a good thing.

M van der Heijden

Aug 10, 2022

Agree, friction is good. I still look for a tool that allows people to relatively easily set up some reference data that then can also be shared across an organisation and maintained in a useful manner.

Michelangelo D'Agostino

Jul 18, 2022

On the engineering side, this is exactly what parameter stores do: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html. I've used the AWS Parameter Store in the way you're describing for ML projects. Seems like what you're talking about could be built on top of that, and then you get all of the API integration stuff out of the box from the AWS service.
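As a rough sketch of what reading shared constants out of a parameter store might look like: the function below pages through `get_parameters_by_path` on an SSM client and collects the results into a dictionary. The `/data-config` prefix and the `load_constants` helper are assumptions made up for illustration; nothing in the thread specifies how the parameters would be named or organised.

```python
def load_constants(ssm_client, prefix: str) -> dict:
    """Read every parameter stored under `prefix` into a {name: value} dict.

    `ssm_client` is expected to behave like boto3.client("ssm").
    The prefix layout is a hypothetical convention, not an AWS requirement.
    """
    constants = {}
    kwargs = {"Path": prefix, "Recursive": True}
    while True:
        page = ssm_client.get_parameters_by_path(**kwargs)
        for param in page["Parameters"]:
            # Strip the shared prefix so callers see short names.
            short_name = param["Name"][len(prefix):].lstrip("/")
            constants[short_name] = param["Value"]
        token = page.get("NextToken")
        if not token:
            return constants
        # More results remain; request the next page.
        kwargs["NextToken"] = token


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   constants = load_constants(boto3.client("ssm"), "/data-config")
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to exercise with a stand-in object when no AWS account is at hand.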

Benn Stancil

Jul 18, 2022

This looks like basically it, but ideally without all the AWS console headache.

Michelangelo D'Agostino

Jul 18, 2022

Definitely. Someone should put a nicer frontend on it.

Rick Marshall

Jul 18, 2022

This is exactly how we have built applications for 40 years.

https://unibase.zenucom.com does all the things you describe plus many more.

The big difference is that we don't use SQL, and I have technical reasons for that.

The Unibase data dictionary is required for the programs to run. It is the semantic layer that defines the tables, table relationships, and calculations, with all sorts of built-in functionality, including arrays as a first-order object. The list is very long. It also shows just how much is needed if you want to build a true semantic layer.

The big thing, as you noted, is that things are defined in one place only rather than every instance.

Benn Stancil

Jul 18, 2022

So, I don't think of this particular thing as being a semantic layer, at least not in the traditional sense. Semantic layers have to incorporate all sorts of relationships; with this, my goal would be simpler: Just give people a place to define the constants. That's part of a semantic layer, I suppose, but only a small part.
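To make "a place to define the constants" a little more concrete, here is a minimal Python sketch. The `CONSTANTS` dictionary stands in for the parsed contents of a shared config file, and every name and value in it (as well as the `accounts` table) is hypothetical. The idea is only that queries refer to names, and the values live in one place.

```python
from string import Template

# Hypothetical constants, standing in for the parsed contents of a shared
# config file (for example, a YAML file checked into version control).
CONSTANTS = {
    "fiscal_year_start_month": 2,
    "enterprise_seat_threshold": 100,
    "active_window_days": 28,
}

# A query written against names instead of hard-coded literals.
QUERY = Template(
    "SELECT account_id\n"
    "FROM accounts\n"
    "WHERE seats >= $enterprise_seat_threshold\n"
    "  AND last_seen >= CURRENT_DATE - $active_window_days"
)


def render(query: Template, constants: dict) -> str:
    """Fill in every placeholder; raises KeyError if a constant is undefined."""
    return query.substitute(constants)


if __name__ == "__main__":
    print(render(QUERY, CONSTANTS))
```

Because `substitute` fails on an undefined name, a query that refers to a constant nobody has defined breaks loudly instead of silently falling back to a local guess.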

Rick Marshall

Jul 18, 2022

If you'll excuse the pun, understanding what semantics is is the biggest problem.

When I started this journey I wasn't comfortable with the set representation used for relational databases. The discomfort comes from the idea that something devoid of anything except set theory is a very high entropy solution and while easy to define mathematically it is also difficult, impossible maybe, to add information and lower entropy.

In the SQL world, the dictionary is embedded in the table, and it reflects the hardware more than the problem. Some set theory maths is used to do things, but as you pointed out, there is little consistency and maintenance is a headache. Worse, if you watch the discussions as programmers try to work out which join to use in which situation, you realise that SQL joins might be more of a programming problem than C pointers. Add to that the extreme difficulty of building significant optimisers and you have a problem.

Let me add here that Codd's rules for a relational database are the things that matter in terms of table structures, and they remain critical to good design in spite of attempts to break them (denormalisation, for example).

For the most part Unibase takes form and report layouts, refers to the data dictionary to get definitions of tables, calculations, and table relationships, and using this knowledge builds a plan to deliver the requested item. There's a lot to it but I would suggest there is enough semantic information that the most critical part of unit testing after verifying the calculations are correct can be done by a combination of text analysis and asking Unibase to explain how it is going to do something (every program has a signature).

By supporting array calculations, more or less like tensors, and returning aggregated (summed) values from other tables (think invoice totals in the header as the sum of calculations in lines) we have a very expressive language.

I have yet to see anything quite like it.

We also treat field types as groups, not sets. This allows Unibase to decide that some operations are nonsense and complain. For example, adding two dates is not meaningful; the same goes for invoice numbers (which aren't numbers anyway).
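The same idea can be seen in miniature in Python's standard `datetime` module. This is only an analogy and says nothing about how Unibase itself is implemented; the invoice dates are made up for the example.

```python
from datetime import date, timedelta

invoice_date = date(2022, 7, 18)
due_date = date(2022, 8, 17)

# The difference between two dates is a duration, which is meaningful.
term = due_date - invoice_date

# A date plus a duration is a date again.
reminder = invoice_date + timedelta(days=7)

# Adding two dates has no sensible meaning, so the type refuses.
try:
    invoice_date + due_date
    date_sum_allowed = True
except TypeError:
    date_sum_allowed = False
```

The type carries enough meaning to reject the nonsensical operation, rather than treating both values as interchangeable numbers.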

How do I know this is significant? Simple, really. Unibase can be used to build reliable applications that are far more complex than anything SQL can be used to build.

Finally, yes we have to have a way to build connected operations and that is done with scripts. They are usually very small and very focused.

Anyway, I enjoy the blog. Please keep challenging everything.

Benn Stancil

Jul 19, 2022

It's a different piece of technology, but the folks at Relational AI are chasing a similar problem - or, probably more accurately, trying to better deal with SQL's shortcomings. https://relational.ai/

I think we already talked about this in a prior post, but the question for me is do these languages ever break through enough to overcome SQL's enormous inertia in the space. Even with all the hype around Hadoop et al ten years ago, SQL did fine (and came out the other side stronger). It's going to take a lot to shake people free of it, I think.

Rick Marshall

Jul 19, 2022

I agree. It's the reason that interests me. SQL is now a career choice. If you want a job as a database programmer, you won't get past square one without SQL expertise. It doesn't matter how good the alternatives are. Those same people buy what they know. A deadly embrace. That's why the NoSQL movement has had to introduce SQL at some level.

Fortunately, neither my team nor my customers care about SQL. Those who are aware of the causes of development failure are often keen to do something else, and with an order-of-magnitude reduction in development costs and order-of-magnitude improvements in performance and development time, the risk becomes worthwhile.

While this won't be a mass movement, most of our work comes from companies that have tried the mainstream technologies and experienced failure, limitation, or excessive development times.

We can do better, but not if the motivation is job security or ego instead of customer satisfaction.

Vlad

Jul 17, 2022

This is what more traditional analytics tools already do, although to be fair, they are lacking in other respects. With AEP exposing schemas, pipelines, SQL, etc., it seems the two approaches are moving towards converging at some point.

Benn Stancil

Jul 18, 2022

Yeah, another commenter said something similar. As I said there, that doesn't surprise me - most data tools aren't really inventing anything new, as much as they're adapting old ideas to fit new architectures, into the new ecosystem of tools, and so on. If the prior versions didn't work, we should understand why. But if they did, great, let's make a new version for the new stuff.

benn.substack