A humble YAML file, with ambitions for more.
This is basically a mash-up of MDM and a semantic layer, and I like it. It would be great to have all the business rules/logic defined once (mostly by the business, in a tool they can easily use) - and then accessible by every tool and process that needs it. I'd hate to see this buried in dbt, because its useful scope is far beyond the analytics/DW teams.
This is a great product spec for a semantic (or metrics) layer. There's a ton of value in separating the business logic layer from the physical layer.
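To make that separation concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the metric `net_revenue`, the logical entity `orders`, the table `analytics.fct_orders`, and the helper `render_sql` are made up for illustration, and the dictionaries stand in for whatever shared config file (YAML or otherwise) would actually hold the definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str        # business-facing metric name
    expression: str  # aggregate written against logical column names
    source: str      # logical entity the metric is computed from


# Business logic layer: defined once, owned by the business (hypothetical content).
METRICS = {
    "net_revenue": Metric("net_revenue", "SUM(amount - refund)", "orders"),
}

# Physical layer: where each logical entity currently lives (hypothetical content).
PHYSICAL_TABLES = {
    "orders": "analytics.fct_orders",
}


def render_sql(metric_name, metrics=METRICS, tables=PHYSICAL_TABLES):
    """Combine a business definition with its physical location into a query."""
    metric = metrics[metric_name]
    return f"SELECT {metric.expression} AS {metric.name} FROM {tables[metric.source]}"


print(render_sql("net_revenue"))
```

If the table moves, only `PHYSICAL_TABLES` changes; if the business redefines net revenue, only `METRICS` changes. Any other tool that reads the same definitions gets the same answer.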
Ben, the solution you are proposing via a config file already exists in the form of business reference data, and there are tools that make it seamless and business-friendly, such as Informatica Reference 360 and Ataccama ONE.
I was talking about something similar with a colleague recently. He described them as being like makefiles for SQL, though I think that's not quite what we're aiming for. Discovery has advantages in that it copes with different levels of skill, different approaches to writing pipelines, and general human fallibility.
Things do go really wrong if you hard-code the reference data. To me, such a YAML file is a complicated way of avoiding putting tables in Excel and using those as groupings. One of the central reasons for Excel's popularity is how simple it is to create tables of reference data, for the fast and furious. The fact that YAML is less accessible than Excel does not make it a better solution, I think.
On the engineering side, this is exactly what parameter stores do: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html. I've used the AWS Parameter Store in the way you're describing for ML projects. It seems like what you're talking about could be built on top of that, and then you get all of the API integration stuff out of the box from the AWS service.
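As a rough sketch of what that could look like, the function below reads every parameter under a path prefix and parses each value as JSON. The `/business-rules/` prefix, the JSON encoding of values, and the `load_definitions` helper are assumptions for illustration only; the paginator call is boto3's `get_parameters_by_path` operation on the SSM client.

```python
import json


def load_definitions(ssm_client, prefix="/business-rules/"):
    """Return {relative_name: parsed_value} for every parameter under `prefix`.

    `ssm_client` is expected to behave like boto3.client("ssm").
    The prefix and the JSON-encoded values are assumptions of this sketch.
    """
    paginator = ssm_client.get_paginator("get_parameters_by_path")
    definitions = {}
    for page in paginator.paginate(Path=prefix, Recursive=True):
        for param in page["Parameters"]:
            # Names come back as full paths; strip the shared prefix.
            key = param["Name"][len(prefix):]
            definitions[key] = json.loads(param["Value"])
    return definitions
```

With AWS credentials configured, this would be called as `load_definitions(boto3.client("ssm"))`. Passing the client in, rather than creating it inside the function, keeps the lookup easy to exercise against a stub.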
This is exactly how we have built applications for 40 years.
https://unibase.zenucom.com does all the things you describe plus many more.
The big difference is that we don't use SQL, and I have technical reasons for that.
The Unibase data dictionary is required for the programs to run. It is the semantic layer that defines the tables, table relationships, and calculations, with all sorts of built-in functionality, including arrays as a first order object. The list is very long. It also shows just how much is needed if you want to build a true semantic layer.
The big thing, as you noted, is that things are defined in one place only, rather than in every instance.
This is what more traditional analytics tools already do, although to be fair they fall short in other respects. With AEP exposing schemas, pipelines, SQL, etc., it seems the two approaches are moving towards converging at some point.