34 Comments

Benn, I think you are saying that a universal semantic layer needs to be truly universal to be useful, and I couldn't agree more. As a founder of AtScale, I've been shouting that from the high heavens, and it's why AtScale supports multiple inbound query protocols, including SQL, MDX, DAX, Python, and REST. Supporting only SQL (or something SQL-like) is a no-go in my experience, because as soon as a semantic layer stops being universal, it's not worth the cost or time to implement. In my opinion, a universal semantic layer needs to support the business analyst, data scientist, and application developer personas, which I believe covers enough of the user spectrum to be worth the effort.


Have y'all ever thought about open-sourcing parts of the semantic layer powering AtScale? I would love to see Malloy, dbt-metrics/MetricFlow, AtScale, Cube.js, etc. all converge toward the same universal syntax if possible, similar to how we have ANSI SQL but different flavors like Spark SQL, BigQuery, PostgreSQL, etc.


I think that makes sense, but it still feels fairly constraining? To me, having a bunch of APIs is better than one API, but the problem is still what the semantic layer can do. If it's still basically an OLAP cube, *that's* the problem, not the language of its APIs. As I see it, what potentially limits semantic layers, and makes them more BI tool than semantic layer, is that they only support a handful of types of calculations. But we analyze data in a lot of ways that aren't OLAP cube-style pivots, so if a semantic layer can't also handle those well, it'll never be universal. (A sketch of the kind of analysis I mean is below.)

(For transparency, I've never actually seen AtScale, so I don't know exactly what y'all can or can't do; maybe this criticism doesn't apply.)
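To make the non-pivot point concrete, here's a rough sketch of the kind of sequence-style question I mean: "what do users do right after they sign up, and how long does it take?" The table and column names are invented, and the date math syntax varies by warehouse (this is Snowflake-style):

```sql
-- Hypothetical events table: (user_id, event_name, occurred_at).
with ordered_events as (
    select
        user_id,
        event_name,
        occurred_at,
        -- For each event, look ahead to the user's next event.
        lead(event_name)  over (partition by user_id order by occurred_at) as next_event,
        lead(occurred_at) over (partition by user_id order by occurred_at) as next_occurred_at
    from events
)

select
    next_event,
    count(*) as users,
    avg(datediff('minute', occurred_at, next_occurred_at)) as avg_minutes_to_next
from ordered_events
where event_name = 'signed_up'
group by next_event
order by users desc;
```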


Per annotation one, why are knowledge graphs never discussed on this topic? They seem to address the universal addressability and built-in semantic expressiveness needed as a first-order concern. What I read about semantic layers would more fittingly be called a semantic veneer.


My guess is two somewhat related reasons:

1. Implementing knowledge graphs, at least as I understand them, is somewhat complex and esoteric, whereas relational modeling is more mainstream. The latter is therefore much easier to talk about; plus, simplicity is a draw. This is similar to the point about Malloy: even if knowledge graphs are technically better, people also need to understand them and want to use them.

2. Per the Jared Kushner post from a couple weeks ago, a lot of the people in the community (me included) are uninformed about the industry's history and don't actually know anything about knowledge graphs.


Annotation 6 is something that deserves a blog post of its own. Even within R, the slightly different syntaxes of data.table vs. the dplyr stack circa the early 2010s clearly made me reason through data differently. These days my coworkers and I often differ on the importance of reusable intermediate tables when I'm pairing with someone who doesn't use dbt daily.

In this sense maybe linguistic relativity is real.


Yeah, I've thought about this a lot. It's one of those things that I haven't quite figured out how to express, but it's very easy to *feel*. The way that I thought about what data even was when I was using Excel and some R was very different than how I reasoned through things once I learned SQL. But I've always found it hard to put that exact feeling into words.


All I want* is reusable components across tools, e.g., Mode's definitions and Sisense's snippets available in other tools, and R/Python functions in accessible libraries with interfaces from both languages. Allow me to use the right tool for the job at hand, and make me confident that when I use these components I am using the correct formula/definition, or at least the same faulty one that the rest of the team is using.

*clearly a trivial small ask for a variety of companies with differing incentives to collaborate on.


So I fully agree with this, but I think the components should be a much more expansive list. To me, it's also things like Malloy and Motif, or visual components, or spreadsheet-like interactions.

(I have a much longer set of notes on this about how I think we kinda messed up what a componentized data stack could've been. We built it as a bunch of narrow products that step on each other's toes and all repeat a dozen things. We all build connectivity, and graphs, and a query interface, and so on. We would've been better off making those things the components, where we're all car *part* manufacturers, instead of being car manufacturers with our one special part.)


My gut is that this complexity gets shifted downstream when it should be solved higher up. It may not be possible, but the product and engineering teams generating this data should be the ones thinking about the downstream implications and acting accordingly. Be more thoughtful and intentional when the data is emitted and you solve the problem for downstream use cases. It feels like the industry lets data get pushed out however, and then a variety of teams and tools are forced to clean up the mess.

I get that it's more complicated to do this and breaks the workflow of larger companies, but more work needs to be done at the data-generation level.


I don't think that's possible for the issues here, though. It's one thing to push data quality up to the data producers, which is probably "right," even if it's still hard to implement. But data producers can't actually own any sort of logical modeling. If I'm an engineering team that's logging purchases via Stripe, I can make that data very clean and reliable; I can't tell a finance team how to convert that into properly reported revenue. That's not a mess for the finance team to clean up; that's the finance team's job.


That's fair, but I still can't help thinking it's a balancing act, and each side needs to understand the other's constraints. In your example, the engineering team can understand what the finance team is trying to do and make it easier for them, rather than offloading it entirely onto them. That may mean publishing additional fields, or different values within the same field, or even a duplicate set of fields for different use cases. It's a series of tradeoffs, but the conversation needs to happen; otherwise, costs just get externalized onto other teams, and I've seen too much of that.


I get that, though unless the data is being used for something business-critical (e.g., as a production pipeline back into the product), I get engineering teams' reluctance to commit to much. Try to be helpful, sure, but their first, second, and third job is to build a good product. To me, doing a worse job of that to make things more convenient for people downstream is an afterthought because it should be an afterthought.


Semantics is a whole field, and usually, the name for trying to have computers do semantics is "ontology". The problem is approached from three directions that I know of:

- academia, building all sorts of ontology tools, but unfortunately botching some of them so badly that the whole field was delayed by a decade (see https://terminusdb.com/blog/the-semantic-web-is-dead/)

- data integration, finding ontologies and knowledge graphs useful and rebranding them as "data fabric"

- analytics, coming from the metrics side.

It’s all starting to converge, as shown here: https://www.linkedin.com/posts/chad-sanderson_im-very-happy-to-unveil-the-semantic-warehouse-activity-6958091220157964288-JSXj/


I've read this post a couple of times and still don't think I really understand it? It seems like a good list of capabilities, but how does any of it work?


If I understand it correctly, bear with me for some probably terrible examples:

- the semantic layer has a list of Entities such as Customer and Product, Events such as Purchase, and relationships between them (a Customer performs a Purchase, a Purchase includes a Product…)

- the mapping lets you integrate data at all ETL stages: this table coming in from Salesforce contains some Customer data, with the "fullname" field corresponding to my semantic "firstname" and "lastname" properties; once I split it with dbt, there are two fields, each mapping to one semantic property… (a rough SQL sketch of this is below)

- the metrics layer defines metrics once ("monthly revenue") and relates them to semantic entities and events ("monthly revenue comes from sales, but sales include products, which means that I can compute monthly revenue by product")

- all schemas of all these data stages are stored in a data catalog and used to check that the data produced matches expectations (data contract)

- the BI tool can show the semantic meaning of everything, plug into denormalized data from the data mart stage and compute metrics from it, but also follow semantic relationships to view more if need be (either as a knowledge graph or as on-the-fly data modeling? I don’t know.)

So, semantic definitions are used at least twice (data integration and metrics/BI), maybe more once data analysis tools adapt. It's not shown here, but you could even use it to define access control.
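Here's roughly how I picture the mapping and metric pieces landing in SQL. This is only a sketch, with invented table, field, and view names:

```sql
-- Mapping: a raw Salesforce table is bound to the semantic "Customer" entity,
-- and its "fullname" field is split to match the semantic
-- "firstname" / "lastname" properties.
create view customer as
select
    id                           as customer_id,
    split_part(fullname, ' ', 1) as firstname,
    split_part(fullname, ' ', 2) as lastname
from raw_salesforce.contacts;

-- Metric: "monthly revenue" is defined once against the Purchase event...
create view monthly_revenue as
select date_trunc('month', purchased_at) as month, sum(amount) as revenue
from purchase
group by 1;

-- ...and because Purchase relates to Product, the same definition can be
-- sliced by product without being redefined.
create view monthly_revenue_by_product as
select date_trunc('month', purchased_at) as month, product_id, sum(amount) as revenue
from purchase
group by 1, 2;
```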


So in theory, I think some setup like this makes sense, if nothing else because it can operate as a filter that narrows down some giant list of raw tables that get dumped out of SFDC, etc., into a much smaller set of key business tables. That said, I think that's a little bit less about capital-S Semantics (which is really hard to get right, as evidenced by the fact we've been talking about it for decades without much progress) and much more about delineating which data is "production" in a much clearer way.

I think one of the big problems with how many data stacks work today is that there's no hard line between production data that's reliably maintained and data that's not. I'm not a huge believer in some of the details about data contracts and this sort of semantic warehouse (as best I understand it, anyway), but I do think they could be very useful as markers of what is production.


My understanding of some of the uses for semantic annotation is that you can use it to automatically pull in rules for your data. For instance, if your field "ssnum" is semantically mapped to an SSN, you automatically get the equivalent of a PII tag, which reverberates into data suppression rules, dataset access control, column-level masking, etc. You can also derive data quality checks about the format and maybe uniqueness. Going even further, you could have rules about someone not being their own manager, etc.
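The checks such annotations would generate are simple enough to write by hand today. A sketch with made-up table and column names (the pattern syntax is Postgres-flavored):

```sql
-- Check derived from a semantic rule: an employee must not be their own manager.
-- The test passes when this query returns zero rows.
select employee_id, manager_id
from employees
where employee_id = manager_id;

-- A field annotated as an SSN could likewise generate a format check.
select ssnum
from employees
where ssnum is not null
  and ssnum not similar to '[0-9]{3}-[0-9]{2}-[0-9]{4}';
```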


That sounds nice, but also kinda hocus pocus. I feel like we've been sold some version of that promise for long enough that I've lost all faith that anyone will ever pull it off.


To your point on footnote #6, this is just another argument that people are at the core of the "problem," as it were. The reason we use the term "semantic layer" in everything we're talking about here is that it is a means of communication, which is of course invented by people (or humanity, if we want to go that far). How we conceptualise anything can be largely driven by the languages we speak, both linguistically and computationally. In every non-English language there are things that cannot be translated into English, or that cannot be explained as effectively as in the source language. How we arrive at conclusions isn't universal, so I think the idea of a "universal semantic layer" is incredibly difficult to achieve.


I think that gets a little at a question at the core of this, which is: what's the point of the semantic layer in the first place? If it's to create a bunch of metrics and stuff, then 1) that seems doable, but 2) it's basically BI. But if it's to encode a model of how the world works, then yeah, that's nearly impossible.

This blog post, which is from one of the leaders in AI, has stuck with me on this point. In talking about trying to build AI models, he says "the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds... All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless."

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

If semantic layers are here to capture that complexity, it seems like the game is lost.


Yes, exactly. I think that's why I get a little squirrelly about the "one source of truth" cliché that has been running riot in the BI world since, well, its inception. In the context of trying to capture that complexity, it's impossible, so why are people and companies pouring a tremendous amount of capital into trying to make it happen?


Yeah, I think my somewhat cynical answer to that is that it sells. Single sources of truth *sound* really good, so it's a perpetually pretty effective marketing message.


Thanks, Benn!


Ok, I think I see the issue. We are conflating OLAP's *calculation* capabilities with the 1990s implementation of materializing a physical cube (AtScale does NOT build cubes). What makes OLAP (and spreadsheets) so powerful is that they are *cell-based* calculation engines, not row-based engines like relational databases. That's why SQL can't be the only dialect a semantic layer supports. I argue that a multidimensional calculation engine is required to handle a wider variety of use cases, many of which require *cell-based* calculations. For example, functions like Previous, Next, Lead, Lag, Parallel Period, Ancestor, Parent, Descendants, Children, and Siblings are all cell-based calculations that may be required to express complex logic in a semantic layer. I would love to show you how AtScale does this for a billion rows of data.
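To illustrate, here is roughly the closest row-based SQL gets to a "parallel period" comparison. It works for one measure at one grain, but it has to be rewritten for every measure, grain, and hierarchy level, whereas a cell-based engine expresses the relationship once. Table names are illustrative:

```sql
-- Row-based approximation of a "parallel period" comparison:
-- this month's revenue vs. the same month a year earlier.
with monthly as (
    select date_trunc('month', order_date) as month, sum(amount) as revenue
    from orders
    group by 1
)

select
    month,
    revenue,
    -- Assumes every month is present and contiguous; a cell-based engine
    -- addresses "the same member one year back" directly instead.
    lag(revenue, 12) over (order by month) as revenue_same_month_last_year
from monthly
order by month;
```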


I think I half follow that? I get how cell-based is more flexible than row-based, though I'm not sure that fully resolves it for me. At this point, I suspect seeing it would make it make more sense to me.


At present (likely to change with the next Substack I read), my data world is composed of entities, the things in a particular domain that we care about, and events, the things that happen to those entities. These concepts have been pervasive over the years through various "flavor of the decade" data tools: facts, dimensions, business objects, models, stars, snowflakes, and even activity schemas. When we combine those in certain ways and apply very basic or very complex maths to them, we yield measures (or metrics, or features). At best, our data models and semantic layers provide a map for navigating the relationships between our entities, events, and measures. And we often raise up certain combinations with special titles so that our circle of data colleagues (and their tools of choice) have a short-hand for referencing them (I'm talking about you, "community-adjusted EBITDA before growth investments"). So perhaps our current data lingua franca is to blame. Perhaps it lacks the expressiveness we require even if we appreciate its approachability. But what are we to do then if we yearn for something that is universal in both its analytic applicability and usability?

I don't know either, Benn.

But if I were to try (or better yet, someone much more capable than me), I think I would start by building everything from entities, events, and measures. Everything else would simply be an intermediary, existing only for convenience of communication or speed of delivery.
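A rough sketch of what building from entities, events, and measures might look like in SQL terms, with made-up names:

```sql
-- Entity: the things we care about.
create table customer (
    customer_id bigint primary key,
    name        text,
    region      text
);

-- Event: the things that happen to those entities.
create table purchase (
    purchase_id  bigint primary key,
    customer_id  bigint references customer (customer_id),
    product_id   bigint,
    amount       numeric,
    purchased_at timestamp
);

-- Measure: entities and events combined, with some math applied.
create view average_purchase_by_region as
select c.region, avg(p.amount) as avg_purchase_amount
from purchase p
join customer c using (customer_id)
group by c.region;
```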


Yes, this is pretty much what I had arrived at as well.

Entities and events, which can be variously rearranged to produce metrics. A semantic layer sits atop all three.

https://alexpetralia.com/2023/03/23/the-sources-layer-schema-first-versus-schema-last-paradigms-part-4/


I was just talking with someone tonight about this, and we landed on something sort of similar, where maybe you could do this with just entities and events? OLAP cubes can handle most things other than events, and maybe some blended combination is actually what we need. I'm not anywhere near enough of an expert to have any idea how to make foundational updates to something like Kimball, though, so I'm mostly just making up ideas. But it _seems_ feasible?


I'm familiar with SQL but not dbt Core.

So I'm confused about how the former encodes computation but not entities, while the latter encodes entities but not computation.

Can anyone give me examples so I can grok this better? I'm more of a web developer.

I mean, I can create tables that represent entities in an RDBMS; how is that not encoding entities? I'm confused.


So dbt Core is essentially a script that creates tables on a schedule. I'd argue that this encodes entities because it outputs entities; e.g., you can use it to clean up a messy Stripe transaction log into a clean "purchases" table.

But what you can't do with dbt Core (at least not easily) is describe how you perform calculations on top of those purchases. How do you define monthly revenue, or the average purchase price by region? These are computations that you can't put in dbt Core.
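As a rough illustration (with invented table and column names), the first query below is the kind of thing a dbt model encodes (an entity), and the second is the kind of computation that has historically lived outside of it:

```sql
-- What dbt Core encodes: a model whose output *is* an entity, here cleaning
-- a raw Stripe-style charge log into a purchases table.
select
    id                    as purchase_id,
    customer              as customer_id,
    amount / 100.0        as amount_usd,
    to_timestamp(created) as purchased_at
from raw_stripe.charges
where status = 'succeeded';

-- What it historically hasn't encoded: the computation on top.
-- "Monthly revenue" lives in whichever tool issues this query.
select date_trunc('month', purchased_at) as month, sum(amount_usd) as revenue
from purchases
group by 1;
```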

You can put these in dbt's metrics features now, so they're adding computations. But what you can't do, and what's kind of the point of the post, is add specialized computations, like statistical calculations for A/B tests or sequence queries like what Motif adds. And without that, it feels like we're going to keep building data tools that add those things, and those tools will keep bypassing the computational semantic layers to do it.


I am a big fan of the idea of a semantic layer that is flexible enough to support blended querying. It seems like some flavor of SQL + Jinja that makes it easy to both pull well-defined metrics and augment them with ad-hoc queries against certified source tables would go a long way. If I am describing something that already exists or is on the roadmap today, please point me in that direction so I can give them all my pocket money.
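Something like this is what I have in mind. The macro call is loosely modeled on dbt's metrics package, so treat the exact syntax and the output column names as a guess; the source table names are made up:

```sql
-- An ad-hoc query against a certified source table...
with signups as (
    select date_trunc('month', created_at) as month, region, count(*) as new_accounts
    from certified.accounts
    group by 1, 2
)

-- ...blended with a well-defined metric pulled from the semantic layer.
-- Macro call loosely modeled on dbt's metrics package; the exact syntax and
-- the output columns ("date_month", "revenue") may differ in practice.
select r.date_month, r.region, r.revenue, s.new_accounts
from {{ metrics.calculate(
        metric('revenue'),
        grain='month',
        dimensions=['region']
    ) }} as r
left join signups as s
  on s.month = r.date_month
 and s.region = r.region
```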


As I understand it, this is similar to how dbt's metrics package works. I'm not sure if this is its intended use (it might be a thing you can do but not what you're encouraged to do), nor am I sure whether replacing it with Transform's MetricFlow will change that or not.


On it. #UniversalSchema
