Taking stock of the new semantic layers.
Benn, I think you are saying that a universal semantic layer needs to be truly universal to be useful - I couldn't agree more. As a founder of AtScale, I've been shouting that from the high heavens, and it's why AtScale supports multiple inbound query protocols, including SQL, MDX, DAX, Python, and REST. Supporting only SQL (or something SQL-like) is a no-go in my experience, because as soon as a semantic layer stops being universal, it's not worth the cost or time to implement. In my opinion, a universal semantic layer needs to support the business analyst, data scientist, and application developer personas, which I believe covers enough of the user spectrum to be worth the effort.
Per annotation one, why are knowledge graphs never discussed on this topic? They seem to address the universal addressability and built-in semantic expressiveness needed as a first-level concern. What I read about regarding semantic layers seems more fittingly referred to as a semantic veneer.
Annotation 6 is something that deserves a blog post of its own. Even within R, the slightly different syntax of data.table vs. the dplyr stack circa the early 2010s clearly made me reason through data differently. These days my coworkers and I often have differences of opinion about the importance of reusable intermediate tables when I'm pairing up with someone who does not use dbt daily.
In this sense maybe linguistic relativity is real.
All I want* is reusable components across tools, e.g. Mode's definitions & Sisense's snippets in other tools, R/Python functions in accessible libraries with interfaces from Python/R. Allow me to use the right tool for the job at hand and make me confident that when I use these components I am using the correct formula/definition -- or at least the same faulty one that the rest of the team is using.
*clearly a trivial small ask for a variety of companies with differing incentives to collaborate on.
My gut is that this complexity gets shifted downstream when it should be solved higher up. It may not be possible, but the product and engineering teams that are generating this data should be the ones thinking about the downstream implications and acting accordingly. Be more thoughtful and intentional when the data is being emitted and you solve the problem for downstream use cases. It feels like the industry lets the data be pushed out however it comes, and then a variety of teams and tools are forced to clean up the mess.
I get that it’s more complicated to do this and breaks the workflow of larger companies, but more work needs to be done at the data generation level.
Semantics is a whole field, and usually, the name for trying to have computers do semantics is "ontology". The problem is approached from three ends that I know of:
- academia, building all sorts of ontology tools, but unfortunately, botching some of them so hard that the whole field is delayed by a decade (see https://terminusdb.com/blog/the-semantic-web-is-dead/)
- data integration, finding ontologies and knowledge graphs useful and rebranding them as "data fabric"
- analytics, coming from the metrics side.
It’s all starting to converge, as shown here: https://www.linkedin.com/posts/chad-sanderson_im-very-happy-to-unveil-the-semantic-warehouse-activity-6958091220157964288-JSXj/
To your point on footnote #6, just another argument that it's people at the core of the "problem", as it were. The reason we use the term "semantic layer" in everything we're talking about here is because it is a means of communication, which is of course invented by people. (Or humanity, if we want to go so far.) How we conceptualise anything can be largely driven by the languages we speak, both linguistically and computationally. In every non-English language, there are things that cannot be translated into English, or that cannot be explained as effectively in English as in the source language. How we arrive at conclusions isn't universal, so I think the idea of a "universal semantic layer" is incredibly difficult to achieve.
Ok. I think I see the issue. We are conflating OLAP's *calculation* capabilities with the 1990s implementation of materializing a physical cube (AtScale does NOT build cubes). What makes OLAP (and spreadsheets) so powerful is that they are *cell-based* calculation engines, not row-based engines like relational databases. That's why SQL can't be the only dialect a semantic layer supports. I argue that a multidimensional calculation engine is required to handle a wider variety of use cases, and many of those require *cell-based* calculations. For example, functions like Previous, Next, Lead, Lag, Parallel Period, Ancestor, Parent, Descendants, Children, and Siblings are all examples of cell-based calculations that may be required to express complex calculations in a semantic layer. I would love to show you how AtScale does this for billions of rows of data.
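To make the cell-based idea concrete, here is a minimal Python sketch (my own toy illustration, not AtScale's implementation or API): values are addressed by coordinates like (year, month), the way a spreadsheet or OLAP engine addresses cells, rather than by row position.

```python
# Hypothetical sketch of "cell-based" calculations: each value is addressed
# by a (year, month) coordinate rather than a row offset. Function names
# mirror the OLAP functions mentioned above but are illustrative only.

sales = {  # (year, month) -> revenue
    (2021, 11): 100, (2021, 12): 120,
    (2022, 11): 140, (2022, 12): 165,
}

def lag(cell, periods=1):
    """Previous period's value for the given (year, month) cell."""
    year, month = cell
    month -= periods
    while month < 1:        # roll back across year boundaries
        month += 12
        year -= 1
    return sales.get((year, month))

def parallel_period(cell):
    """Same month in the prior year."""
    year, month = cell
    return sales.get((year - 1, month))

cell = (2022, 12)
print(lag(cell))              # 140  (November 2022)
print(parallel_period(cell))  # 120  (December 2021)
```

The point of the sketch: in a row-based engine these lookups become self-joins or window functions; in a cell-based model they are just arithmetic on coordinates.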
At present (likely to change with the next Substack I read), my data world comprises entities, the things in a particular domain that we care about, and events, the things that happen to those entities. These concepts have been pervasive over the years through various "flavor of the decade" data tools - facts, dimensions, business objects, models, stars, snowflakes, and even activity schemas. When we combine those in certain ways and apply very basic or very complex maths to them, we yield measures (or metrics, or features). At best, our data models and semantic layers provide a map for navigating the relationships between our entities, events, and measures. And we often raise up certain combinations with special titles so that our circle of data colleagues (and their tools of choice) have a short-hand for referencing them (I'm talking about you, "community-adjusted EBITDA before growth investments"). So perhaps our current data lingua franca is to blame. Perhaps it lacks the expressiveness we require, even if we appreciate its approachability. But what are we to do then if we yearn for something that is universal in both its analytic applicability and usability?
I don't know either, Benn.
But if I were to try (better yet someone much more capable than me), I think I would start by building everything from entities, events, and measures. Everything else would simply be an intermediary, existing only for convenience of communication or speed of delivery.
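A toy sketch of what "building everything from entities, events, and measures" could look like (all names here are my own assumptions, not from any existing tool):

```python
# Toy model: entities are the things we care about, events happen to
# entities, and a measure is just math applied to a slice of events.

from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    kind: str      # e.g. "customer"
    key: str       # e.g. "c-42"

@dataclass(frozen=True)
class Event:
    entity: Entity
    name: str      # e.g. "order_placed"
    amount: float

def measure(events, entity_kind, event_name, agg=sum):
    """A measure: an aggregation over events for a kind of entity."""
    values = [e.amount for e in events
              if e.entity.kind == entity_kind and e.name == event_name]
    return agg(values)

customer = Entity("customer", "c-42")
events = [
    Event(customer, "order_placed", 50.0),
    Event(customer, "order_placed", 30.0),
    Event(customer, "refund_issued", 10.0),
]

print(measure(events, "customer", "order_placed"))  # 80.0
```

Everything else (facts, dimensions, stars, activity schemas) would then be derivable intermediaries over these three primitives.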
I’m familiar with SQL but not dbt Core.
So I’m confused about how the former encodes computation but not entities, while the latter encodes entities and not computation.
Can anyone give me examples so I can grok better? I’m more of a web developer
I mean, I can create tables that represent entities in an RDBMS; how is that not encoding entities? I’m confused.
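One way to read the distinction, sketched under my own interpretation (not a definition from the post): a raw SQL query encodes a computation, but the reusable "customer" concept lives only in the analyst's head; a dbt-style model gives that same computation a durable name that downstream queries reference, which is what "encoding the entity" usually means.

```python
# Sketch using stdlib sqlite3. The ad-hoc query below encodes *computation*;
# the named view stands in for a dbt model that encodes the *entity*.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INT, amount REAL);
    INSERT INTO orders VALUES (1, 50.0), (1, 30.0), (2, 20.0);
""")

# Plain SQL: the logic is explicit, but "customer" exists nowhere reusable.
ad_hoc = conn.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()

# dbt-style: materialize the same computation under an entity name, so
# every later query shares one definition of "customers".
conn.execute("""
    CREATE VIEW customers AS
    SELECT user_id AS customer_id, SUM(amount) AS lifetime_value
    FROM orders GROUP BY user_id
""")
reused = conn.execute(
    "SELECT lifetime_value FROM customers WHERE customer_id = 1"
).fetchone()

print(ad_hoc)   # [(1, 80.0), (2, 20.0)]
print(reused)   # (80.0,)
```

So yes, tables can represent entities; the point is whether the shared, named definition lives in the warehouse for everyone, or is re-derived in each query.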
I am a big fan of the idea of having a semantic layer that is flexible enough to support blended querying. It seems like some flavor of SQL + Jinja that makes it easy to both pull well-defined metrics and augment them using ad-hoc queries against certified source tables would go a long way. If I am describing something that already exists or is on the roadmap today, please point me in that direction so I can give them all my pocket money.
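A minimal sketch of what that blended querying could look like, with hypothetical names throughout and Python's stdlib `string.Template` standing in for Jinja: certified metric definitions expand from a registry, while the surrounding ad-hoc SQL stays free-form.

```python
# Hypothetical blended query: $metric placeholders expand into centrally
# certified SQL expressions; everything else is the analyst's ad-hoc SQL.

from string import Template  # standing in for Jinja templating

# Centrally certified metric definitions (the "semantic layer").
METRICS = {
    "revenue": "SUM(order_amount)",
    "orders": "COUNT(DISTINCT order_id)",
}

def render(sql: str) -> str:
    """Expand $metric placeholders into their certified expressions."""
    return Template(sql).substitute(METRICS)

# Blended query: a pulled metric plus an ad-hoc filter written freely.
query = render("""
    SELECT region, $revenue AS revenue
    FROM certified.orders
    WHERE order_date >= '2022-01-01'   -- ad-hoc augmentation
    GROUP BY region
""")
print(query)
```

The appeal is that the metric formula has exactly one definition, while the ad-hoc parts of the query remain ordinary SQL against certified tables.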
On it. #UniversalSchema