How a data warehouse could become a data platform—and an organizational brain.
I truly embrace the idea of a Metrics Layer in the reference modern data architecture. Somehow it reminds me of the OLAP cubes I used to work on in the '90s when the classic BI architecture was simplified, but I know today it's a different "beast," and, indeed, is not a revolutionary concept in the data world.
I think standardization, adaptability, and portability of entities and metrics are critical factors in implementation. There are 8000+ SaaS apps (and still counting) in the Martech, each with its data model and API interfaces, and a standard company uses around 20+ of them daily. Additionally, we have internal apps that serve various consumption types.
First, in the Metrics Layer tool, we need to configure (or better, have pre-configured) standard entities and metrics that can easily be shared with these 20+ SaaS apps (via Reverse ETL tools or directly) and with the internal apps.
Second, the Metric Layer tool must keep up – as easy and transparent as possible - with the evolving data model of the external SaaS apps (which we cannot control) and with the internal apps (which we control and be the "leader" for this change).
Third, if we build the Metric Layer in tool A, how easy is it to port these in tool B? Once having a Metric Layer becomes "hot" in the market, many tools will pop up claiming to do this function. But if we need to refactor everything in Tool B because of Tool A's vendor and data lock, we may lose a lot of resources and business opportunities in doing so.
The DataOs is an exciting concept, but I'm advocating for Open DataOs; hence these three characteristics (and others) are necessary.
There are some open points I'm trying to clarify in my head, but I don't have yet a response
1. There is a thin line between Metrics Layer and Reverse ETL Layer, and it seems that these two will merge at some point, if not already. It looks like we need to reinvent the classical ESB (Enterprise Service Bus) with a more "data-awareness" flavor for me.
2. Will the DWH be "stripped" by metrics, as these will be created/moved in the Metrics Layer? In this case, the DWH will mainly act as an MDM (Master Data Management) system for harmonizing data coming from different data sources at the level of entities (e.g., Customer, Vendors, Products). Are these entities propagated to the Metric Layer (as core entities, as you mentioned), or are we building another level/wrapper around them in the Metric Layer? In this case, may the Metric Layer become more critical in time than the DWH itself?
3. Governance. Big subject. . Who will own the Metrics Layer: Business? Data? IT? Will be some shared responsibility, most probably.
A year later and Census has a full entities product - https://www.getcensus.com/blog/census-entities-make-your-most-important-data-available-for-everyone
I’m skeptical of a “SQL-like API”. How does “GET ENTITY customers FILTER plan = ‘free’” improve upon “SELECT * FROM customers WHERE plan = ‘free’”, assuming that `customers` is some sort of semantic abstraction? The world seems to just be uniting around SQL, for better or worse, and introducing another language on top of SQL feels like the wrong direction.
I’m on board with everything else in the blog. We used dbt to build a “semantic-free” (what you called “legless” in another post) BI experience for FlexIt Analytics:
dbt is not fully there on the entity semantic layer yet, but it feels primed to be. And clearly the next step is not just semantics, but give me the data too. Hopefully the coming metrics server focuses more broadly on “entity,” as you suggested.
Great post Benn. Big believer in the direction you're advocating here. We've recently launched a new take on our product (deliberately slimmed down to the single, simple one-to-many "newsletters" use case): https://www.getvero.com/newsletters/.
A very direct approach in line with the comment "Workday, Salesforce, Adobe—they’re going to be reimplemented as apps on top of the data layer." Assuming the entity layer is defined at the database layer we hope to help teams tap right into that and skip the reverse ETL step entirely. We'll see how this plays out in time but the response has been solid so far and there's definitely a lot of positive energy in this direction.
As you point out there are challenges with "building on top of the entity layer" directly, particularly as the use cases get more complex (e.g. real time messaging vs. batch sends) but we're thinking about these things and we think the sooner the better.
We use "dbt" internally to define our "entity layer" and internally I am loving the direct approach to using this layer. Looking forward to reading more of your thoughts and hopefully being part of the conversation at Vero as this movement gains traction.
This has really got my brain ticking! The warehouse + entity layer as the backend for SaaS tools is such a powerful idea, making it redundant to sync data to hundreds of apps and pay them all to literally store your data that might very well be inaccurate/outdated/inconsistent/unusable.
Also, the entity layer can also simplify attributing events to multiple entities, right? Example: User X performed event A under account Account 123 and event B under Account 789
First off, I love this idea. and now the inevitable but. In order for applications to be built on top of the entity layer, wouldn't there need to be a shared architecture for what those entities are? And what makes up their core structure?
For example, would we need Marketo and Salesforce to have the same architecture for a "Customer" entity for this to work?
What you're talking about is an *ontology* for ubiquitous data objects. There have been lots of attempts to create universal ontologies (often in OWL) or transitions between them (SWRL), and they can be decent at mapping new data onto a well-understood structure.
But they, too, are brittle when encountering the nuances and zaniness of data in the real world. OK, a customer has one canonical name. Except when they have subsidiaries, or changed it by marriage during their tenure. They have one balance that we're billing against. Except the one for accounting accrual purposes. And this other one we created to filter for certain exceptions. This external source trying to tell us that a customer has moved is correct, we should update our info! Wait, now there's 3, which disagree with each other. And ever on and on, translating theory into the land of operational reality.
IIRC Google has published some of their ontologies that drive the entity structure of the entities they surface in search results (and map some of those results onto in order to summarize). And there are attempts to standardize corporate entities (see Thomson-Reuters OpenID, or OpenCorporates), but of course the info available about companies is as varied as the companies themselves - at least humans are limited to one body and one life. Mostly.
Finally, entities eventually become writable. The problem here is that everything we've built so far in this space was never designed for writing or editing, and all the challenges that come with it.
Mutating data in warehouses is a nightmare, the innovations that make Snowflake, BQ fast and scalable are the same ones that make them bad at in-place updates, or any indexed queries - both from a performance perspective but also as an API (mutations in SQL are not fun). If we want true, cross platform data integration or a data-layer that apps build on top of (which aren't just read-only) then from a technical standpoint I would expect it to not look anything like the current modern data stack today.
Not to say that there isn't an interesting and broad scope for building mostly read-only data apps on top of our current stack, but addressing mutability makes this really exciting, and I don't think we can easily slap it on at the end.
I feel like this is describing Materialize as the entity layer what do you think?