I truly embrace the idea of a metrics layer in the reference modern data architecture. It reminds me of the OLAP cubes I worked on in the '90s, when the classic BI architecture was simplified; I know today's version is a different "beast," but, indeed, it is not a revolutionary concept in the data world.
I think standardization, adaptability, and portability of entities and metrics are critical factors in implementation. There are 8,000+ SaaS apps (and counting) in the martech landscape, each with its own data model and API interfaces, and a typical company uses 20+ of them daily. On top of that, we have internal apps that serve various consumption needs.
For example:
First, in the metrics layer tool, we need to configure (or better, have pre-configured) standard entities and metrics that can easily be shared with these 20+ SaaS apps (via reverse ETL tools or directly) and with the internal apps.
Second, the metrics layer tool must keep up - as easily and transparently as possible - with the evolving data models of the external SaaS apps (which we cannot control) and of the internal apps (which we control, and where it should lead the change).
Third, if we build the metrics layer in tool A, how easily can we port it to tool B? Once having a metrics layer becomes "hot" in the market, many tools will pop up claiming to fill this role. But if we have to refactor everything in tool B because of tool A's vendor and data lock-in, we may lose a lot of resources and business opportunities along the way.
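To make the portability concern concrete: the most portable form of a metric may simply be plain SQL over warehouse tables, which any tool (or no tool) can read. A minimal sketch, with hypothetical table and column names:

```sql
-- A metric defined as a plain SQL view travels with the warehouse,
-- not with the metrics tool; fct_orders is a hypothetical table.
CREATE VIEW metric_monthly_revenue AS
SELECT
    date_trunc('month', order_date) AS month,
    SUM(amount)                     AS revenue
FROM fct_orders
GROUP BY 1;
```

Anything richer than this (joins, entity definitions, access rules) is exactly what risks being trapped in tool A's proprietary format.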
The DataOS is an exciting concept, but I'm advocating for an open DataOS; hence these three characteristics (and others) are necessary.
There are some open points I'm trying to clarify in my head but don't yet have answers to:
1. There is a thin line between the metrics layer and the reverse ETL layer, and it seems the two will merge at some point, if they haven't already. To me, it looks like we're reinventing the classic ESB (Enterprise Service Bus) with a more "data-aware" flavor.
2. Will the DWH be "stripped" of metrics as they are created in (or moved to) the metrics layer? In that case, the DWH would mainly act as an MDM (Master Data Management) system, harmonizing data from different sources at the level of entities (e.g., customers, vendors, products). Are these entities propagated to the metrics layer (as the core entities you mentioned), or are we building another level/wrapper around them in the metrics layer? And might the metrics layer, in time, become more critical than the DWH itself?
3. Governance. Big subject. Who will own the metrics layer: business? Data? IT? Most probably it will be some shared responsibility.
On questions 1 and 2, I think there can (and will) be fairly clear lines between metrics, the DWH, and reverse ETL. As I see it, the DWH is two things: storage, and compute. You need both, and it doesn't make sense to pull either out of the warehouse (probably). A metrics layer is configuration - it's a recipe for compute, but not the processing itself. The layer itself is just a translation, from one type of request into a command that can be understood by the compute layer.
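To make "recipe for compute" concrete: the layer holds a definition, and the warehouse does the work. A purely illustrative sketch (the request format and table names are made up):

```sql
-- A metrics layer request like:
--   metric: revenue, dimensions: [region], grain: month
-- is only configuration. The layer's job is to translate it into
-- a command the compute layer understands, something like:
SELECT
    date_trunc('month', order_date) AS month,
    region,
    SUM(amount)                     AS revenue
FROM fct_orders
GROUP BY 1, 2;
```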
Reverse ETL interacts with the metrics layer in that it's issuing a command to be interpreted, but most of its work is in the pipelines into 3rd party apps. That's a very different problem than configuring metrics or entities to me.
Thanks for the clarifications. The logical (and maybe also technical) separation between the data warehouse, entity layer, and reverse ETL should be evident (hence my third question, on data governance). But when it comes to the physical implementation, the human element is a critical factor in not "messing things up." Having a universal semantic layer that can speak the business language is worth trying and pursuing.
Yeah, the human side is always the hard side. And in a lot of ways, I think that's the root of the problem. It's not just humans, but humans who are all speaking (subtly) different languages about data.
A year later and Census has a full entities product - https://www.getcensus.com/blog/census-entities-make-your-most-important-data-available-for-everyone
gonna send them a cease and desist unless they pay me my royalties
Great read.
I’m skeptical of a “SQL-like API”. How does “GET ENTITY customers FILTER plan = ‘free’” improve upon “SELECT * FROM customers WHERE plan = ‘free’”, assuming that `customers` is some sort of semantic abstraction? The world seems to just be uniting around SQL, for better or worse, and introducing another language on top of SQL feels like the wrong direction.
I’m on board with everything else in the blog. We used dbt to build a “semantic-free” (what you called “legless” in another post) BI experience for FlexIt Analytics:
https://learn.flexitanalytics.com/docs/dbt/
dbt is not fully there on the entity semantic layer yet, but it feels primed to be. And clearly the next step is not just semantics, but give me the data too. Hopefully the coming metrics server focuses more broadly on “entity,” as you suggested.
The main difference to me is the complexity of the customer object. In the pure SQL version, it's just a table, which means it has to be a flat relationship (one row per customer, where you can only filter on columns in that table, etc).
The abstraction above that would let the object be more complex than just a table, so that you could do something like SELECT * FROM customers WHERE purchases > 5 without having to pre-derive a purchases column in the customers table. If there were a semantic understanding of how to go from dim_customers to dim_purchases, the entity layer could do that on its own.
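Concretely, the entity layer might expand that query into something like this - a sketch, assuming dim_purchases is keyed by customer_id:

```sql
-- What "SELECT * FROM customers WHERE purchases > 5" could compile to,
-- given a declared relationship between dim_customers and dim_purchases.
SELECT c.*
FROM dim_customers AS c
JOIN (
    SELECT customer_id, COUNT(*) AS purchases
    FROM dim_purchases
    GROUP BY customer_id
) AS p
  ON p.customer_id = c.customer_id
WHERE p.purchases > 5;
```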
That's not novel, in that semantic layers like this have been around forever. The difference, I think, is the presentation of that layer. Typically, they expose themselves to users as either a set of relationships, or as an OLAP cube, both of which I think are difficult to understand. In this view, it exposes itself as entities, which is an easier concept for people to intuitively understand.
Great post Benn. Big believer in the direction you're advocating here. We've recently launched a new take on our product (deliberately slimmed down to the single, simple one-to-many "newsletters" use case): https://www.getvero.com/newsletters/.
A very direct approach, in line with the comment "Workday, Salesforce, Adobe—they're going to be reimplemented as apps on top of the data layer." Assuming the entity layer is defined at the database layer, we hope to help teams tap right into it and skip the reverse ETL step entirely. We'll see how this plays out in time, but the response has been solid so far and there's definitely a lot of positive energy in this direction.
As you point out there are challenges with "building on top of the entity layer" directly, particularly as the use cases get more complex (e.g. real time messaging vs. batch sends) but we're thinking about these things and we think the sooner the better.
We use "dbt" internally to define our "entity layer" and internally I am loving the direct approach to using this layer. Looking forward to reading more of your thoughts and hopefully being part of the conversation at Vero as this movement gains traction.
Hey Chris, thanks, and thanks for sharing. I remember wanting something like Vero years ago, when we were manually sending a bunch of CSVs to the marketing team for every email campaign they wanted to run. Reverse ETL tools definitely make that process easier, but as I said in the post, that eventually starts to feel like a kind of awkward solution. So very cool to see something like Vero that's not only solving that problem, but can probably rethink how we send emails because of its close connection to the db.
Thanks Benn. Looking forward to seeing where the road takes us. We'll be sharing more about what we learn!
This has really got my brain ticking! The warehouse + entity layer as the backend for SaaS tools is such a powerful idea - it makes it redundant to sync data to hundreds of apps and pay them all to store your data, which might very well be inaccurate, outdated, inconsistent, or unusable.
Also, the entity layer can simplify attributing events to multiple entities, right? Example: User X performed event A under Account 123 and event B under Account 789.
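Something like this, with a hypothetical event table keyed to both entities:

```sql
-- One event stream, keyed to more than one entity (hypothetical schema):
--   fct_events(event_id, event_name, user_id, account_id, occurred_at)
-- The same events can be attributed per user...
SELECT user_id, COUNT(*) AS events
FROM fct_events
GROUP BY user_id;

-- ...or per account, with no extra syncing:
SELECT account_id, COUNT(*) AS events
FROM fct_events
GROUP BY account_id;
```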
In theory, I'd think so. That seems to be what Segment is trying to do with Personas, and I think that'd work out here.
In practice, there would be a lot of complexities to work through, as Anna talked about here: https://roundup.getdbt.com/p/rip-data-engineering. But at its core, that single event stream is the idea: core datasets that can be the backbone of other apps.
Yeah I read Anna's post which obviously resonated a lot -- I've spent a ridiculous amount of time figuring out how to make sense of product hierarchy in external tools. Personas, AFAIK, doesn't solve this elegantly -- it's still very user focused in how it resolves identities and builds personas.
Excited to see if and how the metrics layer helps solve this. Thanks again for sharing your thoughts!
First off, I love this idea. And now the inevitable "but": in order for applications to be built on top of the entity layer, wouldn't there need to be a shared architecture for what those entities are, and what makes up their core structure?
For example, would we need Marketo and Salesforce to have the same architecture for a "Customer" entity for this to work?
Partially, but I think there's a solution here actually. Key entities, like customers, could be drawn from the warehouse via the entity layer. But that doesn't preclude individual apps from having specific ways to enrich them. The point is to share a core, not to share everything.
What you're talking about is an *ontology* for ubiquitous data objects. There have been lots of attempts to create universal ontologies (often in OWL) or translations between them (e.g., with SWRL), and they can be decent at mapping new data onto a well-understood structure.
But they, too, are brittle when encountering the nuances and zaniness of data in the real world. OK, a customer has one canonical name. Except when they have subsidiaries, or changed it by marriage during their tenure. They have one balance that we're billing against. Except the one for accounting accrual purposes. And this other one we created to filter for certain exceptions. This external source trying to tell us that a customer has moved is correct, we should update our info! Wait, now there are three, which disagree with each other. And on and on it goes, translating theory into the land of operational reality.
IIRC, Google has published some of the ontologies that drive the structure of the entities it surfaces in search results (and onto which it maps some of those results in order to summarize them). And there are attempts to standardize corporate entities (see Thomson Reuters PermID, or OpenCorporates), but of course the info available about companies is as varied as the companies themselves - at least humans are limited to one body and one life. Mostly.
Fair enough, but we still have to deal with that messiness. If a CEO asks me, "how many customers do we have?," they want and need a number, not a long explanation of why that question is actually quite complicated. And if we can give that answer - if we can count every customer, somehow - we should be able to create a table that has one row for each one.
OK, but you're not proposing "a table" above. Certainly nobody who's ever worked with customer data would - because from whatever core details we might define, off spins a universe of one-to-many relationships to attribute tables. The ontology of those has never been anything close to standardized, and it surely varies from business to business even within an industry. If you're an analyst and someone asks you that, you've got a pre-built but inflexible schema, which is (more to the point of your post) not portable from company to company in a way that could be productized.
So in order to build what you're describing, an architect would still have to design the particular schema (i.e., entity structure) that represents customers, customer attributes, and interactions / transactions, within the world of the company - and then a team would have to implement it. If they're an SAP user, they've got a million customizations to the SAP schema and processes. Likewise Oracle. Likewise a million other database-driven apps from CRM to HR to supply chain. Each company's entity structure remains a special snowflake, no two alike.
I may be missing your point, but I'm not seeing how a tech product emerges out of that scenario.
I think all of that is fine? My point isn't that we can (or should) have unified ways of modeling businesses, or the relationships between all of these entities. Nor do I think that every company or app would want to represent them in exactly the same way. But there are some core concepts that I'd want apps to share. I want my CRM and my support software to be based on the same list of users; I'd want my product analytics tool and my A/B testing tool to use the same list of user actions. My argument isn't that we should replace the backend of these apps with one giant unified schema and model, but that we should try to keep these base entities consistent.
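As a sketch of what "sharing a core" might look like (names are hypothetical; each app keeps its own enrichments):

```sql
-- One canonical list of users lives in the warehouse:
CREATE VIEW core_users AS
SELECT user_id, email, created_at
FROM dim_users;

-- Apps enrich the shared core instead of keeping their own user lists,
-- e.g. a CRM reading its app-specific fields alongside it:
SELECT u.user_id, u.email, c.deal_stage
FROM core_users AS u
LEFT JOIN crm_user_fields AS c USING (user_id);
```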
OK. Sounds to me like you've just reinvented master data management, though. Would be a worthy post by you explaining why it never really caught on, other than at big manufacturers.
Fair, and as I said at the end, no idea here is actually new; we're just digging up old stuff and seeing if its time has come.
As for why it didn't catch on before, I don't know. It's something I'd have to dig a lot more into.
Finally, entities eventually become writable. The problem here is that everything we've built so far in this space was never designed for writing or editing, with all the challenges those bring.
Mutating data in warehouses is a nightmare: the innovations that make Snowflake and BigQuery fast and scalable are the same ones that make them bad at in-place updates or indexed queries - both from a performance perspective and as an API (mutations in SQL are not fun). If we want true cross-platform data integration, or a data layer that apps (which aren't just read-only) build on top of, then from a technical standpoint I would expect it to look nothing like the current modern data stack.
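For illustration, the contrast is roughly between the in-place update a transactional app expects and the append-then-derive pattern columnar warehouses are happier with (table names are made up):

```sql
-- What a transactional app expects (slow and awkward in a columnar warehouse):
UPDATE customers SET plan = 'paid' WHERE customer_id = 42;

-- The append-only alternative: record the change...
INSERT INTO customer_changes (customer_id, field, new_value, changed_at)
VALUES (42, 'plan', 'paid', CURRENT_TIMESTAMP);

-- ...and derive the current state, latest change wins:
CREATE VIEW customers_current AS
SELECT customer_id, new_value AS plan
FROM (
    SELECT customer_id, new_value,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY changed_at DESC) AS rn
    FROM customer_changes
    WHERE field = 'plan'
) latest
WHERE rn = 1;
```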
Not to say that there isn't an interesting and broad scope for building mostly read-only data apps on top of our current stack, but addressing mutability makes this really exciting, and I don't think we can easily slap it on at the end.
For sure - some other folks brought up that issue as well: https://twitter.com/sarahcat21/status/1486723639365947406
My view there is that we're a long way from warehouses being the literal backend to every app. But I think we're a lot closer to warehouses being a core entity backend, where the list of customers comes from the warehouse but other concepts are still part of the app. We're already moving in this direction, with data getting pushed from warehouses into apps via reverse ETL; it's just kinda clumsy.
If that's how it works, writing back to the warehouse is still important, but at a latency we'd expect from transactional apps. It's more like pushing changes to git.
I feel like this is describing Materialize as the entity layer - what do you think?
Eh, I think that could be kinda true if it's just about making a warehouse more like a transactional db (which I don't actually think should be the goal). The bigger problem to me isn't latency, but semantics. If you really extend the idea of building apps on top of the warehouse forward, eventually latency becomes an issue. But we've got to figure out how to make data in a warehouse usable for apps before we make it up-to-date for apps.