The ghosts in the data stack

Mar 25, 2022

An OLAP cube exorcism.

18 Comments

Mar 28, 2022

Great post and thank you for addressing an oft-overlooked topic. I define databases by what I think of as their fundamental modeling unit. RDBMSs model everything as a relation (or table). XML databases model everything as an XML document. Hierarchical databases as hierarchies. Network databases as a directed acyclic graph. Multi-dimensional DBMSs (MDDBMSs) aka OLAP cubes model everything as hypercubes. A hypercube has (hierarchical) dimensions as edges and store metrics in the intersections -- the classic example being product, geography, and time. As in, show me sales (the metric) of widgets (product) in NY (geo) in 3Q20 (time). Not sure if that helps much but it tries to give a conceptual definition of them. They were initially launched to be generally useful but found only one real use-case: financial planning and budgeting apps which do a slot of slicing, dicing, and drilling (and aggregation) across dimensions. EPM vendors today all deal with the basic nightmare that OLAP databases have died off and, to my knowledge, only Anaplan has built a real substitute. You are correct that they were fast, trading precomputation and storage for speed. You are also correct that they are not relevant in the database industry these days and that, e.g., Microsoft is basically phasing SSAS out leaving something in place that almost but not quite replaces it if you're building an EPM application. FWIW, I don't find the cube-viewer interfaces counter-intuitive; Cognos made a fortune selling PowerPlay for years which was a cube-viewer. At BusinessObjects, where I worked, we tended to build cubes on the fly (which size limited them) and presented a more dynamic reporting interface as opposed to a slicer/dicer interface which people, to your point, seemed to like better.

Expand full comment

Reply (1)

Benn Stancil

Mar 30, 2022

Thanks (and fan of the blog!). And as I was looking through old products for this post, I was struck by the Business Objects interface (like this one: https://i1.wp.com/bisystembuilders.com/wp-content/uploads/2011/03/Explorer.jpg), and how much it actually moved away from the OLAP slicing and dicing that so many other tools offer. I think there's a lot to be learned from that. New companies tend to look at older products and think they have to reinvent the way they did things, but my suspicion is it's exactly the opposite - we should recognize that those products were successful for a reason, and, while technologies may be different, people aren't.

Expand full comment

BenShen

Feb 28

https://open.substack.com/pub/benn/p/ghosts-in-the-data-stack?r=4fyhb4&selection=175e3ca1-134d-4027-99a9-d4cefd1f2236&utm_campaign=post-share-selection&utm_medium=web

Revisiting this paragraph in the age of LLM 2025 gives me chill

Expand full comment

Reply (1)

Benn Stancil

Feb 28

Ah, the Census website, enjoy it while we still can.

Expand full comment

Mike Mahoney

Mar 27, 2022

Good post! I have to disagree with the Census as the worst government data portal, though; my personal nominee is trying to get soil data from the USDA (though can think of plenty of other critical data sets/sources that are only a smidge better):

https://websoilsurvey.sc.egov.usda.gov/App/HomePage.htm

Expand full comment

Reply (1)

Benn Stancil

Mar 28, 2022

ok but that big green button is spectacular

Expand full comment

Cirrus Shakeri

Mar 26, 2022

One way to build more intuitive data tools is to go back to the first principles, in this case understanding why relational databases are called relational. A table is not necessarily describing a set of objects. Otherwise Codd would have called his invention object databases. A table in relational databases describes an n-ary relation between a set of attributes--not even a set of things but a set of attributes that might belong to different things. Sometimes the set of attributes happen to map well to distinct objects such as the purchases in Benn's example but other times it might just describe the relations between aggregated attributes that map better to the business questions. Relational tables are a genius way to model the real world where everything is somehow related to everything else. It's a way to draw some boundaries around blobs of things in the problem domain and store them in databases. So if I were to experiment with designing a more intuitive data tool for business users that didn't have to rely on cubes, I would start from 'relations'. A tool that would let business users describe how sales, states, months of the year and products are related. I have a hunch that it might lead to some interesting or even maybe more helpful new designs.

Expand full comment

Reply (1)

Benn Stancil

Mar 28, 2022

Yeah, and to be clear, I don't have an objection with relational models, or even with cubes as a piece of technology that resolves how we connect things together, make aggregations across those related ideas possible, and things like that. The question for me is how do you best expose those ideas to people who aren't as steeped in these concepts.

Expand full comment

Dave Mariani

Mar 25, 2022

I have to disagree with your point that OLAP is less intuitive for users. To me, OLAP has two main benefits: (1) it's fast and (2) it's a business-friendly semantic layer that allows "anything by anything" queries. It's the business-friendly semantic layer that makes OLAP intuitive for everyone, not just data engineers. The speed makes it even better because users can ask questions as fast as they can think of them. I would agree with you that OLAP as a technology, meaning pre-calculation of aggregates, is dead. By pre-calculating views of data you inherently limit which questions can be asked. The key is to keep the OLAP benefits of speed and semantics and ditch the technical implementation that stopped scaling in the 90s.

Expand full comment

Reply (1)

Benn Stancil

Mar 26, 2022

Even a lot of data people don’t understand cubes though. That’s not to say that they’re impossible to make sense of, or that some people don’t prefer that model. But it seems uncontroversial to me to say that they’re often challenging to understand.

Expand full comment

Reply (1)

Dave Mariani

Mar 28, 2022

Yes, I agree that thinking in multiple dimensions is a mindblower. My experience at Yahoo! though was a game changer for my thinking about dimensionality. I delivered a 24TB SSAS cube to our display advertising team that had hundreds of metrics and attributes. I was worried that my users would get lost but they used it to deliver $50m in extra value in the business annually and created the most amazing analytics that I never imagined was possible. That experience convinced me that a multidimensional interface coupled with subject matter experts and a good BI tool like Tableau (or Excel!) can deliver unbounded value.

Expand full comment

Reply (1)

Scott Edwards

Mar 31, 2022

Awesome example, Dave. I miss the SSAS days (and MDX!) - curious, what sort of insights and analytics did they product that amazed you so much?

Expand full comment

Reply (1)

Dave Mariani

Mar 31, 2022

There were a couple of categories where I was blown away. First, they developed spreadsheet models using the data that allowed them to change campaign targeting rules as a campaign was running - that's what delivered the $50m worth of lift in the advertising revenue. Second, it was the Tableau visualizations. They were works of art and not something that I could have ever imagined could be done.

Expand full comment

Gordon Wong

Mar 25, 2022

I both agree and disagree with your point. If we focus on olap cubes as a discrete structure in themselves, I think we miss the point. Instead, I would suggest looking at a cube as a model of dimensions and possible intersections. Each "cell" in the cube is a crossing of dimensions and descriptions of that. Cubes get super complicated because they represent so many possibilities.

Expand full comment

Reply (1)

Andrew Padilla

Mar 25, 2022

Gordon, doesn't your rationalization require implicit knowledge for a user to even understand? If we don't meet the user where they are, aren't we missing the point no matter how elegant (or true) our rationalizations about the data are?

Expand full comment

Reply (1)

Gordon Wong

Mar 25, 2022

I think I understand your point Andrew...maybe? Are you saying that if we go down the path of modeling the data, without due focus on the problem the users are trying to solve, then we're making it really hard for them to use our solutions. Doesn't matter how accurate the solution...we're setting up the users for a rough ride. If so...yes, I totally agree!

As solution builder, if you don't start with the why and the users' needs, you're making a big mistake ie big data, 2 year data warehouse builds, most-of-the-projects-in-the-1st-half-of-my-career.

Getting back to cubes, I think a better way of making my point is to encourage us to continue to think multi-dimensionally. Cubes are just a method for guiding that thinking and maybe build structures that support that thinking. Consider a simple Sales Cube. Could be revenue and cost by time, geography and product. Time, geography and product are the 3 dimensions of the cube and revenue and cost are the attributes at the intersection of them.

Expand full comment

Reply (1)

Benn Stancil

Mar 26, 2022

Ideally, sure, we should do more to help everyone to think about data as a cube, and all the benefits that come along with that. In practice though, I don’t think that’ll happen. We’ve had cubes for a long time, and people - including a lot of data people! - still don’t really get it. If the industry’s aim is to “make the world more data literate” and all that, I think we’ve got to do a lot more to meet people where they are, rather than trying to drag them towards a (inherently confusing) abstraction of data like a cube.

Expand full comment

Reply (1)

Gordon Wong

Mar 28, 2022

I'm probably guilty of over-simplifying here. To me, cubes and star schemas are basically the same concept. They're just ways to think about events and context. Neither is an end state...heck, it would be a mistake to even consider them data products.

I'm going to back your play here...let's focus on the last mile, maybe the last inch and work backwards from the business problem

Expand full comment

benn.substack

The ghosts in the data stack