36 Comments

S3 alone makes for a poor data warehouse because of boring issues like access control. It's far easier to maintain roles and privileges in Snowflake than in AWS IAM. Additionally, features like dynamic row-level and column-level data masking aren't really possible with S3 alone.

Transactional consistency is also a problem. Multi-table transactions aren't natively supported, which makes failure recovery in multi-step transformation jobs more complicated. Instead of simply rolling back a transaction, you're stuck manually cleaning out bad data.
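
To illustrate the rollback point, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for a warehouse with multi-table transactions. The table names and the `make_db`, `run_job`, and `row_counts` helpers are hypothetical, and the failure is simulated. The point is only that a failure between steps leaves neither table touched.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_orders (id INTEGER, amount INTEGER)")
    conn.execute("CREATE TABLE daily_totals (day TEXT, total INTEGER)")
    conn.commit()
    return conn


def run_job(conn, fail_midway):
    """Run a two-step transformation inside a single transaction."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute("INSERT INTO staging_orders VALUES (1, 100)")
            if fail_midway:
                raise RuntimeError("simulated failure between steps")
            conn.execute("INSERT INTO daily_totals VALUES ('2023-03-03', 100)")
    except RuntimeError:
        pass  # nothing to clean up: the first insert was rolled back


def row_counts(conn):
    """Return (rows in staging_orders, rows in daily_totals)."""
    staging = conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
    totals = conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
    return staging, totals


if __name__ == "__main__":
    conn = make_db()
    run_job(conn, fail_midway=True)
    print(row_counts(conn))
    run_job(conn, fail_midway=False)
    print(row_counts(conn))
```

With files in S3 and no transaction spanning both writes, the first write would already be visible after the failure, and a separate cleanup step would have to remove it.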

Mar 3, 2023 · Liked by Benn Stancil

Nothing you referenced is a data warehouse.

A DW is the end result of specific methods to collect and organize data for analytics.

A DW is to arbitrary data sitting in S3 as a church is to a pile of lumber.

Mar 3, 2023 · Liked by Benn Stancil

Clearly the biggest omission from this article is a reference to the following Simpsons clip when you mention "our now-robust balloon popping defenses":

https://www.youtube.com/watch?v=4RV3RXMNGVs&t=80s

Great piece, Benn! Especially loved the floppy disk analogy that instantly and succinctly conveys the past widespread significance as well as the upcoming fall of the Warehouse.

And of course, glad to see DataOS in the mix. Would appreciate your take on the Data Operating System standard that DataOS is based on: https://data-operating-system.com/

Great article!

Isn't the "compute engine(s) querying, and recently modifying data in S3/object storage" approach what the likes of Dremio, Databricks and Starburst are espousing? Their view is that this lakehouse (dumb term imho) is the new DW.

https://www.dremio.com/data-lakehouse/

I have constantly been telling the orgs that I have worked with that we don't have to get hung up on the data store - the place we store data, or the format of it for that matter. I always think of a DW as a virtual representation of an org's analytics data. And if you want to actualize it in storage for performance, so be it.

Mar 4, 2023 · Liked by Benn Stancil

Heh. The iPhone was called a phone because that was the share of wallet Apple was competing for. It has nothing to do with technology, and everything to do with who pays for it, and why.

Mar 3, 2023 · Liked by Benn Stancil

To David Andersen's point, data warehousing is a use case, and we have decades of information now on the methods to collect and organize data to meet data warehouse use cases. Some databases and/or platforms are better at supporting this use case than others.

Having worked for years on a product that didn't have a well-defined analyst quadrant, I think Snowflake did the right thing to define what it does as something different: The Data Cloud. It doesn't mean Snowflake doesn't still target Data Lake and Data Warehouse use cases--it does, aggressively. But having lived through the time of "Hadoop is going to kill the data warehouse", I think the semantics of the use case are certainly less important than the capability of the platform to deliver on the business problem the customer is trying to solve. And if you can deliver Data Lake, Data Warehouse, Streaming and OLTP workloads all in the same managed platform to solve those customer problems, so much the better.

Snowflake messaging appears to be catching on, as well. For example, Google now talks about a "Data Cloud".

[Disclaimer: I work for Snowflake, but do not speak for them.]

Mar 3, 2023 · Liked by Benn Stancil

It can also go the other way around: Snowflake can talk to Google Sheets via external functions [1] or query a Parquet file in your S3 bucket [2]. Nobody does that because it's much slower than native tables, so you usually "cache" the data in native tables via the `COPY INTO` command or an ETL tool such as Fivetran. It's not even a product limitation; moving the data is slower and even more expensive, so you often end up caching the data in the data-warehouse.

My understanding is that the data-warehouse is where the teams access the company data. It's not necessarily "the single store of truth" but rather where you access "the truth".

[1]: https://docs.snowflake.com/en/sql-reference/external-functions-introduction

[2]: https://docs.snowflake.com/en/user-guide/querying-stage

I don't think data warehouses are necessarily so ambiguous. They are defined by the functions they enable, in particular being a "single source of truth" and refining raw data into business-usable data assets. They consolidate various systems of record across the company and provide a centralized location where we can perform validation, semantics and transformation. I agree they shouldn't be defined by their infrastructural design, as this is constantly changing due to technology improvements (Postgres "data warehouse" vs. Excel vs. MapReduce vs. Redshift). Their function, however - centralized OLAP - is largely stable.

Mar 5, 2023·edited Mar 5, 2023

Snowflake can't compete as a database; its performance and price/performance are fundamentally hampered by its design. They need to stay as far away from this aspect of the market as possible, and "data cloud" is exactly the crackpipe that middle managers want to smoke.
