S3 alone makes for a poor data warehouse because of boring issues like access control. It's far easier to maintain roles and privileges in Snowflake than in AWS IAM. Additionally, features like dynamic row-level and column-level data masking aren't really possible with S3 alone.
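To make the masking point concrete, here's a toy Python sketch of the query-time column masking a warehouse policy engine can do and S3 can't (S3 serves objects whole, so there's nowhere to rewrite values per caller). The role names and the policy itself are invented for illustration:

```python
def mask_email(value: str) -> str:
    """Redact the local part of an email, keeping the domain."""
    _, _, domain = value.partition("@")
    return "*****@" + domain

def query(rows, role):
    """Apply a column-level masking policy based on the caller's role."""
    if role == "analyst":  # unprivileged role sees masked values
        return [{**r, "email": mask_email(r["email"])} for r in rows]
    return rows  # e.g. an "admin" role sees raw data

rows = [{"id": 1, "email": "jane@example.com"}]
print(query(rows, "analyst")[0]["email"])  # *****@example.com
print(query(rows, "admin")[0]["email"])    # jane@example.com
```

In a warehouse, a policy like this is attached to the column itself, so every query path gets it for free rather than relying on application code.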
Transactional consistency is also a problem. Multi-table transactions aren't natively supported, which makes failure recovery in multi-step transformation jobs more complicated. Instead of simply rolling back a transaction, you're stuck manually cleaning out bad data.
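For comparison, here's what native multi-table transactions buy you, sketched with SQLite standing in for the warehouse: if any step of a multi-step job fails, one rollback undoes every table the job touched, instead of the manual cleanup you'd face on S3.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so we can
# control transaction boundaries explicitly with BEGIN/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE order_items (order_id INTEGER)")

try:
    conn.execute("BEGIN")
    conn.execute("INSERT INTO orders VALUES (1)")
    conn.execute("INSERT INTO order_items VALUES (1)")
    raise RuntimeError("downstream transform failed")  # simulated mid-job failure
except RuntimeError:
    conn.execute("ROLLBACK")  # both tables revert atomically

# Neither table kept the partial writes:
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])       # 0
print(conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0])  # 0
```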
Yeah, this is one of the things that feels like the S3 + dbt approach would really struggle to deal with. But I could see something like Iceberg developing standards for persisting these kinds of concepts alongside the data itself (if it doesn't already), which could make it manageable.
Nothing you referenced is a data warehouse.
A DW is the end result of specific methods to collect and organize data for analytics.
A DW is to arbitrary data sitting in S3 as a church is to a pile of lumber.
I concede that I have not ruled out the possibility that a DW is also a sandwich. I will eat a few more and report back.
So I think that mostly makes sense, but there's something that makes me refuse to call a church-shaped building made out of boards of Excel a church. I'd say that it's "like a church," but I'm not sure I could ever really say, yes, that's definitely a church. (And as I said in another comment, I'd also probably call something that wasn't really shaped that much like a church, but was made from Redshift boards, a church.)
Which may well be wrong, but I don't think it's that uncommon.
Got a little lost in the mixed analogies. :)
If someone can deliver all the requirements of a DW in Excel (or PowerPoint, or WordPerfect, or Access, or a turkey sandwich) - it's a DW.
Redshift, by itself, is not. Nor is Snowflake. Nor is the Oracle RDBMS with star joins enabled. Etc.
I'll concede that that's a consistent definition, but I won't concede that I like it.
Clearly the biggest omission from this article is a reference to the following Simpsons clip when you mention "our now-robust balloon popping defenses":
https://www.youtube.com/watch?v=4RV3RXMNGVs&t=80s
Ok this is good.
(Also, I have to admit that I know nothing about the Simpsons. I'm not sure I've ever seen a full episode all the way through. I'm a letdown to my generation.)
What were you watching back then?
Seinfeld is the only comedy I really watched from the 90s. Plus some All That and Kenan & Kel.
Great piece, Benn! Especially loved the floppy disk analogy that instantly and succinctly conveys the past widespread significance as well as the upcoming fall of the Warehouse.
And of course, glad to see DataOS in the mix. Would appreciate your take on the Data Operating System standard that DataOS is based on: https://data-operating-system.com/
So my somewhat crude take is the details don't really matter. The success or failure of these kinds of systems and layers will depend almost entirely on adoption - if everyone uses it, so will I, regardless of how well or poorly it's designed. And while that design can help with getting those first folks to use it, what matters a lot more is distribution and the mechanic that gets the thing to a critical mass. An incredible OS that's only useful when the ecosystem uses it won't go anywhere; a lousy OS that figures out a way to kickstart the cold start problem has real potential.
Absolutely! I like to think of it as an inertia problem. The tech that overcomes the critical threshold of inertia and reaches mass adoption tends to be what ends up becoming a milestone in the evolutionary landscape.
How quickly can one get started with it >>> How many amazing feats it can achieve
High resistance to the first step automatically cuts down the tech's future potential due to the intrinsic drop off patterns of adopters.
Only sort of related, but that makes me wonder what the world would look like if we were a lot better at choosing the "best" thing (e.g., if instead of Substack being the blogging platform du jour because of various lucky reasons, the actual best one (whatever that is and means) was what we chose). I wonder how different the world would be.
Love this. The conclusion is open to interpretation, but this reminds me of the double-slit experiment that changed the face of modern physics by posing the observer (an analogy for the choosers) as the one influencing the outcome of the experiment (read: the universe). Perhaps the key is to garner enough "observers" to focus their choices on the "best" spectrum (whatever that is and means).
Great article!
Isn't the "compute engine(s) querying, and recently modifying data in S3/object storage" approach what the likes of Dremio, Databricks and Starburst are espousing? Their view is that this lakehouse (dumb term imho) is the new DW.
https://www.dremio.com/data-lakehouse/
Thanks!
And yeah, it's similar, though they're very enterprise-y products that seem to sell themselves in the way I mentioned in that footnote, where it's pretty hard to pin down exactly what it does. Which, valuable as that is, I think undercuts them as potential architectural revolutions, because that architecture is buried behind white papers about new IT data solutions.
I have constantly been telling the orgs that I have worked with that we don't have to get hung up on the data store - the place we store data, or the format of it for that matter. I always think of a DW as a virtual representation of an org's analytics data. And if you want to actualize it in storage for performance, so be it.
I think this makes sense, but that's also the freeing thought. If a warehouse isn't the object but what it does, maybe the warehouse could look very different than we picture them today.
Heh. The reason the iPhone was called a phone was that that was the share of wallet Apple was competing for. It has nothing to do with technology, and everything to do with who pays for it, and why.
But that's the point, no? Money today is spent on databases and warehouses, not distributed data lakes and data clouds.
To David Andersen's point, data warehousing is a use case, and we have decades of information now on the methods to collect and organize data to meet data warehouse use cases. Some databases and/or platforms are better at supporting this use case than others.
Having worked on a product for years that didn't have a well defined analyst quadrant, I think Snowflake did the right thing to define what it does as something different: The Data Cloud. It doesn't mean Snowflake doesn't still target Data Lake and Data Warehouse use cases--it does, aggressively. But having lived through the time of "Hadoop is going to kill the data warehouse", I think the semantics of what the use case is is certainly less important than the capability of the platform to deliver on the business problem the customer is trying to solve. And if you can deliver Data Lake, Data Warehouse, Streaming and OLTP workloads all in the same managed platform to solve those customer problems, so much the better.
Snowflake messaging appears to be catching on, as well. For example, Google now talks about a "Data Cloud".
[Disclaimer: I work for Snowflake, but do not speak for them.]
Yeah, I'm sure they (y'all) can make it work, and as I said, you've got plenty of evidence that you're doing a lot very right. But, to your Hadoop example, that's sorta my point. Hadoop tried to kill the warehouse, and a lot of people were like, eh, we like our warehouse and will just keep it, tyvm. But what if Hadoop had said, no, we're just a warehouse too. Just a better one. I don't think the tech would've delivered on that, but the promise might've made the transition easier. And Snowflake does deliver on that promise.
I think in practice it is just very difficult to do. Startups have the advantage of no technical debt, but the downside is closing the gaps on perceived required features, some of which were built over decades. Certainly there were a ton of us in the data warehousing space at the time that were saying that it made no sense to use Hadoop for that purpose.
Companies in the Hadoop space were putting out that they could take on data warehouse workloads and fueling that press, but as Omar said, you come at the king, you best not miss. One of the first quotes I heard from Snowflake co-founder Benoit Dageville, long before I ever joined, was his brilliant reaction to "relational is dead." The reality was that Snowflake was able to build an entire platform from scratch that showed why Benoit was correct: no amount of Apache Impala and other band-aids on top of Hadoop was ever going to match it. An interesting contrast to this approach was Teradata, whose reaction was to buy Aster Data and try to offer a hybrid solution.
Yeah, it seems like an entire market of products got built in the wake of that mistaken belief. If you were out betting against Hadoop and on relational tools, SQL, and so on in 2015, you did pretty well.
It can also go the other way around: Snowflake can talk to Google Sheets via external functions [1], or to a Parquet file in your S3 bucket [2]. The reason nobody does that is that it's much slower than native tables, so you usually "cache" the data in native tables via their `COPY` command or an ETL tool such as Fivetran. It's not even a product limitation; moving the data is slower and more expensive, so you often end up caching the data in the data warehouse.
My understanding is that the data warehouse is where the teams access the company data. It's not necessarily "the single store of truth" but rather where you access "the truth".
[1]: https://docs.snowflake.com/en/sql-reference/external-functions-introduction
[2]: https://docs.snowflake.com/en/user-guide/querying-stage
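The caching pattern described above is basically this, sketched in Python (the dictionaries stand in for an external stage and native tables; none of this is Snowflake's actual machinery):

```python
import time

EXTERNAL = {"orders": [1, 2, 3]}   # stands in for S3 / an external function
native_cache = {}                  # stands in for native warehouse tables

def read_external(table):
    """Pull from the external source -- comparatively slow and costly."""
    time.sleep(0.01)
    return EXTERNAL[table]

def query(table):
    if table not in native_cache:       # first read: "COPY" into native storage
        native_cache[table] = read_external(table)
    return native_cache[table]          # later reads: cheap native scan

query("orders")          # slow: pulls from the external source
print(query("orders"))   # fast: served from the cached native table -> [1, 2, 3]
```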
On Snowflake, it makes sense why you wouldn't use it that way, though you *could.* But fair that it'd be pretty silly to actually do it.
I like the point about access. It's sort of the looking glass through which you see the truth. I can get down with that.
I don't think data warehouses are necessarily so ambiguous. They are defined by the functions they enable, in particular being a "single source of truth" and refining raw data into business-usable data assets. They consolidate various systems of record across the company and provide a centralized location where we can perform validation, semantics, and transformation. I agree they shouldn't be defined by their infrastructural design, as this is constantly changing due to technology improvements (Postgres "data warehouse" vs. Excel vs. MapReduce vs. Redshift). Their function, however - centralized OLAP - is largely stable.
I can see this, though something about it feels...unsatisfying. I think it's like the phone analogy. You could make the case that a phone is portable device that enables certain functions, but unless you talk about the physical product itself, it's hard to explain why an iPad isn't a phone.
This feels similar to me. For example, if someone was using Excel for all of these data warehousing functions, and they asked me if Excel was their warehouse, I'd probably say no, but you're using it like it is. Whereas if they were using Redshift to do only a couple of these things and asked me the same question, I'd probably say, it is the warehouse, but they're not using it right.
But I'm not sure why I'd say that. True, I may well be wrong, but something makes me want to give that answer.
Snowflake can't compete as a database; its performance and price/performance are fundamentally hampered by its design. They need to stay as far away from this aspect of the market as possible, and "data cloud" is exactly the crackpipe that middle managers want to smoke.
When you say "this aspect of the market," which market are you referring to?
The DW (and database) market has some aspects of a commodity market, where basic stats such as price/performance and cost per query are measured and compared, and some innovative aspects where new features come out and people pay a premium.
Column stores are an example of innovation that significantly changed the commodity metrics -- they're several times faster on most DW workloads, and that feature alone sold billions of dollars worth of software.
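A toy Python illustration of where that speedup comes from — this isn't a real storage engine, just the shape of the access pattern:

```python
rows = [{"id": i, "region": "us", "amount": i * 2} for i in range(1000)]

# Row store: each row is stored (and read) whole, even when the query
# only needs one field of it.
total_row = sum(r["amount"] for r in rows)

# Column store: the same data, one contiguous array per column; an
# aggregate scans only the "amount" column, which is also far easier
# to compress.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
total_col = sum(columns["amount"])

print(total_row == total_col)  # True -- same answer, a fraction of the I/O
```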
Snowflake came along without a number of the features which make column stores fast. It lags in most benchmarks, including internal ones I've done myself. It doesn't even delete data well.
But it is "the data cloud." To be honest, I have no idea what that means. I run a large ad sales application that pulls data from an S3 data lake, ingests and exports data to a number of cloud destinations, external customers, and ML models, and delivers queries to a web application in 480 ms median, for about $2 per.
None of this uses Snowflake, but if you whisper the words "data cloud" into our VPs' ears, they will rapturously recite how it has transformed our organization.
General comments:
1) Costs don't exist in a vacuum. Imagine a world where you have 20 full-time DBAs to stand up and tune an enterprise MPP platform, and after you migrate to Snowflake you use a single FTE to manage it. Imagine not having to manage backups. Imagine not vacuuming tables. Imagine resizing a cluster up or down in under a second, and having clusters shut down automatically when not in use so you don't have to pay for them. Lots of vendors in this space conveniently leave the maintenance tasks out of their cost calculators. And some vendors hide a lot of their cost in the customer's cloud provider bill, separate from the price they charge the customer directly.
2) Many benchmarks aren't representative of a customer's real workload nor of how customers use Snowflake in practice.
3) Traditionally Snowflake has targeted analytic use cases and hasn't targeted operational use cases. If you have a 480ms SLA query, that is probably operational in nature. Last year Snowflake announced Unistore / hybrid tables to target exactly this type of use case, and that feature is in Private Preview now.
4) Think of the Data Cloud as both the platform as well as data in that platform potentially available via live sharing. Take Salesforce.com. Very popular system, 150k customers. Today, if you want to get data out quickly, you probably have to call an API, and tons of ETL vendors exist today that will connect to that API, download the data, and then load it into your own database. If there are 70 tables you want to pull down, maybe you have 70 jobs. You may also be subject to Salesforce API limits. Where does the Data Cloud come in? Last September Salesforce announced Genie which has an option for a live Snowflake share which is currently in Private Preview. Something happens in Salesforce and it will be available via query in Snowflake. Without running any jobs. Immediately. This data doesn't get downloaded from your Salesforce into your Snowflake account, yet you can query it like it is.
Now imagine that it isn't just Salesforce. It is every partner, vendor, and customer you ever have to exchange data with. Live sharing in a secure, governed way also opens up use cases like clean rooms.
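Point 1 above is easy to see with back-of-the-envelope numbers; every figure in this sketch is invented for illustration:

```python
# License price alone misrepresents cost once you fold in the people
# needed to run the thing. All figures below are made up.

FTE_COST = 150_000  # assumed fully loaded cost per admin, per year

def total_cost(license_per_year, admin_ftes):
    """Annual cost of ownership: license plus the staff to operate it."""
    return license_per_year + admin_ftes * FTE_COST

on_prem_mpp = total_cost(license_per_year=500_000, admin_ftes=20)   # 20 DBAs
managed_dw = total_cost(license_per_year=1_200_000, admin_ftes=1)   # 1 FTE

print(on_prem_mpp)  # 3500000
print(managed_dw)   # 1350000 -- pricier license, far cheaper to own
```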
All that makes sense, and I get why it's all potentially valuable, though I don't necessarily see why that means the "data cloud" branding is necessary. Again, it clearly works, so who am I to question it. But to the iPhone point, you could certainly list a bunch of things that make an iPhone way more than a phone. Yet we still talk about them as phones, not as some new thing. You could potentially even go further than that, and say that Apple _went back_ to the phone branding by not launching the iPhone as a PDA or something like that.
(Of course, this is all mostly academic and doesn't really matter, but that could also be the title of this blog - MostlyAcademicAndDoesntReallyMatter dot substack dot com)
Gotcha. Yeah, that seems like the argument for both sides of this in a nutshell, actually.
For data folks, it seems like Snowflake would've been better off saying this is the new definition of a warehouse, and try to make the features they have (and not the features they don't) what's expected of any warehouse you buy. But, the data cloud bit seems to work with VPs, so, I guess they know what they're doing.