I'll age myself by saying that I remember Impala, and was also at the first Databricks Strata tutorial circa 2014. I started using both Databricks and Snowflake relatively early on--2016/2017. And while I love, love, LOVE your posts, I think there's one thing you get wrong here. Before roughly 2019, Databricks wasn't at all "a big, fast database that you can write SQL and Python against." Yes, you could query tables with SQL, but all of the underlying stuff you had to do with S3 and cluster management made it feel a lot more like Hadoop than Redshift or Snowflake. So much so that my DS/ML teams used Databricks because we liked Python, but it was totally infeasible to make our Analytics team use it instead of Snowflake.
That all changed in 2020, when Databricks released Delta and very slowly integrated it into their product offering. Delta is basically OSS Snowflake, and since then, Databricks and Snowflake have been slowly converging. Finally, in the last year or so, Delta feels a lot like Snowflake (with a nice UI, simplified SQL clusters like Snowflake warehouses, etc.). So it really is a big, fast database that you can program with Python, Scala, and SQL.
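To make that concrete, here's roughly what that converged experience looks like -- a minimal sketch, assuming a Databricks notebook where `spark` is already provided, and with a made-up `analytics.events` table name:

```python
# A sketch of the "it's just a database now" experience with Delta.
# Assumes a Databricks notebook (where `spark` is predefined); the
# schema/table name below is hypothetical.
from pyspark.sql import functions as F

# Write a Delta table the way you'd load any warehouse table...
events = spark.createDataFrame([(1, "signup"), (2, "login")], ["user_id", "event"])
events.write.format("delta").mode("overwrite").saveAsTable("analytics.events")

# ...then query it with plain SQL, no S3 paths or cluster plumbing in sight...
spark.sql("SELECT event, COUNT(*) AS n FROM analytics.events GROUP BY event").show()

# ...or with the DataFrame API from Python.
spark.table("analytics.events").groupBy("event").agg(F.count("*").alias("n")).show()
```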
Approaching from the other direction, Snowflake has tried to open itself up to Python with Snowpark, where they essentially copied the Spark API, but as far as I can tell it's mostly just marketing hype. I don't think Snowpark Python is even generally available yet.
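For what it's worth, the resemblance is hard to miss in code. Here's a minimal sketch of the Snowpark Python DataFrame API; the connection parameters and the table name are made up, and this is only meant to show how closely it tracks PySpark:

```python
# A hedged sketch of Snowpark Python mirroring the Spark DataFrame API.
# Credentials and table name are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs({
    "account": "my_account",    # hypothetical credentials
    "user": "my_user",
    "password": "my_password",
}).create()

# Nearly line-for-line what you'd write against spark.table(...) in PySpark:
(session.table("analytics.events")
    .filter(col("event") == "signup")
    .select("user_id", "event")
    .show())
```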
So I agree--you're totally right about how Databricks should be marketing itself now. But I think their tech couldn't back that up before the last year or two... Not that that usually stops the marketing people. But maybe, as reluctant academics, they had a bit more shame?
That's fair, though I think those two things are intertwined. The mistake wasn't strictly failing to market a product that could've been a big, dumb database; it was not seeing that they were a couple steps away from building a big, dumb database that would've been really valuable to build. It seems like, in both the product and marketing, they were tied to this grander vision.
That said, to your point about Snowpark, that's Databricks' opportunity now. If they sort this stuff out, I think they have a higher ceiling than Snowflake, but to get there, they've got to beat Snowflake on the meat-and-potatoes "I need a database" deals that Snowflake seems extremely proficient at closing.
This argument feels akin to claiming Airbnb should have just been another OTA marketing hotels. Databricks is helping increase the TAM by making AI/ML easier and more accessible, while also cementing their position well ahead of others in the area. They are enabling new use cases rather than just replacing vendors for existing ones. Similar to how Airbnb is expanding into hotels, Databricks is now (for the past two years) expanding into general analytics and "boring" database stuff. If you [Databricks] are in this for the long run, it seems like the smart approach to me. Do you disagree?
I do. If you've got the ability to be 1) a much better version of the thing that people already have, or 2) something entirely new that people don't quite understand, I think 1) is a better path. Solving new problems is a tougher sell than improving on something that people already know how to use, assess, measure, and implement.
On the Airbnb analogy, I'd see that differently. Airbnb did market directly to people who want hotels. It wasn't a new use case; it was a new form for the same problem. Had Airbnb done what Databricks did, they would've started with something like Experiences, which would be closer to a "new use case rather than replacing an existing one."
DeFi made it possible to move through the blockchain faster, to build in multiple containers and multiple languages.
As someone who has been in the data industry for a long time, and who spent the years between 2012 and about 2018 feeling vaguely stupid much of the time for my inability to mentally stitch together the myriad Big Data technologies that were constantly emerging, merging, and disappearing during that time, I find this post to be extremely soothing. Perhaps there is some kind of entropic data tech law that dictates that, eventually, all data tech becomes databases?
Thanks! And I suspect there's something to that, where most products eventually collapse down into a handful of things. At the end of the day, we're all either building a database, a data pipeline, or a BI tool, no matter how much we say our thing is different.
did someone say tarot? 🔮
Love the thoughts and agreed with the structure of relevant prior art.
2 thoughts:
1) Doesn't it make sense for Databricks in 2015 to be "a better Hadoop" for companies with Uber- or Pinterest-sized data, and Snowflake to be "a better Redshift" for companies with smaller data? In that Venn diagram, there are some companies that cross over, but many won't for decades.
2) What are your thoughts on the role these tools will play in the next shift to a better architectural pattern (aka Data Mesh)? That architectural evolution is being driven not by tooling but by internal org structure / drift in knowledge management. It's why, imo, data catalogs haven't worked: organizations haven't really iterated their way to a novel org structure capable of maintaining data.
On 1), that would make sense if Databricks could actually scale better than Snowflake, but I don't think that was the case, at least not in a meaningful way. So Snowflake works for people with both small and big data. Plus, if your market is Uber-sized data, you can't sell to that many people. The boring masses are a much bigger market than a few cutting-edge companies (which are also inclined to build internal solutions for their very specific use cases).
On 2), I think both Databricks and Snowflake help there, because they make data centralization actually possible. That doesn't mean the whole data stack should be centralized, but starting from a centralized core and fanning out is almost certainly easier to manage than some loose network of departmental data tools.
ClickHouse says hi
You left out Azure Synapse as Microsoft's potential alternative to Databricks. It's (currently) still behind Databricks on some key features, and the cost of a dedicated SQL Pool in Synapse is still a bit hard to swallow, but MS is moving fast. The Synapse team is working hard to make it super easy to use for young / small analytics teams. It will be interesting to watch how the Azure Synapse / Databricks relationship evolves over the next year or two.
Synapse V3 has been "coming soon" for 3 years now. Customers are getting angry.
They mean "soon" in a geological sense.
Yeah, I imagine a lot of the partnerships in the space start to become a lot more standoffish. That's already happened some with Snowflake and AWS, and I could see it happening with Databricks and Microsoft.
Great read. Today I was just chatting with one of these companies you listed, and mentioned a few things that echo your points. First, communication and marketing are everything. From day one, Snowflake knew how to sell to the enterprise. This cannot be overstated. Their growth is directly related to knowing what enterprises want and delivering it in a way that's stupidly simple to understand. The "high IQ" vendors somehow struggle with this. As my old boss said, "when the customer wants to buy, shut up and take the sale." Second, Big Data died many years ago, and the companies still pitching it are like the zombies in The Walking Dead getting brained left and right. Third, the dark horse the big incumbent DW/DLH vendors need to watch out for is the "live data stack," where applications, real-time, next-gen OLAP, and ML have a seamless feedback loop that basically nullifies the existing MDS paradigm. That's coming...
P.S. Longtime Spark and DB user since 2014, so very familiar with its evolution
Your last point is why I think Databricks could win this whole thing, if they figure out your first point. They have more capacity for being high-ceiling data science/ML/application infrastructure, but they have to make sure that doesn't get in the way of making the simple sale.
As usual, I enjoyed reading the article.
However, the main point I am getting from this article is that the mistake Databricks made is around sales and marketing. That has never been an issue for me. The initial hype from the demo drew me in when I attended Strata back in 2015. I set up a POC immediately and thought it was amazing, but didn't touch it again for a couple of years. Fast-forward two jobs and many Hadoop headaches later, and I gladly jumped back into it.
I hate empty 'solutions-oriented' pitches as much as anyone, but I do like the unified analytics platform they promote. I currently work in an organization with a small data staff, and having data science and data engineering in the same platform works really well. I also just really like working with Databricks. The notebook structure is great: I like being able to switch from SQL to Python and (rarely) R/Scala, as in the sketch below. Scheduling ETL jobs is simple (it's just a notebook!). Being able to develop machine learning models on the same platform is key for us too. Databricks support has also been great, especially considering we do not spend much with them.
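For readers who haven't seen it, the language-switching works roughly like this -- a sketch of a Databricks Python notebook, assuming `spark` is provided and with a made-up view name:

```python
# Cell 1 (Python): build a small DataFrame and expose it as a temp view.
# Assumes a Databricks notebook where `spark` is predefined; the view
# name "numbers" is hypothetical.
df = spark.range(5).withColumnRenamed("id", "n")
df.createOrReplaceTempView("numbers")

# Cell 2 would switch languages with a cell magic and query the same view:
# %sql
# SELECT n, n * n AS n_squared FROM numbers

# Cell 3 could drop into Scala with %scala (or R with %r), and a scheduled
# job can simply point at this whole notebook as its task.
```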
Full disclosure, I have never used Snowflake, or dbt for that matter. I know those are quite popular right now. I am definitely curious, but I just don't have an opportunity to use them. I also don't see a need. Is there any reason other than the sales/marketing pitch that you prefer Snowflake/dbt? Cost? Simplicity? Functionality?
Thanks
On the mistake being sales and marketing, I think that's true, though I don't see that as entirely separable from product. Marketing and product amplify each other, so if your marketing message is Big Data Platform, you'll build a product that tilts that way.
The thing that Snowflake did really well is they pitched a product that anyone could buy. You're an enterprise with tons of wild data needs? Snowflake can help. You're a mid-sized business running SQL Server? Snowflake can help. You've got a MySQL database and are looking for your first analytical warehouse? Snowflake can help.
Databricks, as best I can tell, never landed that way because they talked a lot about all of these advanced features they had, which made it seem like you had to have particular needs to use it.
Big Data wasn't the mistake,
and neither was Snow.
Spark and ATM are processing technologies derived from the database.
Technology advances very fast.
How many years will Snow last?