Databricks, Snowflake, and the end of an overhyped era.
I'll age myself by saying that I remember Impala, and I was also at the first Databricks Strata tutorial circa 2014. I started using both Databricks and Snowflake relatively early on, in 2016/2017. And while I love, love, LOVE your posts, I think there's one thing you get wrong here. Before roughly 2019, Databricks wasn't at all "a big, fast database that you can write SQL and Python against." Yes, you could query tables with SQL, but all of the underlying work you had to do with S3 and cluster management made it feel a lot more like Hadoop than Redshift or Snowflake. So much so that my DS/ML teams used Databricks because we liked Python, but it was totally infeasible to make our Analytics team use it instead of Snowflake.
That all changed in 2020, when Databricks released Delta and very slowly integrated it into their product offering. Delta is basically OSS Snowflake, and since then, Databricks and Snowflake have been slowly converging. Finally, in the last year or so, Databricks has come to feel a lot like Snowflake (with a nice UI, simplified SQL clusters akin to Snowflake warehouses, etc.). So it really is a big, fast database that you can program with Python, Scala, or SQL.
Approaching from the other direction, Snowflake has tried to open itself up to Python with Snowpark, where they essentially copied the Spark API, but as far as I can tell it's mostly just marketing hype. I don't think Snowpark for Python is even generally available yet.
So I agree: you're totally right about how Databricks should be marketing itself now. But I think their tech couldn't back that up before the last year or two... Not that that usually stops the marketing people. But maybe, as reluctant academics, they had a bit more shame?
As someone who has been in the data industry for a long time, and who spent the years between 2012 and about 2018 feeling vaguely stupid much of the time because of my inability to mentally stitch together the myriad Big Data technologies that were constantly emerging, merging, and disappearing during that time, I find this post to be extremely soothing. Perhaps there is some kind of entropic data tech law that dictates that, eventually, all data tech becomes databases?
did someone say tarot? 🔮
Love the thoughts and agreed with the structure of relevant prior art.
1) Doesn't it make sense for Databricks in 2015 to be "a better Hadoop" for companies with Uber- or Pinterest-sized data, and Snowflake to be "a better Redshift" for companies with smaller data? In that Venn diagram, some companies cross over, but many won't for decades.
2) What are your thoughts on the role these tools will play in the next shift to a better architectural pattern (aka Data Mesh)? This architectural evolution is being driven not by tooling but by internal org structure and drift in knowledge management. It's why, in my opinion, data catalogs haven't worked: organizations haven't really iterated their way to a novel org structure capable of maintaining data.
ClickHouse says hi
You missed out Azure Synapse as Microsoft's potential alternative to Databricks. It's (currently) still behind Databricks in terms of some key features, and the cost for a dedicated SQL Pool in Synapse is still a bit hard to swallow, but MS is moving fast. The Synapse team is working hard to make it super easy to use for young / small Analytics teams. It will be interesting to watch how the Azure Synapse / Databricks relationship evolves over the next year or two.
Great read. Today I was just chatting with one of the companies you listed, and mentioned a few things that echo your points. First, communication and marketing are everything. From day one, Snowflake knew how to sell to the enterprise. This cannot be overstated. Their growth is directly related to knowing what enterprises want, and delivering it in a way that's stupidly simple to understand. The "high IQ" vendors somehow struggle with this. As my old boss said, "when the customer wants to buy, shut up and take the sale." Second, Big Data died many years ago, and the companies still pitching it are like the zombies in The Walking Dead that are getting brained left and right. Third, the dark horse the big incumbent DW/DLH vendors need to watch out for is the "live data stack," where applications, real-time, next-gen OLAP, and ML have a seamless feedback loop that basically nullifies the existing MDS paradigm. That's coming...
P.S. Longtime Spark and DB user since 2014, so very familiar with its evolution
As usual, I enjoyed reading the article.
However, the main point I am getting from this article is that the mistake Databricks made was around sales and marketing. That has never been an issue for me. The initial hype from the demo drew me in when I attended Strata back in 2015. I set up a POC immediately and thought it was amazing, but didn't touch it again for a couple of years. Fast forward two jobs and many Hadoop headaches later, and I gladly jumped back into it.
I hate empty 'solutions-oriented' pitches as much as anyone, but I do like the unified analytics platform they promote. I currently work in an organization with a small data staff. Having data science and data engineering in the same platform works really well. I also just really like working with Databricks. The notebook structure is great. I like being able to switch from SQL to Python and (rarely) R/Scala. Scheduling ETL jobs is simple (it's just a notebook!). Being able to develop machine learning models on the same platform is key for us too. Databricks support has also been great, especially considering we do not spend much with them.
Full disclosure, I have never used Snowflake, or dbt for that matter. I know those are quite popular right now. I am definitely curious, but I just don't have an opportunity to use them. I also don't see a need. Is there any reason other than the sales/marketing pitch that you prefer Snowflake/dbt? Cost? Simplicity? Functionality?
Big Data was not the mistake, and neither was Snowflake. Spark and ATM are processing technologies derived from the database. Technology advances very fast. How many years will Snowflake last?