19 Comments

all great points. the community growth stuff is the standard open source playbook. wondering - do you have opinions on what part of the data stack should be open source or does it not matter to you in the grand scheme of things?

I think that's a really tough question, mostly because there's a kind of ideal, in which we all do things around nice clean standards and nobody ever duplicates anything, and then there's what's practical, in which we all rebuild the same stuff because there's some benefit to having a proprietary version that you can control and update as you see fit.

I'm not sure what the rule for that is though, and I think it's really hard to come up with a general principle of what should be open source, and then apply it to the various pieces of the stack. So as I say that, it occurs to me that maybe it works better to do it inductively, and figure out which pieces could be open source, and see if there's a general rule around them? Like:

- Databases - Would love a single OS query language, but the compute and infrastructure can be proprietary.

- Storage - OS storage format? Sure.

- ETL - A standard like Singer for how to extract data from SaaS apps and load it into the OS format seems good? But not sure you can start with the standard there; I suspect you have to have a dominant player OS their approach, and use their weight to force consolidation around that standard.

- Transformation - dbt is probably right here, have an OS transformation paradigm so we can all do it with some consistency.

- Data science stuff - OS the languages and packages, yes please.

- BI - Would like there to be a standard OS way to build visualizations (something like Vega, but better), but I don't think we'll ever really get there, because building viz is too expensive to do open source, and there's too much that people will want to customize.

- Metadata - some OS format for having tools hand metadata back and forth to each other would be good, but we're not getting anywhere on that. There are efforts to build the standard, but like ETL, I think it has to come from someone being dominant and then OSing their stuff, not the other way around.

So is there a rule for that? I'm not sure. The best I can come up with is something to do with interfaces. If the tool is a common interface - either between tools or between a tool and a user - it'd probably be good to have some OS standard if we could. But I don't know that we'll ever get there with most of it.

yup very realistic and straightforwardly put. this "be dominant and then OS their stuff" is one way yes but perhaps airflow/astronomer and to a lesser but growing extent dagster - with orchestration being the one "fourth category" you missed in your list - might be the exception?

> some OS format for having tools hand metadata back and forth to each other would be good, but we're not getting anywhere on that.

i had high hopes for https://github.com/open-metadata/OpenMetadata and pushed for us to adopt but there wasn't a ton of buy in. not given up hope yet, feel like there really doesnt need to be a ton of innovation here

Yeah, whoops, I completely forgot about orchestration in that list. Funnily enough, I'd say that's something that doesn't feel like it matters that much if it's OS? That just feels like an application we all need in some form, but there's no reason why it'd be particularly useful if it's OS or just a vendor. Even standardization doesn't really matter that much. But it seems like we've ended up in a world where the expectation is for it to be OS, because Airflow was OS and it was the first popular one, so everyone believes theirs has to be OS too. But had Airflow been a paid product from day 1, I suspect we'd just see it like a standard vendor category. (There is a counterargument to that, which is that Airflow had to be OS to get popular, because OS is free, and nobody would've paid for Airflow. Which, to me, is really just a statement about how valuable Airflow (and maybe orchestration more generally) actually is.)

And yeah, agreed on the Open Metadata stuff. It feels like most versions of those things are the xkcd comic, where people set out to create a new standard. I think that's kind of a corrupting ambition, and you instead have to set out to make the next customer happy, until you become a standard.

as a former Temporal employee i'm probably koolaid laden here but i do think orchestration should be open source - it's central to the stack and annoying to replace, so if a vendor goes away or rugpulls you have every incentive to want to own and operate that code yourself. it's also the only way that people like netflix and coinbase were ever going to use us, since they had to customize so much of it - neat for getting logos yes, but it also helped in recruiting employees, and they also contributed fixes and tooling and content to us, which at their scale was pretty invaluable. perhaps you can chalk all that up to marketing but there's some product development in there that comes as a result of being an open source orchestrator.

thanks for the public thinking, very helpful :)

So that makes sense, but my question would be, when does that argument *not* make sense? Like, could Netflix not make that argument for anything? E.g., "We don't want to use Salesforce because our entire business is run on Salesforce and what if they rugpull us? And we need to customize our CRM like crazy so we need an open source version so we can do that ourselves." And yet...lots of people buy Salesforce? I have no answer here, but it does seem weird.

(And of course, much more importantly, thanks for reading and putting up with it.)

Ya part of me wants to believe a marketplace could work - but the long tail problem exists and the quality / consistency problem also exists. (For example - I have had the quality / consistency problem with Shopify apps.) And honestly - I’m not too mad about it. Good for Fivetran for executing on a “boring” problem in a quality way.

That's my third theorem of startups: The best ideas are the ones that are nothing but gross grunt work that nobody wants to do. (Ok I have no theorems of startups but if I did this would be one.)

Yes! Let me know when you come up with the first two?

Excellent.

I’m glad you brought up Airbyte. When I first heard of Airbyte I was soo excited. I thought the concept was great - a continuation of singer.io open source goodness but with a UI and taking things to the next level. Companies everywhere could create and share integrations to common sources and Airbyte could become a de facto standard. Then I tried to implement Airbyte in production. I knew the orchestration part would take some effort, but what I didn’t expect was problems with a number of very common connectors. I was very disappointed. Then I thought to myself... but this company is worth billions... how is this possible? Well - I think you spelled it out nicely. I’m still hopeful for Airbyte and haven’t used their hosted solution - so maybe they will get things buttoned up soon. HOWEVER, this quote from Tristan Handy’s Becoming Pangea post (where he quotes Fivetran CEO George Fraser) I think is true: “Building high quality connectors is hard, and there are a LOT of them to build and maintain. Customers highly value the quality dimension—everything needs to just work. So what you need is a huge customer base to amortize the creation and maintenance costs of these connectors over. If you have a huge customer base that is well-monetized, you can spend more on the connectors and they will therefore be of higher quality.” I hope Airbyte can get that quality elevated and provide some healthy competition for the ETL space.

Yeah, open sourcing things like this always felt like a good and noble idea in theory, but very hard to make work in practice. You either have to rely on the kindness of OS contributors to do it for the greater good and social glory or be a marketplace for people to monetize the connectors. Either of those might work for a handful of the common connectors, but it doesn't work for the long tail, because there's no glory or money in building that. (And this is especially bad for this specific problem, because you don't just need to build it; you need to constantly maintain it.) But customers need that long tail, so the vendor probably has to build it themselves. And at that point, you're just Fivetran, without being able to fully monetize the popular connectors you need to pay for the tail connectors.

Love it.... are 2023 valuations a dead cat bounce?

DON'T OPEN THE BOX

" Martin Casado, an investor at Andreessen Horowitz, ... and that, “Workday, Salesforce, Adobe—they’re going to be reimplemented as apps on top of the data layer.”"

What the hell does that even mean?

Every single computing application ever uses data. Data is the stuff that is computed on. A computer and its software are nothing without data.

VCs are more silly than not.

His point was that we're going to build a bunch of apps that read directly from warehouses like Snowflake and Databricks, rather than the data living in each app individually. The idea was (is?) that rather than each app maintaining its own data, apps are just consumers of a centralized source. That way, 1) companies own the data themselves, 2) there's one shared source of truth, 3) apps can be more easily extended to include data they wouldn't traditionally have themselves. It makes some rough theoretical sense to me, though it's probably very very hard to pull off in practice (apps don't want to give up control, performance, cost, etc).

Plus, we've collectively moved on from being excited about that to everyone saying, ok, what if we rebuild this on top of AI infrastructure instead?

Yeah, I think there is a slim possibility this will happen - less than 10%. It's technically possible, for sure (even though the Snows, et al. are not there yet for large scale transaction demands), but the economic incentives are not there. Even when large vendors have tried to build enterprise application monoliths, it hasn't really worked out well.

I'd put the eventual chances as higher than that, I think, because it's already happened before with BI. BI tools used to manage their own data; then, they became layers that sit on top of warehouses. I could see other apps building some things on top of the same centralized infrastructure. It may not be a warehouse exactly, but "you own your data" (especially if everyone's wanting to build AI stuff too) is a compelling enough pitch that it seems like we could get there eventually.

That said, if that happens, I suspect those apps will find other ways to do proprietary stuff. Sure, there might be a central data warehouse that the apps use, but they'll also use some of their own data in some other way, so that they have their special thing to sell you. So we end up solving the problem in name only.
