I spoke to Lloyd a week ago about dbt models that could express their logic as Malloy! Way ahead of you :P
I have zero confidence that this will actually happen in some kind of near-term, but I say it just because I agree with you. I don't think dbt-as-a-framework should insist that you author your business logic in SQL, it's just that this was the best place to get started. Imperative languages like Python, languages-that-compile-to-SQL--all of these are reasonable ways to express business logic. It's just important that they can all execute in the cloud data platform and not require data movement. Ultimately it would be great if dbt-as-a-framework could understand the syntax of each of these expressions and use that knowledge to not just operate on your business logic as text (as it does today) but know exactly what you are expressing.
Lots and LOTS of work to do.
I do think that dbt's concept of a metric will allow you to do exactly what you're describing with the sales funnel though. We do not currently define a metric of type "funnel" but if you could describe how that should work we could--or you could--implement this (the metrics definitions themselves are all in user-space and all Apache 2.0). Define what a sales funnel is, configure as many of this type of metric as you please.
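To make that concrete, a funnel-type metric might compile down to something roughly like this. To be clear, this is just a sketch: the events table, stage names, and columns are all invented for illustration, and none of it is part of any actual dbt spec.

```sql
-- Sketch of what a hypothetical "funnel" metric type might compile to.
-- Assumes an events table with user_id, event_name, and occurred_at;
-- all names here are invented, not part of the dbt metrics spec.
with stages as (
    select
        user_id,
        min(case when event_name = 'visited'   then occurred_at end) as visited_at,
        min(case when event_name = 'signed_up' then occurred_at end) as signed_up_at,
        min(case when event_name = 'purchased' then occurred_at end) as purchased_at
    from events
    group by user_id
)
select
    count(visited_at)                                       as visited,
    count(case when signed_up_at > visited_at then 1 end)   as signed_up_after_visit,
    count(case when purchased_at > signed_up_at then 1 end) as purchased_after_signup
from stages;
```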
I have to admit, I didn't think about dbt *and* Malloy working together...
And all of that makes sense in the longer term. I think it'll be really interesting to see how far Jinja gets stretched by what people do with metrics. It feels like entire mini-languages could get built within it, to the point where a lot of what you're describing—an imperative language that compiles to SQL—almost falls out by accident.
Great post Benn and thanks for the shoutout. There is a lot of power in the relational model that isn’t accessible via SQL. This is what inspired us at RelationalAI to work on a relational language and system to help people get further than they can get with SQL alone without having to give up the relational model and all its goodness. It’s also exciting to see Tristan write that dbt-as-a-framework has aspirations to support languages like Malloy. Malloy is a great example of a relational language that is not SQL. You can find many other cool and mind-blowing examples of relational languages at Jamie Brandon’s conference on HYTRADBOI (have you tried rubbing a database on it?).
Thanks! There was an idea there I couldn't quite figure out how (/ran out of time) to articulate, which is how something like RelationalAI potentially collapses the awkward game of telephone that semantic layers play today. People want to write SQL, and databases want to speak SQL, but to inject a semantic model in the middle, we go query to semantic translation and back to SQL again. If the database is built on semantic models rather than relational ones, the first two steps are the same—query to semantic translation—but you get to skip the third.
Let me see if I can explain how we try to avoid the telephone game. Apologies in advance if this is more detail than you wanted. I’m happy to discuss further when we talk next.
Let’s first start by making a distinction between the relational paradigm and SQL. When the relational paradigm was invented 50 years ago, the two core languages that defined it were mathematical. One was called the relational algebra and the other was called the relational calculus. The early relational database community realized that most people would find them hard to use, so they went about inventing relational languages that were more accessible. SQL was one of many competing proposals (QUEL, QBE, ISBL, PIQUE are examples of many others). For a variety of reasons SQL emerged as the industry standard. Systems that implement SQL (or any relational query language for that matter) will typically translate SQL to relational algebra, which is the mathematical language that database engines really understand. This is true for modern cloud database management systems like Snowflake, BigQuery, Redshift, etc.
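To make that translation concrete (with an invented employees table), a simple query like

```sql
select name from employees where dept = 'sales';
```

becomes the algebra expression

$$\pi_{\text{name}}\big(\sigma_{\text{dept} = \text{'sales'}}(\text{Employees})\big)$$

where $\sigma$ is selection (filter the rows) and $\pi$ is projection (pick the columns). It is that expression, not the SQL text, that the engine optimizes and executes.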
The cool thing about SQL and other relational languages is that they are declarative — in other words, they free the user from having to write procedural code telling the computer step by step how to compute something efficiently and safely. When we write programs procedurally in languages like Python, Java, or C#, we have to write code that we wouldn’t have to write if we used SQL. For example, we have to write code to implement the right data structures and the right algorithms. SQL takes care of query optimization and evaluation for us. We also might have to write code to make it possible to run the algorithm on many cores or write code to manage data sets larger than memory. SQL takes care of parallelization and memory management for us. Writing procedural code requires us to have more expertise and to make more effort. In my experience we need a few more years of technical education, and 10 times (or more) the effort, to write the procedural code that implements non-trivial application logic vs what is required to write it in SQL. This is why there are many more people who can write SQL than people who write procedural code.
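A small illustration of that division of labor (invented table and column names):

```sql
-- Declarative: we state *what* we want. The engine chooses join
-- algorithms and data structures, parallelizes across cores, and
-- spills to disk when the data doesn't fit in memory.
-- (Table and column names are invented for illustration.)
select c.region, sum(o.amount) as total_sales
from orders o
join customers c on c.customer_id = o.customer_id
group by c.region;

-- The procedural equivalent means hand-writing a join algorithm, a
-- grouping structure, multi-core scheduling, and out-of-core memory
-- management, just to reproduce what this one statement expresses.
```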
If writing procedural code is so painful then why does anyone do it at all? Wouldn’t it be great if our relational systems would just let us write all our application logic in a relational language like SQL so that we don’t have to worry about all that procedural complexity? Of course it would be but until recently computer scientists didn’t know how to build relational database management systems that can express and run sophisticated application logic. For example, powerful relational databases like Snowflake and BigQuery still recommend that you use other technology for graph and reasoning workloads. You also have to leave them if you want certain predictive and prescriptive analytics. That’s why most application logic is written in procedural languages like Java, C#, and Python.
We’ve had to write procedural code out of necessity, not because it is less effort. Whenever we hit the limits of SQL systems, we built escape hatches that let us work around those limits by writing procedural code. One such escape hatch is writing procedural code to implement application logic outside the database. Another such escape hatch is writing code in a stored procedural language that runs inside the database. Regardless of where the code runs (inside or outside the database), we are on our own to make it efficient and scalable. The database engine has no idea what the code is trying to do — i.e. it doesn’t know the intended semantics. Because it doesn’t know the semantics, these powerful cloud data platforms cannot use their query optimization and evaluation machinery to help us out. To make matters worse, we seem to have developed a sort of Stockholm syndrome or a psychological bond with our procedural captors. We’ve learned to think of the escape hatch as a feature and not as a reminder of the limitations we have to work around.
The good news is that history has shown that it’s only a matter of time before computer scientists figure out how to make important workloads run on relational databases. In the 80s they invented join algorithms and paged data structures that made relational databases the standard for transactional operational applications. Who remembers CODASYL, the procedural/navigational technology relational databases replaced? In the 90s computer scientists invented parallel joins and column stores that made SQL the standard for analytics and BI. Who remembers the array-based MOLAP technology that relational databases replaced? Ten years ago computer scientists invented a new cloud-native architecture and new JSON language extensions to SQL to make relational databases like Snowflake the standard for Big Data analytics and data transformation (with a little help from our friends at dbt). Does anyone remember how Hadoop was supposed to make SQL and relational systems obsolete? Seemingly overnight, Hadoop went from being the future of data management to being COBOL. The next time we talk you should ask Bob to tell you his funny story about how, as CEO of Snowflake, he was turned down for funding 27 times just 3 years before Snowflake had the biggest software IPO in history.
If history has taught us anything, it is that the relational paradigm always wins.
So where do we go from here? How do we get to the promised land where we can easily build rich models of our business on the modern data stack with all its declarative, secure, and scalable SQL and relational goodness? My colleagues and I at RelationalAI have a modest proposal :-). We are building a *relational* knowledge graph management system based on brand new join and query optimization algorithms that make it possible to declaratively express and efficiently execute sophisticated business models and application logic. RelationalAI speaks SQL and extends it with a stored *relational* language called Rel (think Malloy with a different syntax and a few extra features). In this arrangement we can use semantic frameworks like dbt to build as much of the rich business model in SQL as possible. However, whenever we hit the limits of what SQL can express clearly or run efficiently, instead of having to leave the comfort of our relational cocoon and subject ourselves to the harsh complexity of procedural coding, we can use Rel to write queries and formulas that clearly express and efficiently compute the semantics of our business model. Since the output from these Rel queries is stored in tables and is queryable via SQL, we can continue to use whatever cloud database platform and analytical engineering tools we like (e.g. Snowflake, dbt, Tableau, Looker, etc).
I totally agree with Tristan’s comment that dbt-as-a-framework should not require that the business model and application logic/semantics be expressed only in SQL. I also totally agree that it is important that the application logic “execute in the cloud data platform and not require data movement”. However, it’s not clear to me that we want to express our application logic in procedural languages. Procedural code is much harder (and in some cases impossible) for our tools to analyze to understand the intended semantics. Without knowing the intended semantics, we can’t use our cloud data platform to automatically optimize it, parallelize it, incrementalize it, or determine its data lineage, etc. I think we should do everything we can to make it possible to build our business model and express our application logic in a relational language. Ideally it would be SQL, but if we can’t do it in SQL we should fall back to a more expressive relational language running on a modern relational engine that can evaluate it efficiently.
Apologies for the long reply to your short question. I hope this helps explain what we have in mind. I’m looking forward to chatting about this some more offline.
p.s. If you agree with me that SQL is just one of many possible flavors of relational languages (just like there are many flavors of procedural languages) then I hope you allow me to suggest a minor change to an otherwise awesome post. I think it would have been clearer for the subheading “Business models aren’t relational models” to be written as “Business models aren’t SQL models”. They could be relational models if you express them in languages like Malloy or Rel.
Thanks! I really appreciate all of the detail here. There's a lot to talk about on this (and looking forward to hearing Bob's stories), but two things really stand out to me.
First (and this is a place where I'm admittedly ignorant of my history), the general ratchet toward declarative languages is something I hadn't considered. I remember the immediate demise of all things MapReduce as soon as we had SQL layers on top of Hadoop, but never generalized that. Thinking about other places where that applies - and in particular, places where declarative languages can replace procedural ones - puts that entire trend in a very different light.
Second, the idea of expressing semantic/application logic in something declarative (even if it's not SQL) feels very appealing to me. Things like LookML feel like they approach this in the right way in spirit, where I can specify how to model the various elements of my business or application, and Looker figures out what to do about it. But it feels very limiting in practice, partly because of the design constraints in LookML, and partly (I suspect) because of LookML's close ties to SQL itself. But if implemented in a way that felt a little cleaner (which I think Malloy is a step closer to), that starts to feel very interesting.
A lot of the comments on Twitter about this post were focused on how SQL is fine for semantic modeling; the problem is the structure of the data, or the complexity of the business itself. But in the context of your comment, I feel like those points aren't quite right. SQL isn't very good for semantic and application modeling; we just want to think that it is, because declarative languages are good for semantic modeling. But SQL doesn't have to be the only declarative language we do it in, if only something better came along...
if only ;-)
If you think about it, relational logic is extremely complicated when you introduce time into it, and similarly, where SQL is wanting is when you need to do metrics with a sequence element.
I think you're often better off trying to define states (as opposed to recording events) in some other type of language too. Jinja doesn't seem like a good solution there.
gaaaaah now I'm upset I didn't think of anything to do with time as a better example of this. Turning contracts into daily ARR in a relational model is a nightmare, and I still don't know how to do it without doing something that feels like some heinous hack.
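For the record, the heinous hack in question is usually a date-spine join, roughly like the sketch below. All the names are invented, and the spine uses Postgres syntax; every warehouse generates it differently (dbt-utils has a date_spine macro for exactly this).

```sql
-- Expand each contract into one row per day it is active, then sum.
-- Assumes a contracts table with start_date, end_date, annual_value;
-- all names invented. generate_series is Postgres-specific.
with date_spine as (
    select generate_series(
        date '2020-01-01', current_date, interval '1 day'
    )::date as as_of_date
)
select
    d.as_of_date,
    sum(c.annual_value) as daily_arr
from date_spine d
join contracts c
  on d.as_of_date between c.start_date and c.end_date
group by d.as_of_date
order by d.as_of_date;
```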
And agreed that Jinja doesn't make sense in its current form. But I could see a Rails/React-ified version working out, though I'm not sure what that would exactly look like.
Part of why SQL gets an opportunity to thrive by being so dumb is that it lives in an ecosystem where things around it are constantly doing the things it is bad at. Other software can change something's state based on implicit assumptions about when business logic is relevant, instead of that logic being explicitly and verbosely defined by a set of relational definitions, which means that SQL keeps its own scope to stuff that is achievable within its own rigid boundaries.
That makes sense, though it feels like a sort of sad conclusion - "we have to keep doing this in a clumsy way because if we did it in a better way, we could do too much with it and we'd hurt ourselves." I like the dumb thing (in no small part because I can understand it), but surely there's a way to better thread that needle.
@benn Funny, I worked with DHH on the first RESTful version of Rails, and recently I’ve been prototyping ways to bring that kind of “convention over configuration” thinking to dbt. Would love to brainstorm with you some possibilities for how to pull that off: https://calendly.com/drernie
Cool, would be interested in checking it out!
Here’s a sanitized example of what I am currently using in production https://github.com/TheSwanFactory/cqml/blob/main/tests/cqml_test.yml
I'm not gonna pretend that I can make sense of this yet...but color me very curious.
https://github.com/TheSwanFactory/cqml
The central idea is replacing “languages” with a Domain Specific Data Format that carries the semantics we care about using a generic syntax. In theory, this makes it easy for a wide range of tools to both consume and produce it, allowing arbitrarily sophisticated levels of automation and flexibility.
SQL does have limits... but those limits can be offset!
I love all of these posts. One thing that I think hasn't been discussed here yet, as far as I can see, is row-based versus set-based processing. Row-based processing is perhaps a bit more human-intuitive (i.e. "Okay, let's look at this case/thing, now this next case/thing, now this next case/thing") but it doesn't scale theoretically or practically. It's kind of like comparing deductive (row-based) to inductive (set-based) logic, although the metaphor isn't perfect. As programmers (who are, in one sense, applied mathematicians), if we are doing our jobs well, we are always moving towards the general case using inductive logic.
SQL in its basic form is mostly about set-based logic. And that's because relational algebra, which underlies SQL, is a branch of math that is about sets. If you force it, a la PL/SQL, cursors, or whatever other method, you can get it to do row-based processing, but that's usually a bad bargain for someone who isn't comfortable with set-based processing.
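To make the contrast concrete (invented names):

```sql
-- Set-based: one statement over the whole set; the optimizer decides
-- access paths and parallelism. (Invented table and column names.)
update accounts
set status = 'dormant'
where last_login < current_date - interval '1 year';

-- The row-based version (a PL/SQL loop or cursor) fetches one account
-- at a time, tests the date in procedural code, and issues one update
-- per row: far more code, and the engine can no longer optimize the
-- operation as a whole.
```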
This is why, in practice, it often seems a struggle to get folks coming from a non-analytics-oriented software development background to transition to working efficiently with databases - because other software development paradigms seem to naturally conform themselves to row-based processing. That might be just an artifact of the way we teach or describe it - I'm not saying it's inherent to, say, Python. I can't count the number of times I've spoken to someone and said "OK, having access to an API is nice, but now can I call the API in such a way that it gives me ALL the data rather than having to go row by row or a few records at a time, because I want to analyze the whole thing at one go?" This seems like such an intuitive request to me, but somehow that conversation regularly seems to be difficult and painful. And "painful" not necessarily from the practical concern of "it would be too costly to let you have all the data or expensive on the system, etc." but more like "Why in the world would you ever want more than a few rows at a time?"
There is probably some relationship between a language being declarative and being set-based but I'm not sure it's exactly the same thing.
Also, as others have noted here, SQL in some cases is declarative (within a given query). But.... it is very easy to accidentally code a series of SQL statements and realize that one has built the equivalent of a set of imperative transformations. Maybe not "imperative" in the sense of "read this line from a file", "replace all the cases of the letter 'a' with 'z'", "write a line", but imperative in the sense of "I as a human being can look across all of these intermediate, temporary query steps and see ultimately that there is meaningful output to this, but looking at it step by step, I'm giving very specific details to the computer regarding how to go about it that the computer can't easily optimize the various aspects of." dbt can help with that but it's not a 100% solution.
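A caricature of what I mean (invented names):

```sql
-- "Imperative" SQL: each step materializes an intermediate result the
-- optimizer can't see past. (Invented names, for illustration.)
create temp table completed as
    select * from orders where status = 'complete';

create temp table customer_totals as
    select customer_id, sum(amount) as total
    from completed
    group by customer_id;

create temp table big_customers as
    select * from customer_totals where total > 1000;

-- The declarative version: one statement the engine can plan as a whole.
select customer_id, sum(amount) as total
from orders
where status = 'complete'
group by customer_id
having sum(amount) > 1000;
```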
Thanks! On the row- vs set-based conversation, that's potentially an interesting explanation as to why thinking about data in SQL vs in Python feels so different. When I first started doing data work, most everything I did was in Excel or R. When I started working in SQL, it took me some time to reorient myself around the way SQL made you think (and this wasn't the declarative vs procedural distinction, I don't think; it was more about how I conceived of what data was). I've always thought that it was mostly because SQL pushed you to think of everything as relational tables. But it may be more that it breaks you away from thinking strictly in rows.
And I like the point about declarative languages still leaving the door open for lots of mistakes. While you may not have to figure out the exact recipe to create the thing you want to create, you've got to describe something really complicated and nuanced. Even if SQL could be plain English - "I want a table with one column like this, and one like this, where each row represents this" - it's still really hard to exactly describe what you want. So in that regard, to most people, I suspect it's a distinction without a difference.
I've been working in this space for 40+ years. Problem is, you do need a new language and ecosystem. You also need an AI engine. And there's the rub. The thing about the semantic layer is you can't program it all. You need something that understands how to convert your specification into working code, and it can't be done statically - it has to happen record by record as it processes the query.
So here's what I do: https://unibase.zenucom.com. It's been in service for decades building complex, high-performance applications - i.e. it's not a proposal or a nice idea - it's a real, hard-working semantic database.
There's a data dictionary that defines tables, fields, calculations (that might include results from other tables). It also defines relationships between tables.
There's an AI engine that can understand all that so that when a field from a table is requested it can return the stored data, or do a series of calculations and data retrievals to give an answer.
No SQL, just layouts that get interpreted against the dictionary.
It's a big system, as you'd expect after all these years, but the important thing is the semantic layer and AI engine that makes it happen.
SQL is not non-procedural as every statement is an instruction to do something. Unibase is non-procedural as everything is a specification until a program is run to produce an output from the specification. This is not splitting hairs because you don't know how the specification will be interpreted until something tries to use it.
Yeah, the idea of specifications as semantics (though I only have a hand-wavy notion of what I mean by that) feels very appealing to me. As Molham said in his comment above, I don't really think of semantic modeling as a procedural process. I'm not trying to tell a computer how to solve a problem; I just want to describe the system to the computer, and have it figure out what to do about that.
Let me specify to you how my business works, and then you spit out all the implications of that.
That's the idea.
However, all normal computer processes are very procedural. That's why you need something else. A simple example is HTML. The HTML engine does all the layout, calculations, etc.; you just need to specify the layout. There's no real AI, as there is a direct correlation between the specification and the display.
With Unibase the correlation is very indirect. This is because of the equation solver that looks at a lot of equations and works out the order to evaluate them and their dependencies. It works out when to get records from other tables, it works out how to get the right records, and it can return summed values for use in equations higher up.
The result is that typical programs are very short - on the order of 20-30 lines. Big programs reflect very complex data display. That's what you can't do easily with SQL and PHP, or React, etc.
I really like your entries, among other things because you're aware of Microsoft's prior art. Many blogs from startups in the MDS space seem to ignore Microsoft - I'm not sure if it's out of ignorance, fear, hipster coolness, or contempt, but in any case it's weird to pretend one of the biggest BI/DB players doesn't exist.
To the point of this specific post: during the past decade, Microsoft has moved away from multidimensional cubes in SSAS and focused on tabular, putting the emphasis on Power BI and the DAX language (also available in Excel by the way). I have mixed feelings about DAX but you can't argue with the fact it's widely adopted. Yes, in the end it generates SQL queries but comes with syntax and patterns to use inactive relationships when needed, compute time intelligence, and other things you'd expect for analytical/aggregate purposes. But that language is proprietary and I'll be watching with interest whether Malloy gets any traction. DAX has a steep learning curve and it's a bit distressing to tie one's career to a single vendor, even one as big as Microsoft.
As someone who briefly worked for Microsoft after an acquisition, I think it's a general "ok boomer" attitude in Silicon Valley towards Redmond. That kinda sorta made sense ten years ago, when Microsoft seemed to be slowly fading from cultural relevance (one $20b quarter at a time). Now it makes less sense, but brand perception is hard to shake.
And yeah, I think that hits on some of the problems with Microsoft now though. It still feels like a closed ecosystem that largely sells to other ecosystem players. That's not entirely true, but it's not that wrong either. And for a general data community that values modularity and interoperability so much (which I don't think is necessarily right, to be honest) that makes it an uphill battle.
Long live SQL. It took forever to get good at it, and everyone attacks SQL simply because they want to be 'thought leaders'. Get out of here with your terrible fly-by-night language. If you are going to solve problems, just solve problems - leave SQL alone.