43 Comments

I just want to note that philosophers, and mathematicians, and a whole lot of other people... invented formal languages because natural language is such a complete shitshow of ambiguity, and now a lot of people are gonna reinvent this multi-century-old wheel at a ridiculous cost... To the extent that natural language can clearly express what I want so a bot can write it? Great! Bring those on! But there's a big gap between that and everything else that needs to be done.


Yeah, I had this whole section in here about how LLMs are like when senators "write" bills that are three pages long, and then lawyers have to actually translate that into a real law with real specifics that's hundreds of pages long. And it would be ridiculous to think that we could have a real legal system based on the former - but that's effectively what a lot of LLM optimists are arguing for. This post was way too long as is though, so it got cut.

I'll save it for the how LLMs fail post, I guess.


Love the article, thanks for that!

I agree that text-to-sql does not work for complex and messy DB schemata.

That's exactly why I am now investing a lot into how to implement text-to-semantic-model. That's then followed by semantic-to-SQL translation, nothing new there (Malloy, DAX, GoodData MAQL).

It requires a huge effort to create such semantic models on top of DB models, with all the business titles/descriptions for entities such as attributes, facts, and metrics.

I'm thinking about whether an LLM could help with the onboarding (bootstrapping), generating the best possible titles/descriptions from table/column names.

From my experiments so far, it looks like the most powerful models nowadays are surprisingly good in this area as well.

Btw, companies with nice DB models (or even dbt models with docs) will have a competitive advantage here!
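Roughly the kind of bootstrap I'm experimenting with, as a minimal sketch (the OpenAI client, model name, and example schema are placeholders, not a finished pipeline):

import json
from openai import OpenAI  # any chat-completion client would work; this assumes the OpenAI Python SDK

client = OpenAI()

# Raw schema pulled from information_schema - the only input we assume already exists.
columns = {
    "dim_cust": ["cust_nm", "cust_seg_cd", "rgn_id"],
    "fct_rev": ["rev_amt_usd", "arr_amt_usd", "contract_dt"],
}

prompt = (
    "For each table and column below, propose a business-friendly title and a "
    "one-sentence description, returned as JSON keyed by table.column:\n"
    + json.dumps(columns, indent=2)
)

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# Draft titles/descriptions; a human still reviews them before they land in the semantic model.
print(resp.choices[0].message.content)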


Interesting... text as a semantic model to generate things like LookML-style models, or more documentation, cleaner schemas, etc.?


At least for Snowflake, there's SnowBuilder (patent pending) 😎 Virtual transformation of your cryptic column names into natural language names. Fully qualified names automatically, plus UPPER_CASE columns provided for key columns. Different names, same column.

No more hitting the wrong key when reaching for quotes. In private preview.

The initial version of SnowBuilder was in fact used to build exactly what Benn described. Completely converted seven QlikSense applications into seven delivered Snowflake views.

- 17 total VIEWs, 10 of them reusable.

- 2,500 total columns.

- 29 existing data sources.

- Only 2 simple hand-written CREATE VIEW statements; everything else generated.

- 2,267 name transformations.

- 121 expression transformations.

- 124 UDF calls (created about 50 UDFs in place of QlikSense functions).

We will be providing a SnowBuilder repository for this, for demo, exploration, and teaching purposes. You won't be able to execute the resulting views, obviously, as you won't have the underlying data warehouse. Other demos will be executable.


As long as it can stop Snowflake from defaulting every column name to YELL_AT_ME, I'm in.


Not sure what you mean, but SnowBuilder doesn't have DEFAULT columns. It's all about querying. The foundation is your existing Snowflake data. It's all "meta-data", physically stored in Snowflake tables. But its "data types" are transformations, not "NUMBER, VARCHAR..."

Although nobody is talking about presentation layers, a core transformation in SnowBuilder is NAME transformation. Since there is near-zero text editing, and, as mentioned, you can have multiple column names against the same column, it lets you perform rapid, bulk conversion of your cryptic names into natural language.

The not-SnowBuilder generator initially converted Salesforce columns into field labels. The convention was changed several times due to the complete lack of governance by the Salesforce development team.


Oh, it was a joke about Snowflake defaulting everything to being uppercase, and it feeling like queries are yelling at you because of it. (For a longer version: https://benn.substack.com/i/71510341/karma)

So I'm not entirely sure I understand the idea behind SnowBuilder. Is it, very roughly speaking, a semantic layer on top of Snowflake, so that you can write something like

SELECT day, arr FROM revenue

and

SELECT day, revenue FROM revenue

and it'll both perform the calculation to get revenue and reference the same result for both queries, because revenue is an alias for ARR?


Love it!

Imo a blind spot in the BI space is not exploring using AI to accelerate or support the dashboard or analytics creation experience. I remember traditional BI vendors saying the majority of the time in a journey to building and deploying a dashboard (~80%) is actually not spent on creating the dashboard or analysing data. It's spent on finding data, understanding data, exploring data, preparing a semantic layer (defining dimensions, measures, aliases, buckets, drilldown groups), and then writing expressions (think DAX, set analysis, etc.) & designing/styling charts or dashboards.

And to get to this result, you would typically need to be an experienced BI developer combining a unique Venn diagram of skills comprising data, business domain, and design knowledge.

Can AI help simplify some parts of the journey in the form of a BI developer co-pilot? Semantic layer generation based on a prompt? Chart or data suggestions to pick from? Generating a good-looking and user-friendly dashboard? Styling a chart to match The Economist's style guide?

If you can enable your average business user or citizen analyst to become more and more self-sufficient, you'd allow a hyper-agile workforce to bring more meaningful data to data consumers faster. The analysis support definitely has value, but people usually just want to see sums, counts, and averages (maybe ratios).


For sure, I agree with all of this. I don't think BI tools should ignore AI; my point is more that AI chatbots can't ignore BI.

(I was trying to explain that in this other comment too, which goes into a bit more detail: https://benn.substack.com/p/we-dont-need-another-sql-chatbot/comment/20822426)


Shameless plug from atoma.ai - a pivot from patterns.app, where we spent the past 2 years building an end-to-end Python and SQL data orchestration and visualization tool, which is where atoma now gets all of its juicy context to do its job...

Idk... hard to deny the enabling capabilities of LLMs for truly self-serve BI, which has been a pipe dream for decades. Are virtual intelligences vs. dashboards a real enough product differentiation to win buyers' wallets? We'll see!

(also no need for filters or any complex dashboard UI, because the AI can just write Vega-Lite)
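(e.g., rather than a filter dropdown, the model can just emit a chart spec along these lines; the field names and data URL are invented for illustration:)

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": "https://example.com/monthly_revenue.json"},  # made-up data source
    "mark": "line",
    "encoding": {
        "x": {"field": "month", "type": "temporal"},
        "y": {"field": "arr", "type": "quantitative"},
        "color": {"field": "segment", "type": "nominal"},
    },
    # a "filter" is just another transform the model writes into the spec
    "transform": [{"filter": "datum.segment != 'Internal'"}],
}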


damnit chris, i'm trying to sell thoughtspot here, i need them to think i'm useful


(ok, but more substantively, I think that gets at my question, which is: is it possible to build the virtual intelligence thing without building the dashboard too? In theory, I get that it could be - just ask a question, get an answer, who needs the rest of the BI cruft? But to quote someone I probably found by googling "history of technological advancement" (or maybe Josh Ferguson), big technological waves tend to consume their prior waves. The internet was print plus more; streaming was TV plus more; etc. So if AI is to replace BI, it probably has to do BI plus more. Which is why I'm skeptical that chatbots will be able to avoid building most of BI.)


IMO, a chatbot alone without traditional BI features (human SQL IDE, dashboards, alerts, etc.) isn't going to replace a BI tool... your company is still going to need a human option to operate the system and some way to host the KPI dashboard, and it will be very hard for the chatbot to do its job without that system context and a natural place to save its output.

I think the value props for a virtual intelligence cover a lot that BI tools aren't great at or simply can't do... 1-on-1 data tutoring, unlimited instant ad hoc questions, analytical reasoning, and all the other LLM stuff like text search and summarization...


So I agree with most of that - you can definitely use AI to augment stuff within BI (or BI-adjacent) workflows. The analytical reasoning bot is basically what I'm proposing at the end.

The place where I disagree is "unlimited instant ad hoc questions." Because *that is BI.* And time and time again, people try to pull some subset of BI away from BI, and justify it by saying we'll make this piece way better on its own. And time and time again, one of two things happens: It either works well enough that customers want to use it, so they start asking for all the adjacent BI features it's missing until the company backs into building a BI tool; or, it doesn't work well enough to entice people to fragment their BI tool up so that they can use this other thing. Again, maybe this time is different and all that, but at least on that particular use case - answering basic questions and looking at basic charts - I have my doubts.


We're on the same page that *it is BI*... I guess I'm just saying that we can get more ambitious with the definition of BI and with the value it provides to end-users.

The thing I keep coming back to is that answering ad hoc questions without the help of an analyst is inaccessible to most non-technical users of a BI tool, while also being the promise of all BI tools. The reality is that dashboards are one view into data, and end-users can't actually add a new filter/field/calculation or tweak a visualization with confidence, so a lot of questions end up roping in an analyst to do the last-mile iteration.

There's much latent value to be captured if everyone at your company could just hammer said analyst with any question of their heart's desire and get instant answers that are correct 90% of the time.


Oh, 100%. I believe that so much that my entire bank account basically depends on it (via this ThoughtSpot acquisition).

But I don't think that's inconsistent with my point here. Could BI be made better by using AI to make basic questions easier to answer, by providing a natural language interface for asking them instead of an interface like Tableau's? Absolutely. But does that mean that we can realize that improvement by building a chatbot that doesn't do most of what BI tools do today? I'm doubtful. I think you need both - the BI stuff (permissions, visualizations, connectivity, dashboards, etc) and the AI stuff. Just one is insufficient.


What a fun run to be building! It's going to be interesting to see where products end up as a function of their existing tech... and also new concepts from the ground up!


So here is my idea:

- collect semantic model using API/SDK of BI vendor

- generate training data (Q/A, ...) from the model

- train (fine-tune) or prompt an LLM

When end users ask questions of such an LLM, the quality of answers should be much better, thanks to the semantic properties mapped to column names.

Under the hood, the LLM can translate questions into, e.g., report definitions consisting of entities from the semantic model.

For instance: Report{ attributes => [customer_name], metrics => [revenue] }

Such a report definition can be sent to the BI server, which can translate it to SQL, execute it, and return data (JSON, data frame, Arrow, ...).
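A minimal sketch of that flow (the endpoint, payload shape, and model name are illustrative, not a real API):

import json
import requests
from openai import OpenAI  # any chat-completion client would work here

client = OpenAI()

SEMANTIC_MODEL = {
    "attributes": ["customer_name", "region", "month"],
    "metrics": ["revenue", "order_count"],
}

def question_to_report(question: str) -> dict:
    # Ask the LLM to map the question onto entities from the semantic model only.
    prompt = (
        "Using only these attributes and metrics:\n"
        + json.dumps(SEMANTIC_MODEL)
        + "\n\nTranslate this question into JSON of the form "
        + '{"attributes": [...], "metrics": [...]}:\n'
        + question
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)  # in practice, validate/repair this

# The BI server still owns semantic-model-to-SQL translation, execution, and the returned data.
report = question_to_report("What is revenue by customer?")
rows = requests.post("https://bi.example.com/api/reports", json=report).json()  # hypothetical endpoint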


So I do think there's something to that, where you can piggyback off what people write in handwritten queries to infer things that are useful for an LLM. I'm not sure the best thing to do there is necessarily turn it back into a query (it seems like it might be better to build formal semantic models), but I think that gets at one way to make this work.


Snowflake has always had the ability to create mixed-case column names with spaces. You just have to embed them in double quotes, like most RDBMSs (except SQL Server). Upper case is the SQL standard unless qualified by double quotes.

I figured it was a joke, but didn't get it.

But having to constantly qualify mixed case with double quotes is a royal PITA, as is much else in writing VIEWs, e.g. COLUMN COMMENTs.

SnowBuilder (patent pending) removes 90%+ of that.

It's really much, much more. As soon as I get domain issues resolved for the new website, I will publish much more.

It's pure SQL. It generates DDL and DML from meta-data repositories (Snowflake tables). It performs what I call Virtual Transformations. Instead of transforming physical data, it does it as VIEWs. Semantic (and presentation, not that anybody talks about that) layers are built from layered VIEWs. "Serially reusable meta-data".

Extract, Load, Virtual Transform (ELVT) is the workflow. But the Extract is meta-data, not data.

We call it "SnowBuilder SQL", e.g., user functions: SBLD_VIEW and SBLD_SELECT take the name of a "VIEW" and generate the CREATE VIEW DDL and the SELECT statement, respectively, using the source VIEW name in a repository.

NAME transformation is a key feature. Everything generated is always fully qualified, e.g. "Customer Name".

I refer to this as "Freedom from the tyranny of the pinky finger"!
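To make NAME transformation concrete, the generated DDL looks roughly like this (a sketch with invented table and column names, not actual SnowBuilder output):

# Sketch of the kind of CREATE VIEW DDL a SBLD_VIEW call might emit from the
# name-transformation repository (all names here are invented for illustration).
generated_ddl = """
CREATE OR REPLACE VIEW MART."Customer Revenue" AS
SELECT
    C.CUST_NM       AS "Customer Name",   -- natural-language name (double-quoted)
    C.CUST_NM       AS CUSTOMER_NAME,     -- UPPER_CASE key column: different name, same column
    R.REV_AMT_USD   AS "Revenue (USD)"
FROM RAW.DIM_CUST C
JOIN RAW.FCT_REV  R ON R.CUST_ID = C.CUST_ID;
"""
print(generated_ddl)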

No SnowBuilder UI. Editing is either direct SQL for bulk editing, or pulling meta-data into a spreadsheet, editing, and loading it back into the repository.

All user-managed and delivered as SECURE UDFs via a SHARE (marketplace). Unlimited usage and no limit on repositories.


On SnowBuilder, gotchaaaa... this makes sense to me. I've talked about something kind of like this a few times, where you could run a semantic layer like Malloy directly in the warehouse. This sounds somewhat similar to that?

https://benn.substack.com/p/how-dbt-fails#:~:text=For%20example%2C%20BigQuery%20began%20supporting%20Malloy%20natively.%20Malloy%20became%20to%20SQL%20what%20TypeScript%20is%20to%20JavaScript%3A%20A%20syntatic%20semantic%20superset%20to%20raw%20queries.


I think Benn should write a Children’s Book - “If You Give a Mouse a Datum.”


I dunno, the word count on a children's book seems like a real challenge.


You could do better than AI - case in point:

Title: If You Give a Mouse a Datum

If you give a mouse a datum, he'll ask for a computer to analyze it.

When you give him the computer, he'll want to create a program to sort the data.

After he writes his program, he'll realize he needs more data to make it more effective.

So, he'll ask for an internet connection to gather more data.

Once he connects to the internet, he'll discover an array of different data types and want to learn more about each one.

Learning about these different types, he'll realize he needs to enhance his coding skills.

So, he'll ask for a coding manual to start learning new programming languages.

Once he has the manual, he'll be so engrossed in learning that he'll forget to eat.

When you remind him, he'll ask for a slice of cheese.

As he enjoys his cheese, he'll realize he's been sitting for too long.

He'll want to go outside for a run, and ask for a pair of tiny sneakers.

Running around the yard, he'll spot a beautiful butterfly and chase after it.

Chasing the butterfly, he'll forget all about data and computers.

When he finally sits down, tired and satisfied, he'll spot a single piece of data in the grass.

And chances are if you give a mouse a datum, he's going to ask for a computer to analyze it.


Wow, ok, that was more of an existential journey than I was bargaining for.

"If you give a mouse a datum, he'll come to question his entire existence, and long for the simple freedom of his forefathers, the country mice, who had not yet been transformed into rats in a race."


It took some turns. Lol!


fire article! thank you!


Thanks!


Almost every one of your posts makes us look at things a different way, introduces a new paradigm, like *no one else* does. That's especially true for this specific post. I am so amazed... and thankful for all these thought-provoking articles you are writing and sharing with us.


Thanks! I really appreciate that, and glad you like it.


Benn, you have an amazing talent for being extremely right and extremely wrong at the same time


Sometimes you gotta put it on all black; double or nothing. (But also, go on?)


you're over-biasing on the obvious, which, yes, happens to be 95% of how engineers think about their products. But that does not mean that everyone thinks in those terms. And to ignore the technological advancement just because 95% of the entrepreneurs are bad - well, that's exactly how you fail. The right question to ask is NOT "why does this thing fail?" but what would make something succeed. And you've already written plenty in your previous posts about how some of the things in this general area can succeed. It is just all about putting the entire puzzle together IMHO


I get that in broad terms, but don't understand how it relates here? I'm not sure where I'm saying most engineers think this way? Or what technology I'm saying should be ignored because of that way of thinking?

I think most *customers* think in a particular way, but that's a firmer reality than how a company thinks.


right. I got lazy and missed the 3rd section :). Yes, we're in agreement. But the title is clickbait. The content is more nuanced.


Gotta juice those numbers so I can get my Elon check


Why not just provision a vector store with context info then feed the bot the prompt it needs to translate the queries you're discussing? Your take is naive.


You could, I guess, and that might work? But also, I'd argue that defining the context is the work. For example, in the BI case, using a tool like Looker is relatively straightforward; the hard part is defining all of the LookML that tells the tool the context it needs to do what it does.

You could define that context somewhere else, but...that sorta defeats the purpose? I guess you could write it in plain language rather than YAML, which may or may not be easier - YAML (and code in general) is unnatural, but it's precise.

And I think that's my bigger point here. If you want an LLM to do a very precise thing - ie, data analysis - you have to tell it to do that precise thing, either through the prompt or in the vector store. And at some point, that seems like it just turns into regular old engineering, with a probabilistic chaos agent in the middle.
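(For concreteness, the kind of thing I assume you have in mind is roughly this sketch - Chroma and the OpenAI client are just stand-ins, and the schema notes in it are exactly the LookML-ish context that somebody still has to write by hand:)

import chromadb               # any vector store works; Chroma is just an example
from openai import OpenAI

store = chromadb.Client().create_collection("schema_context")

# This is the part that is the actual work: someone has to write these notes.
store.add(
    ids=["fct_rev", "dim_cust"],
    documents=[
        "fct_rev: one row per invoice line; rev_amt_usd is recognized revenue, arr_amt_usd is ARR",
        "dim_cust: one row per customer; cust_seg_cd maps to segment names in seg_lookup",
    ],
)

def text_to_sql(question: str) -> str:
    # Pull the most relevant schema notes into the prompt, then ask for SQL.
    context = store.query(query_texts=[question], n_results=2)["documents"][0]
    prompt = "Schema notes:\n" + "\n".join(context) + "\n\nWrite SQL for: " + question
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content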


I don't know about chaos agents; I'd say it is innovation. This new technology will increase the returns to and leverage from a certain kind of knowledge (data engineering) while superannuating others. But it is hard to predict how things will shake out.


Sure, I agree with that, generally (this is basically my theory on the whole thing: https://twitter.com/bennstancil/status/1631790710381748225)

So yeah, the only real answer here is who knows. But if I had to guess (Friday deadlines being what they are) I'd guess that we're going to have to do a lot of massaging to LLMs before they're as valuable as we all (me included) got hyped up into believing.


My Very Smart Take is that, since LLMs are fundamentally loss-minimizing agents, their potential impact is limited: the "smarter" they get, the more able they will be to misconstrue the intent of difficult requests so as to provide a technically accurate, but lazy and useless, solution. They are the genius slackers of the tech world, but with no drive to invent at all.


I sort of like that take more than the "they will learn to conquer humanity" take. They could do that, but, like, nah, sounds like a lot of work, let's just chill.
