A few months ago, Shawn Wang, who is more widely known as swyx, launched a viral GitHub project called smol developer.1 At its core, the project is an AI agent for writing code, but it puts an innovative spin on what is quickly becoming a tired category: Rather than offering its users a chatbot that writes snippets of code on request, smol developer “scaffolds an entire codebase out for you once you give it a product spec.”
For example, a user might want to create a Chrome extension that blocks paid tweets.2 They would write a short description of the extension, including what the tool is and details about what they want it to do—keep a counter of how many tweets it’s blocked; give me a way to whitelist some accounts; make the icon Bam Adebayo. They submit the spec to smol developer; it returns the entire application.
Smol developer’s particularly clever twist is that, after it produces its first draft, it then provides a kind of feedback-oriented IDE where people can tell it what they like and don't like about the application that it just built. They can ask to change the design, request new functionality, or tell it anything that a product manager might tell an engineer after seeing an initial prototype. The bot takes this feedback, combines it with the original spec, and updates the app. And so the process continues, with the user and bot reinforcing one another—the app improves as the spec gets refined; the spec gets more descriptive as the app goes from imagined concept to testable product.
Does this work? Can it create complex apps? I have no idea. Still, the approach struck me as subtly revolutionary. Most of what’s been created with LLMs so far has either shoehorned them around existing workflows—e.g., an automated assistant in a code editor—or shoehorned workflows around chatbots—e.g., a Slackbot for writing SQL queries. Smol developer is a step outside of the lines. It doesn’t directly help its users write code; it instead uses code as a silent intermediary between product and product manager. And rather than a chatbot, there’s an interface designed specifically for drafting, testing, and improving a product spec. If LLMs do end up changing how we use computers, this type of approach—one that introduces new ways of accomplishing some task rather than accelerating old ways; the car rather than the faster horse—feels like a glimpse of the future.3
“Make me a dashboard of song streams”
One of the small ironies of today’s SQL chatbots is that they help people do exactly the thing that data teams try to discourage. As analysts, we ask our colleagues to help us understand how our work will be used. They shouldn’t request some piece of data; they should instead tell us what they’re trying to achieve. And if they don’t tell us what they want to use some data pull for, the less tactful among us pepper them with demands to explain why they need it.
Bad bedside manner notwithstanding, it’s good advice. And importantly, we don’t recommend that analysts ask these questions just to make their jobs easier; we also recommend it because we can’t give a useful answer without that context.
So why are chatbots different? If we have to ask a bunch of follow-up questions before we dig up some number for people, why are we excited about LLMs that mechanically do exactly that? What’s the difference?
There are two possibilities, I suppose. One is that we don’t actually need that context at all. Our back-and-forth could all be self-serving theater4 to hide the fact that we’re mostly here to answer questions and build reports.5 The second is that chatbots do need that context. Without it, they’re just another code-free BI tool that’s useful for basic reporting, but under-delivers on the self-service nirvana that’s long been promised. But encoding “business context” into some YAML file sounds ridiculous, and describing every detail to a chatbot anytime you want to answer a question sounds exhausting.6 For this reason (among others), I’m generally skeptical that these bots will be that revolutionary.
But the smol developer approach—treat the bot like an eager but inexperienced employee—offers a third possibility. Instead of asking for answers, people describe the report or dashboard that they want to create. The LLM-powered “smol analyst” produces a rough dashboard, and the user provides feedback, just as they would to a junior analyst. The spec gets more detailed; the dashboard gets more useful. And business context gets added indirectly, as needed, as the spec gets more precise.
Suppose, for instance, a music producer wants to figure out how a new release is performing.7 They could write a couple paragraphs about the report they want—identical to what they’d send a data team today. They could say they want to see daily streams, streams by region, and how many people have listened to the song multiple times. The smol analyst could crank out a bunch of charts, write some loose narrative about them, and return it to the producer. Then, just as they’d do for a junior analyst, the producer would send back feedback: This number looks off; this explanation doesn’t quite make sense; can you dig into this unexpected anomaly? The bot would create another draft, the producer would give more direction, and so on.
This could have a handful of big advantages over the chatbot-based approach. First, and most obviously, it'd be more accurate than a zero-shot bot that has one chance to get the answer right.8 Though today's chatbots are “trained” on prior answers, it's mostly through crude and infrequent upvotes and downvotes that only indirectly affect the underlying model. The smol analyst would instead get immediate and direct feedback on what it needs to improve, and could inject that feedback straight into a prompt.9 Moreover, by gradually refining their requests, people could probably push this type of bot to answer far more complicated questions than a typical chatbot.
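As a concrete sketch of what that loop might look like (with a placeholder `call_llm` standing in for whatever model API the bot actually uses, and prompt wording invented for illustration), the mechanics could be as simple as folding every piece of feedback back into the next prompt:

```python
# A minimal sketch of the spec-plus-feedback loop described above. `call_llm`
# is a placeholder for a real model API; the prompts are invented.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call here."""
    raise NotImplementedError

def draft_report(spec: str, feedback_history: list[str]) -> str:
    # Fold the original spec and every piece of feedback into one prompt,
    # so each draft is built from the refined spec rather than from scratch.
    feedback = "\n".join(f"- {note}" for note in feedback_history) or "- (none yet)"
    prompt = (
        "You are a junior analyst. Produce the SQL and a short narrative for "
        "the report described below.\n\n"
        f"Report spec:\n{spec}\n\n"
        f"Feedback on earlier drafts, in the order it was given:\n{feedback}"
    )
    return call_llm(prompt)

spec = "Daily streams for the new single, split by region, plus repeat listeners."
feedback_history: list[str] = []

for round_num in range(3):
    report = draft_report(spec, feedback_history)
    print(report)
    note = input("What should change? (leave blank to accept) ").strip()
    if not note:
        break
    feedback_history.append(note)
```

The specific prompt doesn't matter much; the point is that the feedback accumulates, so every new draft is generated from the refined spec rather than from a single up or down vote.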
Second, the back-and-forth could also help people ask better questions. We often don’t know what we want until we start looking for it. Just as it’s almost impossible to write a perfect product spec without testing an imperfect prototype first, it’s very hard to ask exactly the right question before seeing the answers to a few of the wrong questions. A smol analyst would encourage this sort of iterative exploration, which is good for both user and agent.
Finally, it seems like this approach—if it works—could be applied to adjacent problems with relatively little difficulty. For example, could we create ETL pipelines in this way? Data models? Orchestration schedules? You could imagine someone describing a data model in plain language, and a smol analytics engineer using it to produce some scaffolding in LookML or Malloy.
The black box
There is, however, at least one very big reason why a smol analyst wouldn’t be as useful as a smol developer. In software, how code works is in some sense irrelevant; all that matters is that it works. I can test my ad-blocking Chrome extension without knowing a line of JavaScript, or that JavaScript exists at all. If the tool does what I want it to, it works, no matter how “bad” its codebase.
In data, black boxes don’t work. Computational process matters. You can’t validate a dashboard by testing that it produces a reasonable-looking chart; you have to make sure that the logic behind its calculations is correct. SQL is declarative, but used for imperative ends—we need to know how it works, step by step. Software is the opposite: It typically uses imperative means for declarative ends.10
That makes the test-and-refine feedback loop much harder for analytical work than engineering work. Whereas a PM can tell a smol developer that their Chrome extension doesn’t seem to be blocking video ads correctly, a producer can’t easily tell a smol analyst that their dashboard is improperly counting skips as streams. Someone would have to review the code to know that.
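To make that concrete, here's a hypothetical pair of queries (the plays table and its columns are invented) that would both produce a perfectly reasonable-looking chart of daily streams:

```python
# A hypothetical illustration of the black-box problem. Both queries produce a
# plausible-looking daily chart; the table and column names are invented.

counts_skips_as_streams = """
    SELECT play_date, COUNT(*) AS streams
    FROM plays
    GROUP BY play_date
"""  # counts every play event, including ones skipped a few seconds in

counts_only_real_streams = """
    SELECT play_date, COUNT(*) AS streams
    FROM plays
    WHERE ms_played >= 30000  -- only plays of thirty seconds or more count
    GROUP BY play_date
"""

# A producer eyeballing the dashboard can't tell these apart; someone has to
# read the SQL to catch the difference.
```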
One obvious solution to this is…to have someone review the code. Rather than giving every executive a personal data scientist, the smol approach could give every analyst a team underneath them. People ask (human) analysts questions; (human) analysts ask (smol) analysts for help; (smol) analysts produce the drafts, and (human) analysts review them. This is structurally similar to how a lot of data teams’ peer review processes work today, just with a lot more analysts.
Multi-model BI
There could also be another way out of the black box. In most conversations about LLM-based applications, we talk about them as if there’s a single model underneath the product. A model writes a SQL query, for instance, or responds to a support ticket. And the product is only as good as that model’s training.
Viable, a company that automatically analyzes product feedback and has been building on top of LLMs for several years, found a different way to be successful. Instead of relying on one refined model, Viable uses a network of them, each of which specializes in a narrow task.11 One organizes input data; one finds themes across those inputs; one writes summaries of each theme; one uses those summaries to author a final report that’s sent back to Viable’s users. Viable also uses ancillary LLMs to help people understand what it’s doing. There’s a model that describes the assumptions that the analytical models are making, and gives users a chance to correct them. If people want to clarify something, they explain it to the assumption model, which passes their feedback down to the analytical models, which update their work based on it.
A smol analyst could follow this same approach. When it’s asked to produce something, it could have one model describe a query plan back to the user, who could validate it or correct what it got wrong. Another model could translate queries into English summaries, just as an analyst might when they share their work back to an executive. And a different LLM could automatically create text-based data models like TextQL’s capsules from reports that were marked as correct.
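Sketching that out, under the same placeholder `call_llm` assumption as earlier (the prompts and structure here are invented, not Viable’s or TextQL’s actual implementation), the chain might look something like this:

```python
# A rough sketch of chaining specialized models, in the spirit of the approach
# above. `call_llm` is again a placeholder for a real model API, and the
# prompts, function names, and schema handling are invented for illustration.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call here."""
    raise NotImplementedError

def smol_analyst(question: str, schema: str) -> dict:
    # 1. One model describes its query plan in plain English, and the user
    #    gets a chance to correct it before any SQL is written.
    plan = call_llm(
        f"In plain English, describe how you would answer '{question}' "
        f"given this schema:\n{schema}"
    )
    correction = input(f"Proposed plan:\n{plan}\nCorrections? (blank if fine) ")
    if correction.strip():
        plan = call_llm(f"Revise this plan:\n{plan}\nUser correction: {correction}")

    # 2. A second model turns the agreed-upon plan into SQL.
    sql = call_llm(f"Write SQL for this plan.\nPlan: {plan}\nSchema: {schema}")

    # 3. A third model translates the SQL back into an English summary, the way
    #    an analyst would when sharing work with an executive.
    summary = call_llm(f"Explain what this query does for a non-technical reader:\n{sql}")

    return {"plan": plan, "sql": sql, "summary": summary}
```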
Of course, none of this fully illuminates the black box. The query plan could get misinterpreted by the SQL-writing LLM, or the summary bot could get its explanation wrong. The only way to know for sure what a query or Python script does is by reading the query or Python script. But this kind of smol analyst—which, at this point, is more tol than smol—could go a long way in upgrading today’s bots from novel toys to potentially useful agents.
“Smol,” a popular social media app tells me, means “small,” but, like, in a cute way. I did not know this either.
Or, more accurately, an X extension that unblocks unpaid posts.
I’m not saying that this is the future; I’m saying that it’s the future if LLMs live up to their hype. At this point, that’s an open question. My view on it has swung from definitely yes to absolutely not—so, uh, I guess BS stands for Benn Stancil?
The real self-serve analytics was the analysis we did to make it seem like we were important. (Alternative footnote: “Self-serving theater about data” isn’t a bad name for this blog.)
“Finding insights isn’t even our job. And it’s not making decisions, which is a common misconception. Because actually, our job? It’s just…numbers.”
That said, TextQL, a SQL chatbot, is trying to solve this problem in a pretty interesting way. When we define data models today, we usually default to doing it in a very structured way, like a YAML file of joins and metric formulas. TextQL throws that out, and asks its users to define capsules that map questions and business topics to tables and columns—e.g., “to calculate revenue, join these two tables together, filter out these rows, and sum that column.” Not only is this easier for a human to read—including those who don’t know SQL—but it’s also probably a better way to express this information to an LLM. One of the hardest parts of building a SQL chatbot is compressing a huge amount of schema information into a relatively small prompt. Capsules provide a direct way of doing that: “If you get a question about this topic, use these tables and columns.”
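For a rough sense of the shape, a capsule might boil down to something like this (a toy illustration with invented table names, not TextQL’s actual format):

```python
# A toy illustration of the capsule idea, not TextQL's actual format; the
# table and column names are invented.
revenue_capsule = {
    "topic": "revenue",
    "instructions": (
        "To calculate revenue, join orders to order_items on order_id, "
        "filter out rows where status = 'refunded', and sum item_amount."
    ),
    "tables": ["orders", "order_items"],
}
```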
Y’all. It’s happening. August 31. (I think? Griff, big fan, but we gotta talk about this date format. 31.08.23? Even Excel doesn’t recognize that as a date.)
This highlights another irony with LLMs. With bots, we tend to fret a lot about accuracy. We don’t want to use them unless we’re completely sure we can trust them. But I used to be a junior analyst, and got a lot of stuff wrong—and people seemed kinda ok with that? I had to fix it, but there was some expectation that it might take me a couple tries. Why are we comfortable with that, but not comfortable with a bot that’s equally inaccurate?
Time and time again, the solution to making LLMs work seems to be “treat them like they work like a person.” That could apply here too. What would help a junior analyst produce a better report: a single unexplained like or dislike on their final draft, or immediate and direct feedback telling them exactly what to improve?
For more on declarative and imperative languages, check out this…AI-written advice column on LinkedIn? What on earth is that?
Yet again, to make the most of an LLM, treat it like you would a person.
I like the iterative approach that Viable uses. Dividing out tasks and having an instance of the LLM focused on that task. I have seen some really good results taking this a step further and building out personas. So you have Olivia Data Organizer, Tommy Theme Finder, Simon Summarizer, each with focused skills that are relevant for the task at hand. I know it seems ridiculous but giving the model tons of context as to its skills, personality, abilities etc. seems to be a very effective way of focusing it and getting better results.
Generally agree that the SQL stuff is sketchy at best. I hate having to build up the mental model of someone else's SQL, CTE by CTE, .SQL by .SQL. If that person is as deranged and convincing as GPT, it can only be net negative, optimising for correct-looking rather than correct
The best use cases are seemingly creative rather than correct. Perhaps early data modelling is a good opportunity here. You could describe your business and existing data and systems to a few LLMs, they each develop a data model and then discuss the benefits of each. This isn't operationally integral so not really going to get much attention, but I'm wary of much else at this stage.
Beyond that, I think it is interesting to think about ongoing investigations, ie "can you dig into this unexpected anomaly", but giving the bot the capability to persist that instruction as an agent over time. Bots running the legwork on the exploratory angle, and ideally sense-checking each other before presenting would be quite interesting.
(to your analyst point, I was also a junior analyst, I would put a lot of effort into making correct-looking work, incentives!)