33 Comments

@bennstancil I love this idea of future apps having schemas (and more generally architectures) that optimize for #llm (not human) convenience. #schema-for-bots or #schema-on-bits or something 🤓

The part that seems tough is figuring out exactly what that schema is. But it seems doable?

Thought-provoking post, thanks Benn!

Do you have thoughts regarding the following:

The fully-joined event tables (FJ) that you describe as being good for AI are, in my experience, the very end of the DAG. In our dbt environment we have around 1,800 models and use the FJ-like models for exposure into BI tools.

Running generated SQL on these FJ models is kinda trivial because all you do with FJ models is filtering and aggregating. All the complex joins that might require business process knowledge have already been done for the AI.
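
To make that concrete, a generated query against one of these FJ models might look as trivial as this (table and column names are made up):

```sql
-- Hypothetical query against a fully-joined event table: the joins are
-- already baked in, so the generated SQL only filters and aggregates.
select
    order_region,
    date_trunc('month', order_date) as order_month,
    sum(order_revenue)              as revenue
from fj_orders                       -- a wide, pre-joined "FJ" model at the end of the DAG
where order_status = 'completed'
  and order_date >= date '2023-01-01'
group by 1, 2
order by 1, 2;
```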

So querying FJ models is not hard, and it is also the smallest fraction of what our data department (at a scaleup) does.

The big chunk of work (analytics engineering) goes into the construction of all of the, say, 1,700 models which in the end land in many different FJ tables. This chunk of work would be interesting to automate. But here AI is missing a crucial piece of information.

What’s the missing piece for AI? An understanding of the business processes. The data landscape is so super fragmented: fetching data from 80 SaaS tools, internal APIs, public APIs, data from the same sources being interpreted in different ways (business processes) depending on region ... chaos in terms of data integration.

So the hard part for the human is mapping the fragmented, ever-changing, always under-documented business processes onto the data these processes create. This is so hard that people need to sit in meetings and exchange business process knowledge from brain to brain via communication.

Without this business process knowledge, whose most up-to-date version sits in people's brains, data modeling cannot be done. And hence an AI that does not somehow acquire this business process knowledge cannot produce meaningful data models.

Maybe all AEs should become some kind of documentation / config file maintainers who create a standardized mapping between business processes and data that is efficient to maintain and interpretable by AI.

So I have two (very belated) reactions to this:

1. On the work being done on full-join models being kinda trivial, I hear you, though I think AI could make that work a good bit better. My guess is that most people interact with it via some pivot-table-like interface in a BI tool. That’s not bad, but if you could do the same thing with more natural language questions, I think you could take that a lot further. Tableau, for example, is basically a giant pivot table that you can do all sorts of crazy comparisons on top of: rather than just aggregate by X, you can aggregate by X over a rolling window, compare the growth rate of that aggregate with the growth rate of another aggregate, and so on (see the SQL sketch after this list). If people could ask questions of that form as easily as they could ask “Show me revenue by region,” then I think “self-serve” could go a lot further than it does today.

2. On the real work being modeling the messy stuff, I agree that that’s hard (though, because of the point above, I think the other work is also hard). I’m not sure how you solve that, to be honest, though you could perhaps imagine a world where you just describe how those systems work, and the AI thing writes the various integrations for you. “Salesforce does this, it works like this, here are the various things we expect it to do, etc.” So in that case, people still need to describe the mapping, but they don’t have to translate the plain-language mapping into code.
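
To make the point in #1 concrete, a question like "28-day rolling revenue versus order growth" might compile down to something roughly like this (a sketch; every table and column name is invented):

```sql
-- Sketch: 28-day rolling revenue and order counts by day, plus how each
-- rolling total has grown relative to 28 days earlier.
with daily as (
    select
        order_date               as d,
        sum(order_revenue)       as revenue,
        count(distinct order_id) as orders
    from fj_orders
    group by 1
),
rolling as (
    select
        d,
        sum(revenue) over (order by d rows between 27 preceding and current row) as revenue_28d,
        sum(orders)  over (order by d rows between 27 preceding and current row) as orders_28d
    from daily
)
select
    d,
    revenue_28d,
    orders_28d,
    1.0 * revenue_28d / nullif(lag(revenue_28d, 28) over (order by d), 0) - 1 as revenue_growth,
    1.0 * orders_28d  / nullif(lag(orders_28d, 28)  over (order by d), 0) - 1 as orders_growth
from rolling
order by d;
```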

Great piece! We solve the problem of supplying the semantics to the LLM (as well as constraining the queries it generates) by representing the data via a knowledge graph and using its ontology as the "Rosetta stone" to bridge the human-language question to the graph query. In general, an ontology, originally designed to aid human data consumption, elevates all the data to the conceptual level, removing the ambiguity and arcana (and the cruft of multiple underlying schema model layers), and can easily be annotated in the spots where a bit more help is needed. Ontologies can be used to describe a very broad data landscape comprising many contributing sources, as well as really complex data, and a decent graph store is designed for completely ad hoc queries on all of the data present, allowing the end user to follow their nose through the data, wherever it might take them.

Thanks! So does that mean you've built your own LLM that's trained on those semantics, or do you prompt it with some representation of that knowledge graph + the ontology?

For expediency, we are currently mainly using gpt-4, with a bit of 3.5-turbo where we can, and are prompting it with compressed ontologies. I have no doubt, though, that we will end up either fine-tuning one of the mainstream commercial services (when available) or training our own - quite a few that I have tried already produce reasonable SPARQL. There are simply too many obvious advantages to ignore working towards (automatically) fine-tuning a model per individual knowledge graph as soon as practical.

Gotcha, yeah, that seems to be the trend, with the expectation that we'll get to the point where you can train your own models relatively soon.

In a couple of years, when interest rates are relaxed:

1) Senior Management is going to buy into the marketing message of LLM Data Analytics Vendors supported by the System Integrators who see it as a way to earn more money.

2) The Head of Data and the new batch of junior data analysts that are fresh out of university will see LLM Data Analytics as a way to advance their careers.

3) They'll find that their (not-so) Modern (anymore) Data Stack is a mess, and that the data has no semantic meaning, leading to bogus LLM Data Analytics.

4) They'll look for cheap ways to add semantic meaning to the data (data tagging).

5) As the automated data tagging startups still haven't successfully managed to tag data accurately, Africa will extend its existing data labeling infrastructure to add data tagging and create generational wealth for Africans (At least, that's what I hope).

Source: https://www.mantralabsglobal.com/blog/ai-in-africa-artificial-intelligence-africa/

I'm not sure about the endpoint (though, seems good?), but it certainly seems possible that we end up in a world where data teams throw a bunch of automated (or semi-automated) tasks at various data problems, like people shoveling coal into a train engine. I'm not sure I fully believe that - I doubt it'll be so simple - but I could see outlines of that emerging.

Activity schema was basically what Zynga ran on. Most of the data was in a table called ztrack_count, where what was being counted was defined by the caller in five fields: kingdom, phylum, class, family, and genus. PMs defined a schema around these fields and did funnel analysis and so forth from sequences.
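
For anyone who hasn't seen the pattern, a rough sketch of what a table like that looks like (the types, the bookkeeping columns, and the example meanings in the comments are guesses, not the real DDL):

```sql
-- Rough sketch of an open-ended event-counting table: the caller decides
-- what the five taxonomy fields mean, and PMs build funnels on top of them.
create table ztrack_count (
    event_ts  timestamp,
    user_id   bigint,
    kingdom   varchar,   -- e.g. game or product area
    phylum    varchar,   -- e.g. feature
    class     varchar,   -- e.g. action
    family    varchar,   -- e.g. variant or step
    genus     varchar,   -- e.g. detail
    count_val bigint
);
```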

The main reason was to decouple the PMs' work from the core data group; we could only handle a small rate of schema changes, so we created this open-ended event class and let them define the tracking fields.

Did it work? I've heard of a few versions of this, and the people who made them all swear by them. But then it's not very widespread, which makes me wonder why.

It's specifically for instrumented events, which are simpler than most data warehouse entities. In fact at the time I didn't think of it as a data warehouse, but as an events store that used data warehouse software. We would also refine the raw data into more structured user profiles.

It worked well because code flows are complex, and we needed confirmation that each step had been reached; for instance an install might come through an alternate flow from an email link instead of the app store. So lots of small events to confirm the process were better than a single big install event, which might not even be reached.

It did take a long time to develop and test these schemas on the dev end. Fields would be null, the logging point might be skipped or over-logged in a loop, or there'd be a bunch of other coding problems. It was more that the flow had to match expectations rather than the data matching a single schema -- all the logging points leading up to an install, including identifying the source, first visit, etc., had to be correct -- versus creating a single wide row with all the install information collected (also hard), or a snapshot of the user table in a traditional OLTP-based warehouse.
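
Conceptually, the kind of flow check described above might look something like this (the genus values and column names are hypothetical):

```sql
-- Sketch: check that installs are preceded by the expected logging points
-- (source identified, first visit) rather than trusting one big install event.
with steps as (
    select
        user_id,
        min(case when genus = 'source_identified' then event_ts end) as source_ts,
        min(case when genus = 'first_visit'       then event_ts end) as visit_ts,
        min(case when genus = 'install_complete'  then event_ts end) as install_ts
    from ztrack_count
    group by user_id
)
select
    count(*)                                            as installs,
    count(case when source_ts <= visit_ts
                and visit_ts  <= install_ts then 1 end) as installs_with_expected_flow
from steps
where install_ts is not null;
```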

Gotcha, that makes sense. In a lot of ways, I suppose this architecture isn't really that much simpler than a more traditional one; it just changes where the complexity is. Data collection becomes way more important, as does making sure event pipelines are very reliable. That's a very hard problem you get to avoid with non-event-based setups.

Patterns are deceptive. We see patterns (or hear them) and think they mean something. Sometimes that's true but not always. Herein lies the problem. Many patterns mean nothing while things that aren't patterns can mean a lot.

Turning to prime numbers: they're not a pattern, they're a rule. Prime numbers don't form a pattern - whether they do is part of intense research in mathematics. What has been discovered is possibly a rule, but there is an infinite number of prime numbers and we don't know how to find them all, so the rule is at best a guess and could be disproved at any time.

People think they can identify patterns in business models, but a business is an open system and can change significantly at any time. That makes share markets unpredictable. It makes tipping points possible even though we often don't know where they are. And so on.

I have worked with fashion designers and retailers (footwear) for over 40 years and I can tell you they don't know. All educated guesses. The market can change dramatically for reasons they don't see, and therein lies the problem.

Patterns are the problem not the solution.

The point isn't about the semantics of what makes a pattern or a rule, though; it's just that there are ways the world works, some of which we can see and some we can't. Those things aren't necessarily ironclad laws, but could just be relationships or correlations.

The Balaji bit makes me think of macroeconomics more than anything. We've learned a lot about how the economy interacts with itself over the last 100 years - how monetary policy broadly affects things, the impact that certain policies can have, etc. - and that's largely been beneficial. (Yes, people can argue that we've made lots of mistakes or that the prevailing theory is wrong, but that's not the point; the point is that there are some theories that make us better able to make policy decisions once we understand them.) Those theories aren't patterns or rules, but just a bit of understanding about the world.

Surely, though, there are other such theories out there that would help us even more, but we haven't found them yet. And that probably applies to how companies operate too. There are things out there we haven't seen yet - call it a pattern, a rule, a relationship, a theory, whatever - but if we figured them out, we'd have an easier time navigating the maze.

From my observations:

AI is good when

1. Training data is high quality and plentiful

2. Consequences of failure are low

3. Precision requirements are low

Self-driving failed because it is exactly the opposite on those three requirements:

1. Training data does not exist. There was no pre-existing database of first-person driving videos. And videos don't capture the full world state.

2. The consequence of failure is death.

3. At high speeds, a slight mistake is fatal.

AI art succeeded (with 1/100th of the investment into self-driving), because:

1. Training data is extremely plentiful (the full human history of art), and much of it had extremely high-quality labels, not from Mechanical Turks (cheap outsourced labour) but from passionate enthusiasts.

2. Consequences of failure are non-existent: 20 seconds wasted at most.

3. Good enough art is good enough. An extra finger here or there, the human eye can ignore.

Data analytics looks more akin to self-driving than to AI art:

1. SQL and Python queries are extremely plentiful. However, the databases they depend on are not collected; they are usually proprietary company data. This fundamentally separates normal code from data code, as data code depends on context beyond the code itself to verify correctness. Cloud databases can technically know both the SQL query and the dataset, but is the market for automated analytics big enough that the cloud providers want to risk prying into their customers' data?

2. Consequences of failure are moderate. Core metric failures can mislead investors, get employees mistakenly fired, etc. Human oversight moderates this, but if a human cannot trust a metric, their usual instinct is to avoid it altogether.

3. Precision requirements are high. Good-enough data is rarely good enough; overstating your sales revenue by 20% is not a trivial mistake. Subtle mistakes are the hardest to detect, which makes them precisely the most damaging.

Hence it appears to me that analysts are hard to replace. In particular, AI-augmented analysts, who will probably be 2x-3x as productive as old analysts, will be cost-effective enough to make pure AI analytics unattractive in comparison.

Finally, AI itself demands titanic volumes of data, and we may be the coal shovellers of the new industrial revolution. Coal shovellers earned more than the farmers of their era, even though their work was low-skill, simply because they were proximate to the engines of profit.

So all of this makes a lot of sense to me, with one big potential exception: though we don't really talk about it this way, a lot of analytical work is speculative. The precision that's required is based on formulas and definitions we made up, and the analysis we do is just one of thousands of paths we could take through some dataset to find something interesting. For example...

Suppose we're an apparel company, and we have a metric that defines revenue from new vs. existing customers. There's not really a standard way to measure that. New could mean never bought anything before, or hasn't bought something in X months. We've got to figure out how to classify people who we can't easily identify. We've got to figure out what to do with returns, gift cards, exchanges, etc. Some AI analyst might do this differently than we would, but - is that wrong? As long as it's consistent over time, it's probably fine. So with the exception of very standard stuff, I don't think things need to be that precise against some external notion of Truth (within reason, of course); they just need to be internally consistent.
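
To make that concrete, one of many equally defensible definitions might look like this (every table, column, and rule here is made up):

```sql
-- One of many defensible definitions: a customer is "new" on their first
-- order ever; everyone else is "existing". Returns, gift cards, and
-- exchanges are assumed to be netted out upstream into net_revenue.
with orders_ranked as (
    select
        customer_id,
        order_id,
        order_date,
        net_revenue,
        row_number() over (partition by customer_id order by order_date, order_id) as order_rank
    from orders
)
select
    date_trunc('month', order_date)                         as order_month,
    case when order_rank = 1 then 'new' else 'existing' end as customer_type,
    sum(net_revenue)                                        as revenue
from orders_ranked
group by 1, 2;
```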

For exploratory analysis, this seems even more true. Analysts go off looking to answer a question; what they come back with is just one of at least dozens of possibilities. Would an AI "think" about the problem in the same way? Almost certainly not. Is that worse? Doesn't seem like it necessarily is.

All that said, I don't think this implies we'll lose our jobs or something. History's full of predictions about how technology will replace this or that job en masse, and those predictions don't seem to be right very often. So I'd fully expect our jobs to change, but I have no idea how.

Right now AI is missing one or two really important things. One is entropy - randomly chosen statements, words, or images, created or otherwise, are really high entropy and should be detectable on that basis alone.

The other thing is that organisms, and humans in particular, have goals and understand semantics. AI has neither, and until it does it's going to be marginal or maybe useless.

One of my business partners tried ChatGPT to generate descriptions of our product etc. Total rubbish.

Why is entropy important?

On the semantics part, agreed - but I'm not so sure semantics will *really* matter. They do if we assume that AI will answer questions in the same way that people do, by exploring a problem semantically. But I'm not sure that'll really be necessary, especially if we stop architecting our data models around semantics.

These LLMs can even begin to write half-decent SQL only because of the huge training corpus of data that already exists on the internet. If we decided to change the data models to something more "AI-friendly", I wonder if we'd have to let that bake with humans for some period of time to generate enough Stack Overflow threads to train a new model on.

That's an interesting point. I don't know enough about how these things work to know how important that would be. (Though they seem to be able to teach themselves these things in some ways, so they might be able to bootstrap themselves: https://maximumeffort.substack.com/p/i-taught-chatgpt-to-invent-a-language)

Really interesting! To be honest I think the marginally more likely scenario is that AI has a quicker impact at the analysis end of the data pipeline, and is used to produce summary analyses, descriptive statistics, etc.

I agree that's where it can have an impact, though I think that's why the data models underneath it matter so much. Right now, those models are designed for people, which feels like it'd limit how well AIs could do that analysis. If they were instead working with data that was structured in a way that fit how they "thought," it seems like they could go much further than they can today.

> What if instead of building new patterns for people, we built them for our AI gods?

I for one welcome our new schema-less AI overlords. At least if they are smart enough to give me a list of the actual events they included, AND a sample of the near-misses they excluded.
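
Even something as simple as this (a sketch, with invented table names and filters) would go a long way:

```sql
-- Sketch: show the events that were counted toward a metric, plus a sample
-- of "near misses" that just missed the filter, so a human can audit what
-- the AI included and excluded.
select event_id, user_id, event_ts, 'included' as bucket
from events
where event_type = 'purchase'
  and event_ts >= date '2024-01-01'

union all

select event_id, user_id, event_ts, 'near_miss' as bucket
from events
where event_type = 'purchase'
  and event_ts >= date '2023-12-01'
  and event_ts <  date '2024-01-01';
```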

Yeah, this is one of the things I imagine would be important in this problem. I don't think just a number would be enough; I suspect they'd have to also "explain their work" like what's shown here: https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html

> Nobody’s going to use a tool that is randomly and confidently incorrect half the time

*cough* Stock analysts *cough* TV experts

Yeah, but I pick stocks incorrectly 75% of the time, so I'll take it.

Preach it! 🙌

If I were more shameless, I'd start promoting C.U.T.E.

Customers, Users, Transactions, Events.

All based more or less on activity schema. This is actually how many marketing/GTM LOBs and CDPs model data for consumption by APIs for messaging, ads, etc.
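
A minimal sketch of what those four entities might look like (names and types are illustrative, not a standard):

```sql
-- Minimal sketch of the C.U.T.E. entities: customers, users, transactions,
-- and events, all hanging off resolved customer/user identities.
create table customers    (customer_id    bigint primary key, first_seen_at timestamp);
create table users        (user_id        bigint primary key, customer_id bigint references customers (customer_id));
create table transactions (transaction_id bigint primary key, customer_id bigint references customers (customer_id), amount numeric, transacted_at timestamp);
create table events       (event_id       bigint primary key, user_id bigint references users (user_id), event_type text, event_ts timestamp);
```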

I'll buy the AI hype around data when it can deliver on entity resolution to create entity models.

The query to the entity is obvious. Creating these entities is harder.

It seems like that's potentially one way it goes, where everyone has a basic schema like that. And if that's what happened, I think creating those entities is hard-ish. Sure, you've got to deal with entity resolution and all that, but it's like solving a maze backwards to me - once you know exactly the structure of what you want to map things to, the problems get a lot easier.

But there's another potential route this goes, which is that the models we make start looking weirder and weirder because they're designed for computers and not people. This was Josh's point here, and I could see something like this happening: https://twitter.com/josh_wills/status/1619045473150734336

Entropy (see "Shannon Entropy") is effectively a measure of the randomness of some data. Entropy in a closed system must always increase (2nd law of thermodynamics). In an open system energy can be consumed to reduce randomness. This is a possible definition of life. Certainly James Lovelock proposed it as a way for NASA to detect life on other planets - basically atmospheric analysis would show chemicals that could not occur without energy consumption.
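
For reference, the Shannon entropy of a discrete distribution p over symbols x is:

```latex
% Shannon entropy: the average information content, in bits, of a source
% with symbol probabilities p(x); higher H means more randomness.
H(X) = -\sum_{x} p(x) \log_2 p(x)
```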

While I don't know how to do this yet, my hypothesis is that if you compare a ChatGPT text with a human-written text (even from someone not very bright), the entropy of the former will be quite high while the latter's will be low.

This is at the heart of so many things.

Seen in this light, semantics guides the application of energy to a lower entropy.

My research on data models and semantics has shown the importance of a higher level of data model that uses semantics and AI to better understand the business model.

The implication is that with this approach applications can be built that cannot be built using traditional methods.

It's all a bit like the monkeys-and-typewriters scenario. The assumption that all events, randomly produced, are equally likely is wrong. Low-entropy events (e.g. the works of Shakespeare) have a probability near 0, whereas high-entropy events (random text) have a probability near 1.

I could go on for days, but I trust this is enough for now. :)

One aside that this made me think about - I wonder if an AI would identify the same patterns in a business model that people do (obviously, in some ways, it would, because of how it's trained, but putting that aside for a moment). I hate having to non-critically quote Balaji twice here, but consider this clip: https://twitter.com/goth600/status/1618711673535332352

The entire way that we see patterns - in business models, and other things - is limited by the patterns we can see. That doesn't mean there aren't other ways to see them, or that there aren't potentially more powerful patterns that we're just missing. That seems like the real value of this - for us to start realizing there are other patterns to how businesses operate that we haven't been able to find yet.
