I'm a believer—AI will change everything about data too.
@bennstancil I love this idea of future apps having schemas (and more generally architectures) that optimize for #llm (not human) convenience. #schema-for-bots or #schema-on-bits or something 🤓
Thought provoking post, thanks Benn!
Do you have thoughts regarding the following:
The full-joined-event-tables (FJ) that you describe as being good for AI are, in my experience, the very end of the DAG. In our dbt environment we have around 1800 models and use the FJ-like models for exposure into BI tools.
Running generated SQL on these FJ models is kinda trivial because all you do with FJ models is filtering and aggregating. All the complex joins that might require business process knowledge have already been done for the AI.
So querying FJ models is not hard, and it is also the smallest fraction of what our data department (at a scaleup) does.
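To make the "only filtering and aggregating" point concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table `fj_orders`, its columns, and its rows are all invented for illustration and are not from our environment; the point is only that the query against a pre-joined table needs no joins.

```python
import sqlite3

# Hypothetical wide, pre-joined table; all names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fj_orders (order_id INTEGER, region TEXT, plan TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO fj_orders VALUES (?, ?, ?, ?)",
    [
        (1, "EU", "pro", 100.0),
        (2, "EU", "free", 0.0),
        (3, "US", "pro", 150.0),
        (4, "EU", "pro", 50.0),
    ],
)

# No joins needed: the query only filters (WHERE) and aggregates (GROUP BY).
revenue_by_region = dict(
    conn.execute(
        "SELECT region, SUM(revenue) FROM fj_orders "
        "WHERE plan = 'pro' GROUP BY region"
    ).fetchall()
)
```

Everything that would need business knowledge (which orders count, how regions are assigned) is already baked into the table before this query runs.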
The big chunk of work (analytics engineering) goes into the construction of all the, say, 1700 models which in the end land in many different FJ tables. This chunk of work would be interesting to automate. But here AI is missing a crucial piece of information.
What’s the missing piece for AI? An understanding of the business processes. The data landscape is extremely fragmented: fetching data from 80 SaaS tools, internal APIs, public APIs, data from the same sources being interpreted in different ways (business processes) depending on region ... chaos in terms of data integration.
So the hard part for the human is mapping the fragmented, ever-changing, always under-documented business processes onto the data these processes create. This is so hard that people need to sit in meetings and exchange business process knowledge from brain to brain via communication.
Without this business process knowledge, whose most up-to-date version sits in people's brains, data modeling cannot be done. And hence an AI that does not somehow acquire this business process knowledge cannot produce meaningful data models.
Maybe all AEs should become some kind of documentation / config file maintainers who create a standardized mapping between business processes and data that is efficient to maintain and interpretable by AI.
Great piece! We solve the problem of supplying the semantics to the LLM (as well as constraining the queries it generates) by representing the data via a knowledge graph and using its ontology as the "Rosetta stone" to bridge the human language question to the graph query. In general, an ontology previously designed to aid human data consumption, elevates all the data to the conceptual level, removing the ambiguity & arcana (and the cruft of multiple underlying schema model layers) and can easily be annotated in the spots where a bit more help is needed. Ontologies can be used to describe a very broad data landscape comprising many contributing sources, as well as really complex data and a decent graph store is designed for completely ad hoc queries on all of the data present, allowing the end user to follow their nose through the data, wherever it might take them.
In a couple of years, when interest rates are relaxed:
1) Senior Management is going to buy into the marketing message of LLM Data Analytics Vendors supported by the System Integrators who see it as a way to earn more money.
2) The Head of Data and the new batch of junior data analysts that are fresh out of university will see LLM Data Analytics as a way to advance their careers.
3) They'll find that their (not-so) Modern (anymore) Data Stack is a mess, and that the data has no semantic meaning, leading to bogus LLM Data Analytics.
4) They'll look for cheap ways to add semantic meaning to the data (data tagging).
5) As the automated data tagging startups still haven't successfully managed to tag data accurately, Africa will extend its existing data labeling infrastructure to add data tagging and create generational wealth for Africans (At least, that's what I hope).
Activity schema was basically what Zynga ran on. Most of the data was in a table called ztrack_count, where what was being counted was defined by the caller in five fields: kingdom, phylum, class, family, and genus. PMs defined a schema around these fields and did funnel analysis and so forth from sequences.
The main reason was to decouple the PMs' work from the core data group; we could only handle a small rate of schema changes, so we created this open-ended event class and let them define the tracking fields.
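For anyone who hasn't seen this style of table, here is a rough sketch of funnel analysis over such an open-ended event stream. Only the five field names come from the description above; the event values, the user IDs, the timestamps, and the `funnel_counts` helper are hypothetical, and this is not how Zynga's actual tooling worked.

```python
from collections import defaultdict

# Hypothetical events: (user_id, timestamp, kingdom, phylum, class, family, genus).
events = [
    ("u1", 1, "tutorial", "start", "", "", ""),
    ("u1", 2, "tutorial", "finish", "", "", ""),
    ("u1", 3, "store", "purchase", "", "", ""),
    ("u2", 1, "tutorial", "start", "", "", ""),
    ("u2", 2, "store", "purchase", "", "", ""),
    ("u3", 1, "tutorial", "start", "", "", ""),
    ("u3", 5, "tutorial", "finish", "", "", ""),
]

# A funnel is an ordered list of (kingdom, phylum) pairs chosen by the PM.
funnel_steps = [("tutorial", "start"), ("tutorial", "finish"), ("store", "purchase")]


def funnel_counts(events, steps):
    """Count users who reached each step, requiring the steps to occur in order."""
    by_user = defaultdict(list)
    # Sort by user, then by timestamp, so each user's sequence is chronological.
    for user_id, _ts, kingdom, phylum, *_rest in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user_id].append((kingdom, phylum))
    counts = [0] * len(steps)
    for sequence in by_user.values():
        next_step = 0
        for event_key in sequence:
            if next_step < len(steps) and event_key == steps[next_step]:
                counts[next_step] += 1
                next_step += 1
    return counts
```

Here `funnel_counts(events, funnel_steps)` returns `[3, 2, 1]`: three users started the tutorial, two finished it, and one went on to purchase. User `u2` purchased without finishing the tutorial, so that purchase does not count toward the third step.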
Patterns are deceptive. We see patterns (or hear them) and think they mean something. Sometimes that's true but not always. Herein lies the problem. Many patterns mean nothing while things that aren't patterns can mean a lot.
Turning on prime numbers is not a pattern. It's a rule. Prime numbers don't form a pattern - part of intense research in mathematics. What has been discovered is possibly a rule but there is an infinite number of prime numbers and we don't know how to find them all so the rule is at best a guess and could be disproved at any time.
People think they can identify patterns in business models but as an open system it can change significantly at any time. That makes share markets unpredictable. It makes tipping points possible even though we often don't know where they are. And so on.
I have worked with fashion designers and retailers (footwear) for over 40 years and I can tell you they don't know. It's all educated guesses. The market can change dramatically for reasons they don't see, and there is the problem.
Patterns are the problem not the solution.
From my observations.
AI is good when
1. Training data is high quality and plentiful
2. Consequences of failure are low
3. Precision requirement is low
Self-driving failed because it is the exact opposite on all three requirements
1. Training data does not exist. There was no pre-existing database of first-person driving videos. And videos don't capture the full world state.
2. The consequence of failure is death.
3. At high speeds, a slight mistake is fatal.
AI art succeeded (with 1/100th of the investment that went into self-driving), because
1. Training data is extremely plentiful: the full human history of art, much of it with extremely high-quality labels, made not by mechanical turks (cheap outsourced labour) but by passionate enthusiasts
2. Consequences of failure are non-existent: 20 wasted seconds at most.
3. Good enough art is good enough. An extra finger here or there, the human eye can ignore.
Data analytics looks more akin to self-driving than to AI art.
1. SQL and Python queries are extremely plentiful. However, the databases they depend on are not collected; they are usually proprietary company data. This fundamentally separates normal code from data code, as data code depends on context beyond the code itself to verify correctness. Cloud databases can technically know both the SQL query and the dataset, but is the market for automated analytics big enough that the cloud providers want to risk prying into their customers' data?
2. Consequences of failure are moderate. Core metric failures can mislead investors, mistakenly fire employees etc. Human oversight moderates this, but if a human cannot trust a metric, their usual instinct is to avoid it altogether.
3. Precision requirements are high. Good-enough data is rarely good enough; overstating your sales revenue by 20% is not a trivial mistake. Subtle mistakes are the hardest to detect, which makes them precisely the most damaging.
Hence it appears to me that analysts are hard to replace. In particular, AI-augmented analysts, who will probably be 2x-3x as productive as old analysts, will be cost-effective enough to make pure AI analytics unattractive in comparison.
Finally, AI itself demands titanic volumes of data, and we may be the coal shovers of the new industrial revolution. Coal shovers earned more than farmers of their era, even though their work is low skill, simply because they are proximate to the engines of profit.
Right now AI is missing one or two really important things. Entropy - randomly choosing statements, words, images, created or otherwise is really high entropy and should be detectable on this basis alone.
The other thing is that organisms and humans in particular have goals and understand semantics. AI has neither and until then is going to be marginal or maybe useless.
One of my business partners tried ChatGPT to generate descriptions of our product etc. Total rubbish.
These LLMs can even begin to write half-decent SQL only because of the huge training corpus that already exists on the internet. If we decided to change the data models to something more "AI-friendly", I wonder if we'd have to let that bake with humans for some period of time to generate enough Stack Overflow threads to train a new model on.
Really interesting! To be honest I think the marginally more likely scenario is that AI has a quicker impact at the analysis end of the data pipeline, and is used to produce summary analyses, descriptive statistics, etc.
> What if instead of building new patterns for people, we built them for our AI gods?
I for one welcome our new schema-less AI overlords. At least if they are smart enough to give me a list of the actual events they included, AND a sample of the near-misses they excluded.
> Nobody’s going to use a tool that is randomly and confidently incorrect half the time
*cough* Stock analysts *cough* TV experts
Preach it! 🙌
If I were more shameless, I'd start promoting C.U.T.E.
Customers, Users, Transactions, Events.
All based more or less on activity schema. This is actually how many marketing/GTM LOBs and CDPs model data for consumption to APIs for messaging, ads, etc.
I'll buy the AI hype around data when it can deliver on entity resolution to create entity models.
The query to the entity is obvious. Creating these entities is harder.
Entropy (see "Shannon Entropy") is effectively a measure of the randomness of some data. Entropy in a closed system never decreases (2nd law of thermodynamics). In an open system energy can be consumed to reduce randomness. This is a possible definition of life. Certainly James Lovelock proposed it as a way for NASA to detect life on other planets - basically atmospheric analysis would show chemicals that could not occur without energy consumption.
While I don't know how to do this yet, my hypothesis is that if you compare a ChatGPT text with a human-written text (even one by someone not very bright), the entropy of the former will be quite high while that of the latter will be low.
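As a starting point only, here is a minimal sketch of the standard Shannon entropy calculation over character frequencies. Choosing characters as the unit is my assumption; it is a crude measure, and I have not run it on ChatGPT output versus human text, so it does not confirm or refute the hypothesis.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character frequencies in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character probabilities.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string of one repeated character scores 0 bits, and a string that uses four characters equally often scores 2 bits. A word-level or model-based measure would probably be needed before comparing real texts.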
This is at the heart of so many things.
Seen in this light semantics guides the application of energy to a lower entropy.
My research on data models and semantics has shown the importance of a higher level of data model that uses semantics and AI to better understand the business model.
The implication is that with this approach applications can be built that cannot be built using traditional methods.
It's all a bit like the monkeys and typewriters scenario. The assumption that all events, randomly produced, are equally likely is wrong. Low entropy events (e.g. the works of Shakespeare) have a probability near 0 whereas high entropy events (random text) have a probability near 1.
I could go on for days but I trust this is enough for now. :)