I’m a man with a short epitaph. If I were to die tomorrow, the best material I’d leave my eulogizers is three shallow scratches I put on strange corners of the internet: A blog post about the price of weed,1 an appeal to move all the driveways in San Francisco, and a sober argument2 against leading commas in SQL queries. “He cared a lot about SQL formatting,” they might say, before butchering a poetry reading:
And did you get what
you wanted from this life, even so?
I did.
And what did you want?
...
To make an overzealous case against an aesthetically controversial but semantically meaningless tic in the geographic grammar of a half-century old query language for databases. Obviously.
Given that I am, in fact, not dead—and evidently satisfied with this as my legacy—I’ve continued to prosecute this case, and canonize the gospel of proper SQL formatting for future generations.3 To this end, I recently made an effort to document the broader set of rules that I follow when I write queries. Here is, in what will surely be a peaceful take, The Answer:
To anyone who disagrees: I will die on these hills. There should be a gutter six spaces in, because it looks nice. SQL keywords are uppercase; table and column names are lowercase.4 The ON
clauses are on different lines than the joins. The first clause in a join condition (e.g., the a
in ON a = b
) references the table in the join, and the second clause references prior tables. Table aliases should be short, initials if possible. And, as should go without saying, select statements use trailing commas.
After easily recording these rules, my review got more complicated. I needed to come up with commandments for SQL’s more sophisticated maneuvers, like case statements and where clauses with parentheses.
It was then that my faith wavered.
What’s the best way to format these parts of a query? The answer—an answer every analyst should love, if not many priests—is that it depends. Case statements, for example, solve a number of problems, and, as such, come in all sorts of shapes and sizes. There’s simply no format that works for all of them.
That’s because a query doesn’t work just because a computer can execute it; it needs to make sense to both machines and humans (because it’s people, who do the reading). Readability includes more than aesthetics (though those do matter). Queries—which, above all, transcribe vague business concepts, like revenue and win rate, into precise formulae—should also make their logic legible. We get frustrated by SQL’s computational anachronisms because they confuse our ability to follow that logic.5 Bad formatting can do the same. Good formatting does the opposite: It actually making queries easier to understand. Formatting is our medium, and the medium is the message.6
A well-formatted query highlights its important and complex elements while tucking straightforward ideas into the background. Just as data visualizations should be formatted in ways that draw out the salient differences between datasets, queries should do the same for the logical narratives they tell. And just as these differences in datasets are best surfaced by different graphs, different logical structures are best cued by different formats. As query authors, consistency shouldn’t be our goal, no more than we should make it our goal to consistently use bar charts. Instead, our job is to identify the computational studs in our queries, and make those inescapable.
For example, the case statement below is simple, and the field using it isn’t overly important to the query. Inserting a bunch of line breaks gives that field outsized importance—importance that isn’t warranted. In this example, it’s better to keep the case statement all on one line.
This next case statement is a long series of if/then statements. Those pairings are the most important thing for the reader to understand. When each logical couple is on the same line, they’re easy to understand. By contrast, a single-line case statement is hard to read, and a case statement that splits the when/then pairs visually mixes the logically different “if” and “then” clauses, breaking the clear path from one to the other.
Compare this approach with the more traditional multi-line format.
In this third example, the case statements are fairly complex, but each one is similar. Rather than follow either of the formats in the prior example, this query should highlight the single critical difference in each case statement. This is best done by leaving each statement on one line, and adding extra space to align each clause. This makes it easy to see the differences and the similarities. Splitting out these case statements onto separate lines hides this pattern, forcing readers to unnecessarily parse and compare each statement individually.
The “correctly formatted” version, however, is incomprehensible.
Moreover, the ideal format for this query might change if there’s only one case statement. In this instance, the important comparison to highlight is within the case statement rather than across different statements.
The same reasoning applies to other elements in a query. Complex where clauses often have nested AND and OR logic. The best way to format these queries is to position them according to their order of operations, much as you would draw a complex mathematical expression in LaTeX.
This approach visually outlines the logic. Standard formatting blurs it.
True, these examples don’t follow The Rules, at least not rules that can be programmed into a SQL formatter. But our job as query writers isn’t to be mechanical scribes; it’s to format our work so that it’s easy to interpret. In many cases, convention is a helpful escort. But in other cases, an unconventional format makes a query more readable than the style guide. So be it—ignore the guide.
The alternative is dogma. Adherence to a prescription simply because it’s prescribed is to be conformist prude; it’s to disappear behind a linter; it’s to elevate etiquette over creativity; it’s to judge our work by how little of ourselves is visible in it. It’s believing that Hemingway’s novels would’ve been better if they’d been edited by Grammarly,7 or that Ansel Adams’s photos could’ve been improved by torching them with an X Pro II filter. Rules, even the best documented and most hallowed ones, are sometimes best ignored.
I still get emails about this one, so apologies to whoever inherits my Gmail account and has to deal with the occasional stoner turned political science grad student looking for data to determine the relationship between cannabis prices and voter turnout.
Unhinged diatribe.
To put this in terms that some databases *cough*Snowflake*cough* might understand, TABLE AND COLUMN NAMES ARE LOWERCASE.
Coming this fall, Mem(ento)SQL, the multi-tenet database: A story where everything is out of order and nothing makes sense.
Counterpoint: I know nothing of this work, and the whole fallacy is wrong.
Or, ironically, the Hemingway App.
Wow. Seems so obvious now, but it never occurred to me. I am saved!
Hey, so like, how would you handle window statements? Example below from mode.com:
SELECT start_terminal,
duration_seconds,
LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
LEAD(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
Seems to meet the interpretability requirements, but the partition clause placement hurts my eyes ever so slightly