SQL AST IDE

August 25th, 2020


Trees in Hokkaido by Gang Chen
Photo by Gang Chen

Update: My thinking has since shifted.

Writing code can be very repetitive and code editors have long used autocomplete to alleviate this. Ideally, you want to suggest the most appropriate tokens based where you are in the file. Lately, more sophisticated approaches have been developed (namely treesitter and lezer) and editors parse code into an abstract syntax tree (AST) to power their autocomplete. Most of this work has been focused on imperative languages, but I’ve come to realize that SQL editors/BI tools might have to most to benefit in new, novel features.

Tab completion

This is less specific to AST (aside from the normal), but AST is very useful for SQL autocomplete because databases (particularly analytics ones) can have thousands of tables and separating signal from noise is vital for autocomplete, which generally displays 5 or less options.

Autocomplete is usually based on prefix matching sorted alphabetically. It’s a straightforward but frustrating experience. Instead autocompletion should use fuzzy search. Fuzzy search obviates alphabetical sorting, as ordering is by match type:

  1. prefix
  2. word break prefix
  3. fuzzy with first letter
  4. substring
  5. remaining fuzzy

VSCode does fuzzy search, but still requires the first character matches. This makes sense — it’s easier for performance and when searching, you almost always type the first letter. You also might want to respect word breaks, conventionally _ in SQL. So om will match both organization_member and organism, but sorts organization_member first.

Within each match type, sort by usage and alphabetically. Usage is per-user. Cross-org usage should be more used for catalog annotation. The autocomplete mechanics are pretty much standard now (from VSCode). There’s two interesting situations:

⇥|: Show most recently used object (ideally based on the cursor location, limited to the object types expected)

select * from x j|: Alias or join? Here you should probably autocomplete to join. You should assume join. The user can learn to not use keyword-prefixes for their aliases, Esc or type though a space to dismiss the autocomplete.

In addition to token autocomplete, you could even do value autocompletion.

select * from t where x |

Should show common expressions like x is null or x = now()

SQL quickview

An analytics SQL statement can grow to hundreds of lines and at that size, SQL might be one of the least readable languages. A navigable high level summary can give readers a better rough idea of how data is being extracted and transformed. In code, it’s useful to follow the interfaces, instead of trying to understand every line. For instance in golang, focus first on the structs and functions before you look at for loops and if statements. The same can be said with SQL. What matters is what data is being read (tables) and intermediary data structures created during the transformations (joins).

These quick views might just feel like convenience, but when digesting content, convenience is akin to readability . When you wrote a query with Numeracy, we’d give you summary stats on the result columns. If a column was numeric, display a histogram. If enumerated, display a horizontal bar chart of value counts. Of course you could write your own group-by-count queries.

Treemap of the SQL structure, focused on table structures. You’d have this treemap side-by-side with your SQL and clicking on components would highlight the relevant SQL text. From there, you could even run just the highlighted component. This is tremendously convenient with cascading CTEs. When the final result is broken, you often want to re-run each CTE “step” to check the data. An AST-aware editor coule you re-run just the CTE and its dependencies without having to re-write the query (toggling the with () portions).

SQL treemap editor

Goto Definition

Holding down Cmd should trigger an inspection mode. You should be able to hover and click into token definitions. When it’s a CTE, you move selection to the CTE. When its a database object, open up the schema browser to that object type.

Comment toggling

Comment toggling should intelligently removing dangling commas, and or ors. Ideally, it’d even comment out columns and have the ordinal references in group by, order by clauses also be commented out and, if none remain, the entire clause comment itself out.

select
  user_id,
  -- date(created),
  count(1)
from projects
group by 1, -- 2

Parameter types

UI Parameters allow analysts to create dashboards where viewers can re-run the query with different values without needing to understand the SQL. Dumb template variables that create text input boxes can be just as frustrating though: your viewers won’t necessarily know what to put in even. You want properly typed input boxes (especially for datetime).

You could use a custom templating language and discern the parameter’s type based on its position in the query and manifest the corresponding UI input type for viewers to use. In the below case, you could detect date was a timestamp from the projects.created field. For user_id, you’d ideally be able to link it to users.id column so the selector options would be usernames. This is pretty much how Numeracy’s parameters worked, minus the automatic typing.

select * from projects where created = {{ date }} and user_id = {{ user_id }}

Drill down charts

You can detect that two views leverage identical column values (derived from the same table and transformations, ignoring where clauses), and have charts derived from the data to share cursors (hovering over one chart shows the parallel cursor on the other chart). You can even create interactive experiences, for instance a graph showing number of projects created per day:

select date(created), count(1) from projects

If you had another chart on the same dashboard that parameterize the same column value, clicking on a day could open a table that lists all projects created on that day.

select * from projects where date(created) = {{ signup_day }}

Temp table dependencies

Temp tables can both clean up and performance boost when multiple statements rely on the same view. You can offer to consolidate theem into a temp table or even automatically do it when a dashboard’s queries are being run to boost performance. You can even re-create the temp tables, as needed, when you detect the session has been reset and temp tables lost.

AST levels

By default, you edit at the character level, but you can traverse levels. Esc to ascend and Enter to descend.

  1. statement
  2. clause (select, from, where)
  3. expression (potentially only top level expressions)
  4. tokens
  5. characters

Let’s start with this basic SQL statement.

select org_id, count(distinct user_id|) from projects

Hitting Esc repeatedly…

select org_id, count(distinct user_id) from projects
select org_id, count(distinct user_id) from projects
select org_id, count(distinct user_id) from projects
select org_id, count(distinct user_id) from projects

Left arrow key (Home should also work)

|select org_id, count(distinct user_id) from projects

Enter (or Left arrow key)

| select org_id, count(distinct user_id) from projects

Esc

select org_id, count(distinct user_id) from projects

Right arrow key (End should also work)

select org_id, count(distinct user_id) from projects

Enter

select org_id, count(distinct user_id) from projects

Right arrow key

select org_id, count(distinct user_id) from projects

Tab from here should act like the zsh tab completion above, if you hadn’t typed anything.

Enter

select org_id, count(distinct user_id) from projects|

Selection should behave like regular text selection: deletion, cut, copy, and paste should all work.