Postgres Analytics Tips
Documenting some tips I’ve picked up doing analytics on Postgres. While you can always just csv dumps into Python, knowing a few Postgres tricks has definitely made it easier for me to do analytics (and not have to context switch between the two languages), as well as making the queries more legible for non-technical folks.
WHERE x in (subquery)
Joins are powerful, but in most analytics queries, you’ll find yourself with an abhorrent number of them. When you’re only using JOINs to filter (and not to surface additional columns), consider using subquery with WHERE.
Example: User’s names who have published a post
SELECT a.name FROM ( SELECT id, name FROM user ) a JOIN ( SELECT DISTINCT user_id FROM post ) b ON a.id=b.user_id
…is made more legible by…
SELECT name FROM user WHERE id IN (SELECT DISTINCT user_id FROM post)
cast(stored_json_text as json)
Postgres 9.5 has the incredibly awesome JSON type, but if you’re running an older version and have been storing your JSON as text and need to query over values, you can use cast to convert the text values into json. You’re asking your database to do a fair amount of work, so limit the number of rows you’re casting.
Once casted to json, you can access them normally.
-> accesses the value directly,
->> accesses the value as text.
Example: Posts that were published on 2016-11-26.
SELECT title FROM ( SELECT cast(metadata_text as json) as metadata_json FROM post ) a WHERE metadata_json->>'date' = '2016-11-26'
For event tables (where you’re storing a list of user actions), querying for users who have completed a single action is easy. However, querying for users who have either completed several specific actions or not completed an action is harder.
Combining array aggregates with ALL allows you to test against all values. I’d recommend filtering out early to prevent aggregate arrays from being too large (and thus taking up too much memory).
Example: Users who haven’t paid
SELECT user_id FROM ( SELECT user_id, array_agg(action) as actions FROM track WHERE action not in ('click', 'page') GROUP BY user_id) a WHERE 'paid' != ALL(actions)