Cohort Analysis with SQL

June 28th, 2018


The General Problem

I’m generally suspicious of rollups (instead of just directly passing the salt), but in the case of cohort analysis, I found myself writing the same CTE often enough that we rolled out a cohort table this year. It’s be invaluable and worth sharing. The basic schema (mostly self-explanatory):

user_id (str)
signup (datetime)
days_since_signup* (int)
active (bool)
paid (bool)
plan (str): The user's plan
arr (int): Annual recurring revenue. Valuable to summing the business value of various types of users.
features (list): Features this users has adopted
usage* (list): What features this user has used today
views* (int): How many pages this user has visited today

Queries

The rollup is an abstraction that reformulates questions around user journeys instead of user base states. So instead of “How many active users do we have today?” you’re more inclined to ask “How many of last month’s users have remained active today?”

SELECT cohort_day, float(countif(active)) / count(1) AS active
FROM cohorts
GROUP BY cohort_day

At the end of the first month, what features tend to correlate with activation?

SELECT CAST(features AS STRING) AS features,
  countif(active) AS active
  count(1) AS total,
  float(countif(active)) / count(1) AS percent_active
FROM cohorts
WHERE cohort_day=30
GROUP BY CAST(features AS STRING)

*I’m also inclined to add days_since_paid, but it’s more complicated because you can churn in and out of paid.