Cohort Analysis with SQL
I’m generally suspicious of rollups (instead of just directly passing the salt), but in the case of cohort analysis, I found myself writing the same CTE often enough that we rolled out a cohort table this year. It’s be invaluable and worth sharing. The basic schema (mostly self-explanatory):
user_id (str) signup (datetime) days_since_signup* (int) active (bool) paid (bool) plan (str): The user's plan arr (int): Annual recurring revenue. Valuable to summing the business value of various types of users. features (list): Features this users has adopted usage* (list): What features this user has used today views* (int): How many pages this user has visited today
The rollup is an abstraction that reformulates questions around user journeys instead of user base states. So instead of “How many active users do we have today?” you’re more inclined to ask “How many of last month’s users have remained active today?”
SELECT cohort_day, float(countif(active)) / count(1) AS active FROM cohorts GROUP BY cohort_day
At the end of the first month, what features tend to correlate with activation?
SELECT CAST(features AS STRING) AS features, countif(active) AS active count(1) AS total, float(countif(active)) / count(1) AS percent_active FROM cohorts WHERE cohort_day=30 GROUP BY CAST(features AS STRING)
*I’m also inclined to add
days_since_paid, but it’s more complicated because you can churn in and out of paid.