ETL sucks

January 26th, 2023


Carrier has arrived

In my early days at Sentry, I would check on and email users who got stuck during setup: “hi, looks like you didn’t finish getting set up. i’m here if you have questions.” I wanted to figure out a way to automate this email without spamming the user: no generic getting started guides or disruptive popover product tours. My email should reflect the user’s context: where they got stuck and what they already tried. The user shouldn’t be bombarded by emails from support, sales, and marketing. If they’d already emailed support@ or if they’d tweeted at us, we should reply there. Customers should feel like they’re talking with a cohesive team, not OKRs with email addresses.

To do this, I needed usage data for both our product and our SaaS services. What features had the customer already tried? What docs had they read? Had they already sent a support email? The typical solution: ETL to push data to a database like Postgres or Snowflake. But ETL products (like Segment, Fivetran, Stitchdata) run periodically instead of continuously, and the fastest self-serve ETL I could find synced every 15 minutes.

Most ETL products deal with big data, bulk syncing the entire state from databases and SaaS apps to data warehouses, so syncing must be periodic. This more than suffices for their most important use case: minutes-long queries that populate dashboards for tomorrow’s meeting. A comprehensive, big data view of the business is more important than an up-to-the-minute one. In that view, a 15-minute latency is not bad. Customers will wait 15 minutes for their support ticket. Hell, most of the time I wouldn’t even see the support ticket before the next sync.

The periodic syncing still bothered me. It felt like lag. Most StarCraft players’ APM (actions per minute) is under 100, but everyone rages when their lag spikes to 500ms. Or the frustration of navigating Jira just to update your issues. Good work requires sharp tools. In software, sharp doesn’t mean nice UIs or huge feature sets. Sharp means fast. You can’t do sub-second big data sync of third-party service state. But you can push and react to small, just-in-time data (aka events) very quickly.

Enter event streams. That’s long been the answer, and Kafka is the elephant in the room here. And while Kafka is battle-tested and enterprise-approved, it’s complicated and overkill, like Kubernetes when you just needed Heroku. Where was the event stream when I just wanted a <form action=...> endpoint? Or moves in a tic-tac-toe game?

Pipet is an Array with a URL. Like pipet.io/eric/foobar. If you POST to that URL, the request’s body is pushed to that Array. You can grab it with a GET request to pipet.io/eric/foobar[0].
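On the wire it’s just HTTP. Here’s a rough sketch of what I have in mind (inside an async function); the exact request and response shapes are my guess, not a settled API:

// push an event onto the stream (assuming the JSON body becomes the event's data)
await fetch('https://pipet.io/eric/foobar', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ data: 'foo' }),
})

// read the first event back off the Array
const first = await fetch('https://pipet.io/eric/foobar[0]').then(res => res.json())

Want to count up all the foos? You run reducers on the stream to aggregate data, just like a JavaScript Array.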

// simple event counter reducer
(state, event, index, stream) => {
    if (event.data === 'foo') return state + 1
    return state
}
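Running it is the Array move you already know. A sketch, assuming events are objects like { data: 'foo' }:

// the counter reducer above, given a name
const counter = (state, event, index, stream) =>
    event.data === 'foo' ? state + 1 : state

// fold over the events seen so far...
const events = [{ data: 'foo' }, { data: 'bar' }, { data: 'foo' }]
let count = events.reduce(counter, 0)      // 2

// ...then fold in each new event as it arrives
count = counter(count, { data: 'foo' })    // 3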

Whereas SQL queries get slower as the tables grow, a reducer returns new state in constant time. When a new event arrives, the reducer is passed the last state and the new event, and it returns the next state. Plus, JavaScript is an imperative language, so it’s straightforward to write conditional triggers.

(state, event, index, stream) => {
    // too many foo's!
    if (state > 9000) request('https://slack.com/api...')
    // ...
    return state
}

This is an event stream but built for JavaScript. Pipet’s not fully baked, but I’ve sketched out the rough shape of it. Not sure if people need “Arrays with URLs”, but if you’re looking for one, talk to me.