Startup Data Infrastructure

September 17th, 2018


Data infrastructure at big companies gives individual analysts tremendous leverage and flexibility. At Dropbox (and other series B+ startups), every analyst has:

  1. Full SQL access to product tables (minus legally sensitive data)
  2. Linkable queries and durable results for sharing analyses
  3. Scheduled rollups to build shared abstractions and improve query performance

None of that exists at most startups. Early founders can get pretty far with Google Analytics + Google Sheets for business data and psql for product data. Once they find traction and want to JOIN that data together for more sophisticated reports, they’re suddenly awash in commercial products promising to “scale to millions of users” but costing considerable time or money.

When I joined Sentry, I found myself with the same problem. Google Analytics gave me some nice graphs, but for “real” analysis, I would ssh into production, psql a \COPY command, scp the CSV over to my laptop, upload it to Sheets for visualizations, and email charts to the team.
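For reference, the export itself was a single psql meta-command; a sketch of the kind of \copy involved (the query and filename are made up):

    -- Run inside psql on the production box; \copy writes the CSV on the
    -- machine psql is running on, which is what then gets scp'd down.
    \copy (SELECT date_trunc('day', created_at) AS day, count(*) AS signups FROM users GROUP BY 1 ORDER BY 1) TO 'signups.csv' WITH CSV HEADER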

Unfortunately, because I was the only person creating these charts, I spent a lot of time cleaning up the data and persuading the team to use metrics in decision-making. During my time there, I saw us migrate through 3 analytics tools and 2 types of databases, and from that experience, whenever someone asks what they should use for their analytics infrastructure, I overwhelmingly tell them: Postgres.

Because you’ll be relying on the same technology for your data infrastructure as for your application infrastructure, you reap two critical benefits:

  1. It’s cheap, both in dollars1 and time. Using familiar technology means it’s easy to set up and maintain, and data replication is straightforward.
  2. The point at which your application infrastructure no longer scales will be roughly the same time your data infrastructure begins to fail. At that point, you should have either the staff or the funds to build or buy a more bespoke infrastructure.

Starting from this base, the data infrastructure problem breaks down into two parts:

  1. Loading: bringing data from disparate sources into a single Postgres database
  2. Managing: giving your team the access and frameworks to analyze the data

Loading Data

Application

If you’re running a version of Postgres older than 10, use foreign data wrappers. This is a little tricky because every time you deploy a schema migration, you’ll need to re-create the foreign tables; adding that step to your deploy scripts should do the trick.
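A minimal sketch of that setup with postgres_fdw, assuming the analytics database can reach production over the network (hostnames, credentials, and schema names below are placeholders):

    -- On the analytics database.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER production_db
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'prod.internal', dbname 'app', port '5432');

    CREATE USER MAPPING FOR analytics
        SERVER production_db
        OPTIONS (user 'readonly', password 'secret');

    -- Pull the application tables into a local schema. Re-run the import
    -- after every schema migration so the foreign tables pick up new columns.
    CREATE SCHEMA IF NOT EXISTS app;
    IMPORT FOREIGN SCHEMA public
        FROM SERVER production_db
        INTO app;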

If you’re running Postgres 10.4 or newer, I recommend setting up logical replication, which gives you per-table replication and streams changes over as they happen.2 The advantages here are 1. performance and 2. simplicity (no need to re-create foreign tables after each migration).
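A sketch of the two halves, assuming wal_level is set to logical on production and that the table definitions already exist in the analytics database (table names and the connection string are placeholders):

    -- On the production (publisher) database; requires wal_level = logical.
    CREATE PUBLICATION analytics_pub
        FOR TABLE users, projects, subscriptions;

    -- On the analytics (subscriber) database. Logical replication copies rows,
    -- not DDL, so these tables must already be defined here.
    CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=prod.internal dbname=app user=replicator password=secret'
        PUBLICATION analytics_pub;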

Third Party Data

Next is loading third party data (like Intercom, Stripe, Salesforce) into your analytics database. Segment and Stitch both do this and support Postgres as a destination. We’ve experienced occasional replication errors with both, so we built Stripe, Front, and (soon) Intercom sync into our product. Ultimately, the tool doesn’t matter very much: what matters is having high-integrity data.
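Whatever tool does the loading, idempotent upserts keyed on the provider’s own IDs go a long way toward that integrity, since a retried sync updates rows instead of duplicating them. A generic sketch (the stripe_charges table and its columns are hypothetical):

    -- Hypothetical table for synced Stripe charges.
    CREATE TABLE IF NOT EXISTS stripe_charges (
        id         text PRIMARY KEY,   -- Stripe's charge id
        amount     integer NOT NULL,   -- amount in cents
        customer   text,
        created_at timestamptz NOT NULL
    );

    -- Re-running the sync for the same charge updates the row in place.
    INSERT INTO stripe_charges (id, amount, customer, created_at)
    VALUES ('ch_1ABC23', 4200, 'cus_4DEF56', now())
    ON CONFLICT (id) DO UPDATE
        SET amount     = EXCLUDED.amount,
            customer   = EXCLUDED.customer,
            created_at = EXCLUDED.created_at;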

Event Data

Event data is largely browser events (page views and clicks). Segment offers this functionality, but I’ve found it slow (syncing only once a day) and expensive at any real scale. Once-a-day syncing doesn’t matter for the reporting case, but for operators (those working directly with customers), it’s unbearably slow.

To solve this, I helped build Reload while I was at Sentry and Chord afterwards. Chord only works for apps with a few thousand users, but it requires no server administration; Reload requires a small server, but it scales up to hundreds of thousands of users. Both cost pennies on the dollar compared to commercial offerings and provide near real-time data.
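Under the hood, the Postgres side of this kind of pipeline can be as simple as an append-only events table. An illustrative sketch (this is not Reload’s or Chord’s actual schema; all names are made up):

    -- Illustrative append-only events table.
    CREATE TABLE events (
        id          bigserial PRIMARY KEY,
        user_id     bigint,
        name        text NOT NULL,                      -- e.g. 'page_view', 'click'
        properties  jsonb NOT NULL DEFAULT '{}',
        received_at timestamptz NOT NULL DEFAULT now()
    );

    -- Most analysis filters by time, so index the timestamp.
    CREATE INDEX events_received_at_idx ON events (received_at);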

Transforming Data

Every employee at your company should have the ability to write and share analyses (usually SQL). For a small fraction of queries, you’ll want to create rollups, either for convenience (removing junk data) or performance (providing summary stats).

Postgres offers this via CREATE VIEW, or, if your view is particularly expensive, you can materialize it and set a cron job to REFRESH it. If you’re using foreign data wrappers, the transaction that drops and re-creates a foreign table should also re-create any downstream/dependent views: query the catalog for the list of dependent views and recreate them right after the foreign table. To avoid namespace collisions, give each of your employees a schema namespace and a per-schema user so they can play and build their own view abstractions.
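Concretely, the pieces might look like this (the rollup queries, the analyst role, and the app schema holding the replicated application tables are all placeholders):

    -- A convenience rollup; for cheap queries a plain view is enough.
    CREATE VIEW daily_signups AS
        SELECT date_trunc('day', created_at) AS day, count(*) AS signups
        FROM app.users
        GROUP BY 1;

    -- For an expensive rollup, materialize it instead and refresh it from
    -- cron, e.g.: psql warehouse -c 'REFRESH MATERIALIZED VIEW weekly_actives;'
    CREATE MATERIALIZED VIEW weekly_actives AS
        SELECT date_trunc('week', received_at) AS week,
               count(DISTINCT user_id)         AS actives
        FROM events
        GROUP BY 1;

    -- One schema and login per analyst so their abstractions don't collide.
    CREATE ROLE jane LOGIN PASSWORD 'secret';
    CREATE SCHEMA jane AUTHORIZATION jane;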

More than the analytics data itself, this universal access is what’s most likely to stress your data infrastructure (by ballooning the size of your Postgres database). The problem can be mitigated with trust and the data to self-regulate: train your analysts, and give them the per-schema sizes and per-user aggregate query times directly. This obviously won’t scale infinitely, but by the point your analytical load is impressively outpacing your database hardware, the organization will almost invariably have hired a full-time data engineer.
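Both of those numbers are easy to pull out of the catalogs and hand back to the analysts. A sketch (the per-user timings require the pg_stat_statements extension):

    -- Disk usage per schema.
    SELECT schemaname,
           pg_size_pretty(sum(pg_total_relation_size(
               quote_ident(schemaname) || '.' || quote_ident(tablename)))) AS size
    FROM pg_tables
    GROUP BY schemaname
    ORDER BY sum(pg_total_relation_size(
        quote_ident(schemaname) || '.' || quote_ident(tablename))) DESC;

    -- Aggregate query time per user (requires pg_stat_statements).
    SELECT r.rolname,
           round(sum(s.total_time) / 1000) AS total_seconds,
           sum(s.calls)                    AS calls
    FROM pg_stat_statements s
    JOIN pg_roles r ON r.oid = s.userid
    GROUP BY r.rolname
    ORDER BY total_seconds DESC;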

  1. AWS and GCP both have managed offerings that cost less than $10 a month starting out and maybe $100 once you have a terabyte of data. 

  2. Both GCP and AWS offer WAL-based read replicas, but you shouldn’t use those: you’ll want to import other data into your data warehouse, and it’s generally unwise to create tables in read replicas because of the possibility of collisions.