Pipet

July 6th, 2017

Businesses are increasingly software-first. These software-first businesses are able to operate at massive scale, and therefore produce massive amounts of data. While traditional businesses have had business tools that were systems of record in their own right (Salesforce for customer data, Marketo for marketing), the database will soon replace all of them as the record of truth.

Assumptions:

Software is eating the world.
The database is eating the systems of record.
SQL is eating Excel.

1. Software is eating the world.

This is an old assumption, but it’s held true. Software has remade all industries, from communication to hotels. Silicon Valley’s strength hasn’t been its business acumen or its technical skill (anyone can purchase that), but its understanding that the advantages of being software-first aren’t limited to building a product but extend into building a business.

One of the long-term effects of this will be on the user-to-employee ratio. Before computing, a business might have 100 users per headcount (Bank of America, Starbucks), but automation through software has brought this ratio closer to 1,000 to 1. In the largest consumer businesses, the ratio is more than 10,000 to 1.

2. The database is eating the systems of record.

Developers are choosing their own tools, and increasingly, their business’ tools. When a business is software-first, the database and accompanying internal admin are the first systems of record for customer data.

As businesses grew, Oracle, Salesforce, or NetSuite would sell you tools to manage your business data. Now that software is building businesses and not the other way around, open source (for developer tools) and APIs (for business tools) have become part of the buying process. The effort to synchronize data between two systems (especially if one is a proprietary black box) isn’t trivial. Feature-gated API endpoints, rate limits, and incomplete webhooks (all common problems in business software today) intensify this problem. Developers prefer to buy over build, but most external systems feel crippled compared to the database.

This isn’t to say I don’t like any proprietary tools. Even the most NIH engineer doesn’t propose reinventing Gmail or Stripe. Part of that is the sheer unpleasantness of building these tools yourself (unlike sales or marketing tools, which are thin database wrappers). The other part is the incredible openness they offer. You start using Stripe for payment infrastructure, just as you might use SendGrid for email. But because they are necessarily a system of record for transactions, it becomes trivial to make them the system of record for customers, plans, refunds, etc.

3. SQL is eating Excel.

Excel is used by hundreds of millions of people. But SQL is slowly replacing Excel for extracting, transforming, and analyzing business data. While it’s unlikely existing analysts will learn this new skill, you can see the trend in new hires: most product, business, or financial analysts entering the business world today must be able to write SQL.

We need OSS to pipeline business data to the database.

Over the next few years, we should see these business tools shifting into workflow tools and increasingly offering products to pipeline their data into data warehouses. This belief has played into my fascination with Segment. Early in my time at Sentry, I introduced Segment into our stack. But as its costs grew well beyond its value, I removed and replaced it. I’ve been tinkering with the idea of building an open source tool (Pipet) to pull in business data in real time to Postgres (though eventually to BigQuery). While I once believed this was a “big data” problem and thus too complex to approach as a single engineer, I’ve come to realize that business data is remarkably small and databases like Postgres have become remarkably robust at handling terabytes of data at very little cost and complexity. Is it the perfect system for handling this kind of data (especially event data)? Far from it. But critically, is it good enough at small to medium scale?

To explore this (and the costs associated), I’m going to assume that a baseline server costs about $5/month. Assuming each row in the database is approximately 1 KB, your average user (both anonymous and logged in) produces 32 rows/month¹, and we only need to retain data for 1 year, 1k users produce roughly 0.4 GB of data a year. I’m also assuming Segment discounts heavily (50%, 75%, and 85% for 100k, 1m, and 10m users, respectively).

Users	Disk (GB)	Segment + Redshift	Segment Total	Commodity + BigQuery	Commodity Total	Gross margin
1k	0.4 GB	$0 + $146	$146	$5 + $0.01	$5.01	96%
10k	4 GB	$100 + $146	$246	$5 + $0.05	$5.05	98%
100k	40 GB	$500 + $146	$646	$5 + $0.57	$5.57	99%
1m	400 GB	$2,500 + $438	$2,938	$40* + $5.67	$45.67	99%
10m	4 TB	$15,000 + $992	$15,992	$120* + $56.67	$176.67	99%
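The disk column and the last column of the table can be roughly reproduced with a few lines of Python. This is only a sketch of the arithmetic: the row size, rows per user, and retention period come from the text, while the decimal unit conversion and the reading of the last column as "share of the Segment total saved" are my assumptions. The function names are hypothetical.

```python
# Assumptions from the text: ~1 KB per row, 32 rows per user per month,
# 12 months of retention. Decimal units (1 GB = 1,000,000 KB) are assumed.
ROW_KB = 1
ROWS_PER_USER_PER_MONTH = 32
RETENTION_MONTHS = 12
KB_PER_GB = 1_000_000


def disk_gb(users: int) -> float:
    """Estimated disk needed to retain one year of rows for `users` users."""
    rows = users * ROWS_PER_USER_PER_MONTH * RETENTION_MONTHS
    return rows * ROW_KB / KB_PER_GB


def savings(segment_total: float, commodity_total: float) -> float:
    """Fraction of the Segment total saved by the commodity setup (assumed formula)."""
    return (segment_total - commodity_total) / segment_total


# (Segment total, commodity total) copied from the table above.
TOTALS = {
    1_000: (146.00, 5.01),
    10_000: (246.00, 5.05),
    100_000: (646.00, 5.57),
    1_000_000: (2_938.00, 45.67),
    10_000_000: (15_992.00, 176.67),
}

for users, (segment, commodity) in TOTALS.items():
    print(f"{users:>10,} users  {disk_gb(users):>8.1f} GB  "
          f"{savings(segment, commodity):.1%} saved")
```

Under these assumptions the disk estimates round to the table’s figures (0.384 GB for 1k users, 3,840 GB for 10m), and the savings land within a point or so of the percentages listed.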

As for scalability, if you have any kind of queue in front of the database, you’re limited only by the average insertion performance of Postgres and, of course, the maximum size of your database. Even with a very conservative estimate of 200 inserts per second and no bulk inserts, you can pretty safely assume Postgres will function perfectly well into the millions of users. By the time you’re approaching ten million users and need a more sophisticated system (like Airflow), you can afford a data team to build a bespoke system.
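As a back-of-the-envelope check on that claim, here is a minimal sketch that converts a sustained insert rate into a user ceiling. It reuses the 32 rows per user per month from the cost estimate and the 200 inserts per second above; the 30-day month and the idea that a queue smooths load perfectly evenly are my simplifying assumptions, and `max_users` is a hypothetical name.

```python
INSERTS_PER_SECOND = 200               # conservative figure from the text
ROWS_PER_USER_PER_MONTH = 32           # from the cost estimate
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def max_users(inserts_per_second: int = INSERTS_PER_SECOND) -> int:
    """Users supportable if the queue spreads inserts evenly over the month."""
    rows_per_month = inserts_per_second * SECONDS_PER_MONTH
    return rows_per_month // ROWS_PER_USER_PER_MONTH


print(f"{max_users():,} users")  # 16,200,000 users
```

Real traffic is bursty, so the practical ceiling is lower, but even a large safety factor leaves room for millions of users.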

These cost savings may appear trivial, but they’re not: unlike R&D, they are variable costs, not fixed ones. Reducing the variable cost of your free users can open up previously cost-prohibitive growth tactics. Google’s famed obsession with PUE gave them the COGS necessary to keep their search product free and therefore achieve massive reach. Nowadays, investors’ greatest concern is that per-search costs are increasing. Amplitude spent a tremendous amount of engineering effort to give their free tier up to 10 million events a month, which has built up a free user base in a market dominated by Google Analytics. Dropbox’s “Getting Started” flow (which gave users an additional gig of free space) cost them only a penny a user a month, but unlocked massive growth².

Making the database the new system of record is not only a matter of convenience or cost; it should increasingly become a business need for startups looking for advantages against their incumbents.

¹ Keep in mind this includes users who are just visiting your marketing pages, who will usually only visit a couple of pages before bouncing.
² To give you an idea of what effects that might have on growth, Google Photos gave everyone free “limited” photo backup and hit over 500 million monthly actives in a little over a year. Dropbox reached that many registered users, whether active or not, in a little under a decade.