Postgres Connections and Serverless Computing

September 16th, 2018


AWS Lambda and Google Cloud Functions are an absolute joy to use for glue code or personal projects. Deploys happen in less than a minute, they cost barely anything to run, and they cost nothing when you’re not running them.

Just for fun, I built Chord, which uses Cloud Functions + BigQuery to give myself SQL access to this site’s visitor data. Since I joined Numeracy, I’ve been working with some of our customers to set up basic event tracking. Chord (which requires no ops and is basically free) is an easy pitch.

The problem with using BigQuery as an analytics database is that replicating production data to BigQuery is terrible. Replicating once a day is enough for analysts looking at historical trends, but customer support or sales often need near-realtime (within a minute) access to data, and replicating once a minute quickly becomes prohibitive. Generally, I’ve seen early startups use Postgres as an OLAP database and foreign-data-wrap a production replica, which gives easy, near-realtime access to production state with almost no ongoing operational cost.
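
For reference, the foreign-data-wrapper setup is only a handful of statements. Here’s a rough sketch run against the analytics Postgres instance with psycopg2; the host names, schema names, and credentials below are placeholders, not anything from a real setup:

import psycopg2

# Run once against the analytics (OLAP) instance. Everything quoted below
# is a placeholder: swap in your replica's host, database, and a read-only
# role.
conn = psycopg2.connect(host='analytics-db', dbname='analytics',
                        user='analytics', password='secret')
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
    cur.execute("""
        CREATE SERVER production_replica
            FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'replica-db', dbname 'production', port '5432')
    """)
    cur.execute("""
        CREATE USER MAPPING FOR CURRENT_USER
            SERVER production_replica
            OPTIONS (user 'readonly', password 'secret')
    """)
    cur.execute("CREATE SCHEMA IF NOT EXISTS production")
    # Every table in the replica's public schema now shows up locally as a
    # foreign table under the production schema.
    cur.execute("""
        IMPORT FOREIGN SCHEMA public
            FROM SERVER production_replica INTO production
    """)
conn.close()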

So…I need to switch to Postgres. But unlike my blog, these startups might have thousands of visitors, which raises the question: how far will Chord scale?

Will Chord scale?

The main problem with scaling Chord is database connections.

Because you’re just appending to a table with almost no indices, you don’t start hitting database write constraints until thousands of writes/sec. The main limitation then is the number of Postgres connections. This is generally solved with psycopg2 connection pooling, but Google Cloud explicitly lays out some advice in Connecting to Cloud SQL:

When using a connection pool, it is important to set the maximum connections to 1. Where possible, we recommend that you allocate connection objects in global scope. If you use a connection pool in global scope, then we recommend that you not close connections at the end of the function call.

I was puzzled by this, but then I read “Files that you write consume memory available to your function, and sometimes persist between invocations” and “Cloud Functions often recycles the execution environment of a previous invocation. If you declare a variable in global scope, its value can be reused in subsequent invocations without having to be recomputed.”

So if you have a lot of requests, there’s a fair chance you’ll hit a warm instance and be able to reuse the connection.

import os

import psycopg2
from psycopg2.pool import SimpleConnectionPool

# The pool lives in global scope, so the first request on a cold instance
# pays the cost of establishing the connection; later requests on a warm
# instance reuse it.
pool = SimpleConnectionPool(1, 1,
                            user=os.environ.get('POSTGRES_USER'),
                            password=os.environ.get('POSTGRES_PASSWORD'),
                            host=os.environ.get('POSTGRES_HOST'),
                            port=os.environ.get('POSTGRES_PORT'),
                            database=os.environ.get('POSTGRES_DB'))


def run(request):
    conn = pool.getconn()
    try:
        # your code
        ...
    finally:
        # Return (don't close) the connection so a later invocation on this
        # instance can reuse it.
        pool.putconn(conn)

Buying your way to scale

A Cloud SQL db-n1-highmem-2 instance (about $130/month) can hold up to 200 connections.

A warmed-up Chord function takes ~250ms to run, so 200 connections translate to roughly 800 requests/second (200 ÷ 0.25s). Unfortunately, you can’t scale much further by just throwing money/hardware at the problem: a $1,000/month Postgres instance will only double the maximum connections.

Hacking your way to scale

After that, you can probably reduce your load with some basic hacks:

  1. Queue up events client-side and send them via the Beacon API. Browsers can arbitrarily restrict payload size (generally to 64KB), so send whenever the queue is approaching the maximum payload size, the page is being closed, or a certain amount of time (say, 10 seconds) has elapsed.
  2. Implement 413 retries. If the cloud function isn’t able to secure a connection, return a 413. On the client side, keep the events in the queue and include them in the next attempt. There’s a risk of data loss, but you can always monitor for failed connections and deal with it then. (A server-side sketch follows this list.)
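
Here’s a rough server-side sketch of the second hack; this is my own wiring, not something Chord does today. The pool is created lazily so a refused connection surfaces inside the handler rather than at import time, and both “pool exhausted” and “too many clients” get mapped to the 413:

import os

import psycopg2
from psycopg2 import pool as pgpool

_pool = None


def get_pool():
    # Lazily create the global pool so a failed connection attempt raises
    # here, inside the request, instead of crashing the cold start.
    global _pool
    if _pool is None:
        _pool = pgpool.SimpleConnectionPool(
            1, 1,
            user=os.environ.get('POSTGRES_USER'),
            password=os.environ.get('POSTGRES_PASSWORD'),
            host=os.environ.get('POSTGRES_HOST'),
            port=os.environ.get('POSTGRES_PORT'),
            database=os.environ.get('POSTGRES_DB'))
    return _pool


def run(request):
    try:
        conn = get_pool().getconn()
    except (psycopg2.OperationalError, pgpool.PoolError):
        # Couldn't secure a connection: the client keeps its queued events
        # and resends them with the next batch.
        return ('', 413)
    try:
        events = request.get_json(silent=True) or []
        # bulk-insert `events` here, same as before
    finally:
        get_pool().putconn(conn)
    return ('', 204)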

With either of these hacks, you’re limited not by the number of events, but by the number of simultaneous browser sessions. In my experience, only a few percent of your monthly active users are simultaneously active (sending events, not just having the tab open) on your site at any given second. A few percent of a hundred thousand monthly actives is a couple thousand live sessions, and with the 10-second client-side batching above that’s a few hundred requests/second, comfortably under the ~800/second ceiling. So Chord is probably usable up to, say, a hundred thousand monthly actives, which is a successful B2B business and even a moderately well-off B2C business.

Then what?

After that, it’s up to you to build or buy. If you build, you only need a persistent server that can queue up events and do batch inserts into the database. Queues scale basically infinitely (Google Pub/Sub scales not by number of messages, but by gigabytes per second), and so long as you’re doing bulk inserts, there’s a 99% chance you’ll never need to do anything else (unless you turn out to be a unicorn).
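
If you do build, the write path is short. Here’s a minimal sketch of the bulk-insert side using psycopg2’s execute_values; the table and column names are made up, and the queue-consuming loop is left out:

from psycopg2.extras import execute_values


def flush(conn, events):
    # events: a batch of (ts, session_id, payload) tuples drained from the
    # queue; execute_values inserts the whole batch in one round trip.
    with conn.cursor() as cur:
        execute_values(
            cur,
            'INSERT INTO events (ts, session_id, payload) VALUES %s',
            events)
    conn.commit()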