Backfills, JSON, and Columns

August 7th, 2019


As part of our acquisition, I’ve been refactoring foundational parts of how Numeracy queries Snowflake. Simultaneously, we’re running a version of Numeracy internally at Snowflake. That’s been very useful, both for practical and political reasons, but means we’re still a service and we’re still trying to change the airplane’s engine while in flight. In particular, pairing refactors with backfills have been unexpectedly tricky.

I’ve traditionally given little thought to backfills. Schema migrations were my primary headache: ALTERing tables to ADD COLUMN...NOT NULL DEFAULT... (especially pre-Postgres 11) can lock them for prolong periods and lead to downtime. At Sentry, everything was stored as columns. If you wanted to store any new data, that meant new colums and that meant talking to ops. Backfills were easy though: you could reason about the database’s pretty easily and account for edge cases.

Numeracy stores a lot of application state as JSON. Golang makes JSON serialization very convenient and if you’re never actually searching/filtering on the data, storing it as JSON documents is technically faster. It lets you skip schema migrations, which cause downtime for large tables. ORMs are magic and magic is terrible; JSON serialization gives you the convenience of an ORM without its magic. Let me know if this sounds familiar.

For the most part, it’s been fine. Software is ever-changing and chaotic. Numeracy’s philosophy has been focusing on the interfaces between systems. Implementation can be messy, but as long as the interface is well-defined and stable, you can always evolve the implementation later. The interface between our databsae and backend was intermediated by Golang structs. In particular, their zero values protected you from dealing with the evils of nulls, nils, nones, and undefineds. Pointers could be nil, but for that reason, we tried to pass around structs instead of pointers to them whenever possible. The design of Go, therefore, largely protected us from this data debt and in the meantime, simplified our deploys.

The difficulty is that the interface for the application might be well defined, but the interface for backfills was not defined at all. The value of columns isn’t just SELECTing or WHEREing. It’s in the column constraints and the data interface they created. When you use columns, schema migrations exist to restructure the data, whereas backfills serve to repair bad data. In a JSON world, backfills have to do both. Yes, JSON made prototyping easier in the short run, but we were incurring data debt. Data debt is just as pernicious but often more hidden from the engineer than technical technical. Bad code is in your face whereas data debt is only in production.

Refactoring data is done via backfills which can be just as hard if not more dangerous. Backfills are irreversible and can result in data loss. Sure, you can have backups or even keep data the data in old fields, but the former will still lose any data received between the last backup and restoration and the latter will gradually balloon your database size. JSON data only makes this hard problem harder. Why did the created field on this row get set to null or even worse the string "yes"? Code refactors are usually more test-able or at least, runnable locally. Data refactors are much harder because cloning the production data environment can be impractical, impossible, or even illegal. That greatly increases the odds that your backfill code will have unforeen errors and the likelihood of your database being in a limbo state with both old and new JSON data structures goes up. This is a nightmare scenario (and one that I found myself in).

This is not to say that backfilling columns is infalliable. You still need to deal with the fact that on deploy, you’ll have two different application codebases sending data to your database, even while the backfill is populating. However, you’ll need to reasonable about fewer state matrices and will give you a better shot of having a fast, clean backfill.

Learning from all this, I’ve landed on a much more rigorous backfill process. It might not always be necessary, but when dealing with backfills that refactor JSON structures, I’d recommend it.

  1. When writing the backfill, log liberally and ensure it has a dry-run mode.
  2. Deploy the new feature, but hide it behind a feature flag.
  3. Run the backfill. Continue to hide the feature behind a feature flag. New data should be equally compatible with the old and new code.
  4. Validate backfilled data and add a few test users to the feature flag.
  5. (Gradually) release it broadly.