Managing database migrations across git branches


Photo by Ishan from @seefromthesky on Unsplash

@here Please run ./migrate next time you merge or rebase.

Database migrations are pretty hard. The combination of migrations and git branches are even more painful. Switching between branches with migrations can break your dev server and create invalid local data. In this exploration, we assume a Django-like migration system, where migrations are numbered monotonic, but not incrementing. You can have migrations with the same number and so long as they’re not altering, updating, deleting, or dropping the same database objects (which is true 99% of the time), you can apply migrations within the same version in any order. Instead of just tracking the migration’s version, the database’s migration table would track name of the migration.1

However, locally checking out branches with migrations is still painful. Rolling back migrations is hard if not impossible, particularly when data is mutated or dropped. This means most code reviewing for migrations are deductive instead of empirical.

But what if you moving between branches locally was seamless? What if there were a tool (I’m dubbing “Digress” here) where every migration should actually migrates a copy of the database? Digress would keep track of which migrations have been applied to each copy. If I go back in history, it switches to that older copy. If I switch to an alternate branch, we swap to the database with the last known migration, make a copy, and then migrate forward.

Digress would snapshot2 the database before applying the migration. If you swap to a branch with a migration based on an older branch, digress migate will notice the migrations directory diverged from the “HEAD”. It will first snapshot the existing database, locate the most recent “common denomiator” snapshot, load it, and then apply your migrations on top of that. Snapshotting before migration ensures the most recent local database state is kept and when you return to master, your “master” state remains.

Note: This is the naive, slower approach. See “Pages as blobs” for a more performant solution.

One of the side effects is objects that had not yet been “born” will vanish when traveling through time. New objects created on one branch would vanish when you checked out another. The data is tied to the state of the migration folder. This is unavoidable because newly inserted data might not even be valid in historical versions of the database. However, the data would re-appear once the migration folder started to look familiar.

The cool thing is not only would you be able to test out other branch migrations, you’d be able to test your own migrations. Each migration file content would be hashed, so if you messed up your migration (ex. missed a not null), you’d just re-write it and re-run digress migrate. Digress would see that the migration’s checksum had changed, revert the database back to right before the incorrect migration was applied, and then apply yours. Unlike other migration digressions, overlapping names would be deleted. There should be no reason why you should have duplicate database migration names.

To track, you’d want a digress database (to manage the snapshots globally), instead of a migrations table in the database. The snapshots table in the digress database would look something like this:

CREATE TABLE snapshots (
  name TEXT PRIMARY KEY,
  created TIMESTAMP DEFAULT NOW(),
  migration TEXT,
  checksum TEXT UNIQUE,
  parent TEXT REFERENCES snapshots,
  head BOOL UNIQUE
);

To get a snapshot’s list of migrations:

with recursive m(migration) as (
	select migration, checksum
  from snapshots
  where parent is null
  union all
	select s2.migration, s2.checksum
  from snapshots s1
  join snapshots s2 on s1.name = s2.parent
)
select * from m;

Migration Ordering

Traditionally, you might number your migration. This gets weird because long-running pull requests with a migration might have a dumb “updating migration filename” commits because someone else beat you to merging.

Instead, digress should store the ordering in the .digress directory. But it doesn’t need to be globally linear…just linear to each database object. For instance, SQL statements that modify table t1 have to be run in order, but anything that modified t2 can be run in any order whatsoever. Within the .digress directory, we’d effectively have one file per database object, with a list of the migration checksums in the order that they should be applied.

With this, git should be able to notify you of conflicts and ask you to manually resolve them. This means you no longer number your migrations with digress (though, you can if you really want to). Instead, after applying a migration, digress examines which Postgres objects were touched (by examining Postgres system catalogs). Only migrations that touch the same “leaf nodes” need to be strictly ordered.

Finally, if the .digress migration ordering is somehow screwed up (someone puts the drop table t1 before the create table t1), digress should error and not migrate the database forward until this is fixed.

Mutable Tables, Immutable Schemas

In any given migrations, the vast majority of the tables aren’t really changing, so dumping and restoring the entire public schema could be slow and space-consuming. It’s also not possible to do within a transaction. If we’re allowing the dev server to maintain its connection, it could be running its own DML or even DDL (though this would admittedly be weird) commands during the dump. Ideally, we’d actually do this within the database.

Triggers and Copies

We can track DDL changes from information_schema, which changes inside transactions and track DML with triggers3. We can then find the dependencies on those tables. Those objects can all be copied and their copies referenced by the digression. All other tables can be kept in the public schema. This is very similar to Immutable.js and Closure.

create function add_to_changes() returns trigger as $$
  begin
    insert into changes default values;
    return null;
  end;
$$ language plpgsql;

begin;
create trigger inserted_trigger after insert on users execute procedure add_to_changes(); -- this holds a lock on users table
create trigger updated_trigger after update on users execute procedure add_to_changes();
-- delete and truncate too!

-- run your migrations here

select * from changes; -- we can see what changed.
drop trigger inserted_trigger on users;
commit;

Say we have a users table. On a development branch, we create migration_A.sql that adds a new column but also a new table that references users. When we run the migration, a new digression is created. Except this digression only contains a copy of users and its dependencies, saving a perfect copy (including indices, constraints, and rows) to the digression that does not have migration_A.sql. You might continue to create objects on this branch. When you check out master again, the database switches back to the mainline digression. Whenever you switch back to that branch, your state is restored. And most importantly, when you merge the branch back into master, the migration_A digression’s state is restored.

If you created objects while in master that you would rather keep, a digress rebase-type operation could be provided to literally allow the migrations to the database’s current state instead. If digress is hooked up via githooks, this might be the default action when checking out master.

To create the copy, within the same transaction as above…

CREATE TABLE t2 (LIKE t1 INCLUDING ALL);
INSERT INTO t2 TABLE t1;

The only “bad” thing is if you have a large table and you migrate its schema (like, drop a column), then we keep an old copy of that table until it’s deliberately pruned. That’s great because it lets you revert back to old database state easily, but it does take up space.

Pages as blobs

This all feels a little brittle though. Adding triggers to all DML statements feels like it might slow everything down, if even just by a millisecond. Why pay that price with each and every transaction when only you’re running migrations only 0.01% of the time.

Let’s step out of the SQL box and examine what we’re actually trying to do. We’re trying to version the data, not just the schema. Where is that data? Postgres is a durable system, so that means the data necessarily must be flushed to the filesystem. It’s actually remarkably easy to find where it is.

show data_directory; -- show where the PGDATA directory is
vacuum full; -- we do this to flush all wal transactions to disk
select pg_relation_filepath('posts'); -- where the file for table data is, relative to the PGDATA directory

Perfect! The data for each table and index is “…stored in a separate file…When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments.”4 If it’s just files, can’t we just use git? Unfortunately, no. Git doesn’t deal well with large, binary files…which is exactly what Postgres data files are. And while there are binary diff’ing tools, they have a big flaw…

bsdiff is quite memory-hungry. It requires max(17n,9n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes.

Why is this so hard? 99% of the time, for a large, append-only table, 99% of the rows are unchanged. I don’t want to be doing massive 1GB filesystem copies when only 2kb of data have changed. Turns out, Postgres has the same problem and has solved it by dividing each file into 8kb pages. This means you can checksum the table file and its pages, storing only the changes in the pages. Generating page diffs can be done concurrently, reading read page into a separate channel and that can be diff’d in parallel. You’d store the diff’s in a DGDATA directory (separate from the .digress directory. The .digress directory should be checked into git, wherease digress database diff’s should not be because they are specific to your local database), much like the .git directory.

Not only would this approach reduce the storage costs, but would also speed things up massively. You’re not tracking, diffing, or applying patches. Instead, you’re chunking, checksumming, and overwriting pages.

Features and use cases

Notifications when database is out of sync with /migrations directory.

YOu can add a post-checkout git hook that checks if the database is in sync with the code, giving developers better visibility. This can just notify you in terminal or it can run digress migrate for you.

Won’t this take up a lot of space?

Most dev machines have reasonable hard drive (>200GB) and localhost databases are very small. When exported and compressed, the distribution of sizes skews small, with the 95 percentile <1MB and the 99.9 percentile <100MB. Combined with the immutable schema approach, it’s unlikely this will ever use more than a few dozen gigs of storage. However, for the exceptional cases, there’s no point in keeping very old snapshots around and digress prune could be used to prune snapshots starting with “oldest” first (the fewest parents and also the oldest created.

Failed Migrations

When a migration fails, represent it in the file, similar to git’s merge conflicts.

No more leading zeros

Django does migration number with a lot of leading zeros. That sucks. It should just be create-user-table.sql.

Collaborating with state

You could even sync state to a remote service. This is particularly useful when working with someone to track down bugs. You may have successfully repro’d the buggy state in your development machine. Debugging can then more easily be parallelized by sharing your state with your colleague. Because we’re treating tables as immutable instead of whole databases, it’s a bit more space efficient. Then, you could check out other people’s state without having to worry about saving and reloading your own. You can even branch off their state and send it back to them and only resend the mutated tables.

Mock databases

Having reasonable mock data for screenshots is really tedious. What if you could just use digressions? You could even tie the digression hash to your visual regression system to do diff’ing. Unlike application mocks, digressions are automatically maintained by your own developers testing out the application.

  1. Of course, the perfect solution would actually tease out such dependencies by parsing the SQL and creating the hierachical DAG. libpg_query and its per-language hooks are excellent for this. However, this would only really save you the minimal effort of prepending your filenames with 01_. In the event that you merge two migrations and there is a conflict, this theoretical solution could auto-resolve it. For instance, it could detect if a migration was altering a table that had been dropped. In such a case, the application code that altered the table still depends it existing. Obviously, this kind of conflict can easily be detect on code review, but if team has grown to the point where a single person is not reviewing all migrations and testing was lax, such a system would catch these edge cases. 

  2. CREATE DATABASE name WITH TEMPLATE='template' can’t be used because no sessions can be connected to the source database. In development, servers are generally left running, making that unacceptable. pg_dump can be used and the snapshots (or more playfully, digressions) stored in a .digress at the root level. It’d be unwise to store them in version control though, as each dev will have their own state/test cases. However, sending/sharing them should be as easy as sending a file. 

  3. information_schema can be used for inspecting dependencies. CREATE TRIGGER and CREATE EVENT TRIGGER are available for detecting DML and DDL changes, respective. Tips for writing trigger functions. Event triggers actually suck because they don’t tell us which 

  4. https://www.postgresql.org/docs/current/storage-file-layout.html