Managing database migrations across git branches
@here Please run
./migratenext time you merge or rebase.
Database migrations are pretty hard. The combination of migrations and git branches are even more painful. Switching between branches with migrations can break your dev server and create invalid local data. In this exploration, we assume a Django-like migration system, where migrations are numbered monotonic, but not incrementing. You can have migrations with the same number and so long as they’re not altering, updating, deleting, or dropping the same database objects (which is true 99% of the time), you can apply migrations within the same version in any order. Instead of just tracking the migration’s version, the database’s migration table would track name of the migration.1
However, locally checking out branches with migrations is still painful. Rolling back migrations is hard if not impossible, particularly when data is mutated or dropped. This means most code reviewing for migrations are deductive instead of empirical.
But what if you moving between branches locally was seamless? What if there were a tool (I’m dubbing “Digress” here) where every migration should actually migrates a copy of the database? Digress would keep track of which migrations have been applied to each copy. If I go back in history, it switches to that older copy. If I switch to an alternate branch, we swap to the database with the last known migration, make a copy, and then migrate forward.
Digress would snapshot2 the database before applying the migration. If you swap to a branch with a migration based on an older branch,
digress migate will notice the
migrations directory diverged from the “HEAD”. It will first snapshot the existing database, locate the most recent “common denomiator” snapshot, load it, and then apply your migrations on top of that. Snapshotting before migration ensures the most recent local database state is kept and when you return to master, your “master” state remains.
One of the side effects is objects that had not yet been “born” will vanish when traveling through time. New objects created on one branch would vanish when you checked out another. The data is tied to the state of the migration folder. This is unavoidable because newly inserted data might not even be valid in historical versions of the database. However, the data would re-appear once the migration folder started to look familiar.
Digress would also warn on ordering problems. If the “missing” migration was not numbered at least the latest, it could prompt the user if it should proceed, listing out all the migrations that are “after” the to-be-applied migration for the ops person to inspect. As mentioned before, the migration numbering is a guidance, not a strict requirement.
The cool thing is not only would you be able to test out other branch migrations, you’d be able to test your own migrations. Each migration file content would be hashed, so if you messed up your migration (ex. missed a
not null), you’d just re-write it and re-run
digress migrate. Digress would see that the migration’s checksum had changed, revert the database back to right before the incorrect migration was applied, and then apply yours. Unlike other migration digressions, overlapping names would be deleted. There should be no reason why you should have duplicate database migration names.
To track, you’d want a
digress database (to manage the snapshots globally), instead of a
migrations table in the database. The
snapshots table in the
digress database would look something like this:
CREATE TABLE snapshots ( name TEXT PRIMARY KEY, created TIMESTAMP DEFAULT NOW(), migration TEXT, checksum TEXT UNIQUE, parent TEXT REFERENCES snapshots, head BOOL UNIQUE );
To get a snapshot’s list of migrations:
with recursive m(migration) as ( select migration, checksum from snapshots where parent is null union all select s2.migration, s2.checksum from snapshots s1 join snapshots s2 on s1.name = s2.parent ) select * from m;
Mutable Tables, Immutable Schemas
In any given migrations, the vast majority of the tables aren’t really changing, so dumping and restoring the entire public schema could be slow and space-consuming. It’s also not possible to do within a transaction. If we’re allowing the dev server to maintain its connection, it could be running its own DML or even DDL (though this would admittedly be weird) commands during the dump. Ideally, we’d actually do this within the database.
We can track DDL changes from information_schema, which changes inside transactions and track DML with triggers3. We can then find the dependencies on those tables. Those objects can all be copied and their copies referenced by the digression. All other tables can be kept in the
public schema. This is very similar to Immutable.js and Closure.
create function add_to_changes() returns trigger as $$ begin insert into changes default values; return null; end; $$ language plpgsql; begin; create trigger inserted_trigger after insert on users execute procedure add_to_changes(); -- this holds a lock on users table create trigger updated_trigger after update on users execute procedure add_to_changes(); -- delete and truncate too! -- run your migrations here select * from changes; -- we can see what changed. drop trigger inserted_trigger on users; commit;
Say we have a
users table. On a development branch, we create
migration_A.sql that adds a new column but also a new table that references
users. When we run the migration, a new digression is created. Except this digression only contains a copy of
users and its dependencies, saving a perfect copy (including indices, constraints, and rows) to the digression that does not have
migration_A.sql. You might continue to create objects on this branch. When you check out master again, the database switches back to the mainline digression. Whenever you switch back to that branch, your state is restored. And most importantly, when you merge the branch back into master, the
migration_A digression’s state is restored.
If you created objects while in master that you would rather keep, a
digress rebase-type operation could be provided to literally allow the migrations to the database’s current state instead. If digress is hooked up via githooks, this might be the default action when checking out master.
To create the copy, within the same transaction as above…
CREATE TABLE t2 (LIKE t1 INCLUDING ALL); INSERT INTO t2 TABLE t1;
The only “bad” thing is if you have a large table and you migrate its schema (like, drop a column), then we keep an old copy of that table until it’s deliberately pruned. That’s great because it lets you revert back to old database state easily, but it does take up space.
Features and use cases
Notifications when database is out of sync with
YOu can add a
post-checkout git hook that checks if the database is in sync with the code, giving developers better visibility. This can just notify you in terminal or it can run
digress migrate for you.
Won’t this take up a lot of space?
Most dev machines have reasonable hard drive (>200GB) and localhost databases are very small. When exported and compressed, the distribution of sizes skews small, with the 95 percentile <1MB and the 99.9 percentile <100MB. Combined with the immutable schema approach, it’s unlikely this will ever use more than a few dozen gigs of storage. However, for the exceptional cases, there’s no point in keeping very old snapshots around and
digress prune could be used to prune snapshots starting with “oldest” first (the fewest parents and also the oldest
When a migration fails, represent it in the file, similar to git’s merge conflicts. Out of order mergers (the filename numbering system has failed)
No more leading zeros
Django does migration number with a lot of leading zeros. That sucks. It should just be
1_create_tables.sql and when you hit ten,
digress should rename the files for you, adding leading zeros.
Collaborating with state
You could even sync state to a remote service. This is particularly useful when working with someone to track down bugs. You may have successfully repro’d the buggy state in your development machine. Debugging can then more easily be parallelized by sharing your state with your colleague. Because we’re treating tables as immutable instead of whole databases, it’s a bit more space efficient. Then, you could check out other people’s state without having to worry about saving and reloading your own. You can even branch off their state and send it back to them and only resend the mutated tables.
Having reasonable mock data for screenshots is really tedious. What if you could just use digressions? You could even tie the digression hash to your visual regression system to do diff’ing. Unlike application mocks, digressions are automatically maintained by your own developers testing out the application.
Of course, the perfect solution would actually tease out such dependencies by parsing the SQL and creating the hierachical DAG. libpg_query and its per-language hooks are excellent for this. However, this would only really save you the minimal effort of prepending your filenames with
01_. In the event that you merge two migrations and there is a conflict, this theoretical solution could auto-resolve it. For instance, it could detect if a migration was altering a table that had been dropped. In such a case, the application code that altered the table still depends it existing. Obviously, this kind of conflict can easily be detect on code review, but if team has grown to the point where a single person is not reviewing all migrations and testing was lax, such a system would catch these edge cases. ↩
CREATE DATABASE name WITH TEMPLATE='template'can’t be used because no sessions can be connected to the source database. In development, servers are generally left running, making that unacceptable.
pg_dumpcan be used and the snapshots (or more playfully, digressions) stored in a
.digressat the root level. It’d be unwise to store them in version control though, as each dev will have their own state/test cases. However, sending/sharing them should be as easy as sending a file. ↩
information_schemacan be used for inspecting dependencies.
CREATE EVENT TRIGGERare available for detecting DML and DDL changes, respective. Tips for writing trigger functions. Event triggers actually suck because they don’t tell us which ↩