Hashing relational data


Hashing relational data

Relational data is a two dimensional multiset.

a b c
m n o
x y z

a,b,c must always be in a row. a,m,x must always be in a column. The dumbest multiset hash algorithm is simple to sort, concat, and hash for each row, the aggregation of rows, and finally for the row and column aggregate hashes.

a  b  c  → r1
m  n  o  → r2
x  y  z  → r3 
↓  ↓  ↓    ↓
c1 c2 c3 → sha(r, c) = m

The “grand hash”1 of the row and column aggregate hashes must be done with an order sensitive hash, not a multiset hash.

Development environment scale

You can hash a large collection of small tables. Assuming a normal project has 100 tables of 10 columns, 1k rows, 32 character values, hashing takes about 1 second and it goes linearly from there: ~1 second per million values. In theory, this can be improved later with better hashing algorithms. The size of the actual values affects hashing time sub-linearly (having tested from 16 to 128 bytes).

Estimating time

Time is mostly dependent on table size. Running ANALYZE takes approx 1s per million rows. autovacuum is not enabled by default in most development environments, so you cannot rely on pg_catalog counts unless ANALYZE is run manually.

select reltuples from pg_catalog.pg_class where relnamespace = ?

Incremental hashing

We cache the vector hash values. Exploiting row xmin, we can find which rows have been changed and by diff’ing against the last snapshot’s xmin, we can figure out which rows and columns must be rehashed.

a  b' c  → r1'
m  n  o  → r2
x' y  z  → r3'
↓  ↓  ↓    ↓
c1 c2 c3 → m
'  '

In this case, r1, r3, c1, c2 need to be rehashed. If we use Facebook’s homomorphic hash (go implementation here), we can even avoid rehashing the entire set. lthash32 should be resistant for up to 4.29b rows, even if all row values are duplicate.

LtHash Collision

lthash16 can have collisions if any multiset member has a multiplicity over 2^16. This is relatively common in columnar data: users.is_deleted columns are 99% the same value (false). Alternatively, adding null columns is common. Columnar hashes will use lthash32, whose hashes are 4096 bytes long.

Unless we can guarantee sort order. Tables with identical primary keys have the same sort, but what is the nature of data? Is it tied to the schema? Or is it independent of it? If we want to guarantee sort order of both the columns AND rows (alphabetical for columns, by primary key for rows), hashing becomes much more straightforward.

  1. Summing all the elements of a matrix is called the “grand sum”, hence grand hash.