Alternate approach to O(log(N)) integer range hashing

The current approach to array hashing and O(1) range hashing is somewhat brittle and complex.

The following might be a clean approach for array hashing that should allow O(log(N)) range hashing for integer ranges.

First, how does Julia compute hashes of numbers? The number is normalized to a 64-bit pattern that preserves isequal-ity; this normalization is mostly linear for small Int64 / UInt64. The normalized pattern is then mixed into the h-parameter (the hash state).
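
For illustration, this is the property the normalization provides: values that are isequal hash identically, whatever their type:

```julia
julia> hash(1) == hash(1.0) == hash(0x01)   # isequal-equal values share the normalized pattern
true
```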

In order to allow fast hashing of ranges without trying to detect “ranges” by doing arithmetic on user-defined types, we would need to change some things.

First: if h === 0, then hash(x, h) = normalize(x). To hash a vector of length N, we compute hash(x[i], 0) for all elements and then take the weighted sum of w[i] * hash(x[i], 0). This is then mixed with the length of the vector and the state. The sum is computed in UInt64, i.e. in the integers mod 2^64.
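
As a rough sketch (purely illustrative — `normalize`, `vector_hash`, and the constant `p` are placeholder names, and the final mixing step is just one possible choice):

```julia
# Sketch of the proposed weighted-sum vector hash; not Base's code.
# normalize(x) stands for hash(x, 0); p is an arbitrary constant with
# p ≡ 3 (mod 4), matching the p = 2*odd + 1 choice discussed below.
const p = 0x9e3779b97f4a7c17

normalize(x) = hash(x, UInt64(0))

function vector_hash(v::AbstractVector, h::UInt64)
    s = UInt64(0)            # weighted sum, computed in UInt64 (mod 2^64)
    w = UInt64(1)            # w == p^(i-1), the weight of element i
    for x in v
        s += w * normalize(x)
        w *= p
    end
    # mix the weighted sum and the length into the incoming state
    return hash(s, hash(UInt64(length(v)), h))
end
```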

We hence need weights w[i], such that we can quickly compute sum(w[i] * (start + (i-1)*step) for i = 1:N).

An obvious approach would be a geometric series w[i] = p^(i-1) with p = 2*odd + 1. Then we get

sum(w[i] * (start + (i-1)*step) for i = 1:N) = start * sum(p^(i-1) for i = 1:N) + step * sum(p^(i-1)*(i-1) for i = 1:N)

We can compute both terms in O(log N) time via power-by-squaring and the geometric-series formula. (The multiplications are again in UInt64; since p - 1 = 2*odd, the odd factor has a (pre-computed) multiplicative inverse in UInt64, and we only need to track the parity separately in order to handle the division by 2 when evaluating the geometric series.)
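
For concreteness, here is a hedged sketch of that O(log N) evaluation. It uses plain recursive doubling rather than the closed-form geometric series with a modular inverse, but the asymptotics are the same; all names are made up for illustration:

```julia
# Returns (p^n, S0, S1) mod 2^64, where
#   S0 = sum(p^i     for i = 0:n-1)
#   S1 = sum(i * p^i for i = 0:n-1)
# computed in O(log n) multiplications via recursive doubling.
function geom_sums(p::UInt64, n::UInt64)
    n == 0 && return (UInt64(1), UInt64(0), UInt64(0))
    if isodd(n)
        pk, s0, s1 = geom_sums(p, n - 1)
        return (pk * p, s0 + pk, s1 + (n - 1) * pk)     # append the i = n-1 term
    else
        m = n >> 1
        pm, s0, s1 = geom_sums(p, m)
        # split 0:n-1 into 0:m-1 and m:n-1; the second half is the first one
        # shifted by m and multiplied by p^m
        return (pm * pm, s0 + pm * s0, s1 + pm * (s1 + m * s0))
    end
end

# Weighted sum of the range start, start+step, ..., start+(N-1)*step with
# weights p^(i-1), assuming normalize is the identity on these integers.
function range_weighted_sum(start::UInt64, step::UInt64, N::Integer, p::UInt64)
    _, s0, s1 = geom_sums(p, UInt64(N))
    return start * s0 + step * s1
end
```

A quick comparison against the naive O(N) loop, e.g. `sum(p^(i-1) * (start + (i-1)*step) for i = 1:N)` in UInt64, is an easy way to validate the recursion.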

Disadvantage: Now hash collisions appear on affine subspaces instead of pseudo-randomly. The affine subspaces are probably pretty random, though. Also, this does not work for Float-ranges.

Maybe-advantage: The hashing might be faster and can be done in parallel (the internal mixing for vectors needs an addition and two multiplications per element, one for updating the weight and one for combining the weight with the element; multiple threads can hash parts of the vector and pay a cheap O(log N) overhead to jump ahead via power-by-squaring).
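
A sketch of how the parallel variant could look (the chunking scheme and the `p^offset` combination step are just one possible choice, not a fixed design):

```julia
# Each task hashes its chunk with local weights p^0, p^1, ..., then the
# partial sum is shifted into place with a single p^offset computed by
# power-by-squaring. Illustrative only; equals the sequential weighted sum.
function chunked_weighted_sum(v::AbstractVector, p::UInt64, nchunks::Int)
    N = length(v)
    bounds = round.(Int, range(0, N; length = nchunks + 1))
    partial = zeros(UInt64, nchunks)
    Threads.@threads for j in 1:nchunks
        s = UInt64(0)
        w = UInt64(1)
        for k in (bounds[j] + 1):bounds[j + 1]
            s += w * hash(v[k], UInt64(0))
            w *= p
        end
        partial[j] = p^UInt64(bounds[j]) * s    # O(log N) jump-ahead
    end
    return sum(partial)                         # wraps mod 2^64
end
```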

Definite advantage: The admittedly pathological example https://github.com/JuliaLang/julia/issues/26034 does the right thing.

Number theorists are invited to propose their favorite alternate ring / weight function; computer scientists are invited to post a link to a paper that explores this better (nothing new under the sun; I bet that this approach was already analyzed to death in the 70s).

PS. Is this the right forum? I guessed that this is too unfinished to post on the issue tracker.

Also something might need to be modified for negative ranges.


See:
https://github.com/JuliaLang/julia/pull/26022

My post was partially inspired as an alternative to exactly this discussion 🙂

It avoids the problem of excessive hash collisions when you mutate elements in the middle, and it also builds something more like a hash tree of recursive objects. (Hash trees are good: shorter data dependencies; but that would probably be a separate proposal.)

…still, I would not be surprised if the “just take a couple of elements” approach ended up being quite good in practice (it is really fast! and it also removes the weird dependency on arithmetic of objects).

I feel like we’re kinda shooting in the dark here as I simply have no idea how folks hash arrays in practice. How often are people building up sets of arrays? Or using arrays as dictionary keys?

My hunch is that ranges are actually one of the most common array types that gets hashed, which is particularly ironic given its asymmetry: ranges are always O(1) to check equality.


I recently wanted (but didn’t have) an array hash when I was debugging some (Python) machine learning code. I was trying to figure out when and why values were diverging between different runs, and it turned out that nondeterminism in a GPU reduction meant that the arrays already had swapped rows before their sums (my poor man’s hash) began to differ. So anything that is at least as fast as a sum, but not invariant to transposing some elements, would have been good enough for that particular use case.

My uses:

  1. When developing, I like to save hashes of intermediate results, especially if the computation generates a big result which is then condensed into a smaller thing I actually care about; having (non-colliding) hashes of intermediate results is quite nice for debugging (where did the computation diverge? I can store the intermediate hashes basically indefinitely, because they are small; nice test sets that also check intermediate results that end up having no influence on the final output). I know that I’m lazy here: the right thing to do would be a cryptographic hash (no collisions, stable across Julia versions) that may distinguish types (such that Int64(1) !== Float64(1) but [1,2,3] == [1,2,3]; this should simplify everything).

  2. Dict-access: Small integer arrays that encode combinatorial information. For example partitions of some set, permutations, subsets, etc. I know that I’m lazy here: I should define my own type, and skipping this step is the road to unmaintainable code-graveyards; but it is convenient to quickly do so. Depending on encoding, this may even lead to hashing of vector-of-vector-of-int.

  3. This is part of (2), but I’ll make it a separate point. Slapping a dict-backed cache in front of an O(N^2) algorithm in order to make it O(N) is just so damn convenient in cases where speed does not really matter. Of course this is very lazy compared to reasoning about which elements must be already computed and figuring out a more direct index structure.

So for all three cases, the proposed “hash only a small subset” approach is pretty catastrophic because my hashed arrays tend to be really similar.

However, all three of my uses boil down to convenience / laziness on my side, and might be considered abuse of the language feature. From this point of view, I am not that concerned about the “hash only a small subset” approach; it might be the right thing, even if it is bad for me.

For very numerical floating-point code with long-range dependencies, hashing the entire array is absolute overkill. On the other hand, who does dict lookup based on arrays that are “numerical in nature” instead of “combinatorial in nature”?

Upon reconsideration, I think this approach is right. Float-ranges don’t hash in O(1) anyway; we only lose fast hashing of BigInt ranges.

For integer ranges, we formally get worse, from O(1) to O(log(N)). The log(N) part is the power-by-squaring, and it is effectively O(1), because even the worst case is pretty fast:

```julia
using BenchmarkTools

@btime ^($(reinterpret(UInt64, -7)), $(reinterpret(UInt64, -1)))
  224.646 ns (0 allocations: 0 bytes)
0x9249249249249249
```

One remaining decision is the following: either we let x::Int64 and reinterpret(UInt64, x) collide, or we give up on fast hashing of integer ranges containing negative entries, or we give up on fast hashing of UInt64 ranges whose entries are negative when reinterpreted as Int64. I guess the last option is the right thing.

This would need a change in the normalization of numbers: Int64 remains unchanged; UInt64 either remains unchanged or becomes garbage; Float64 either becomes garbage or gets converted to an Int or a UInt; BigInts check whether they happen to fit into an Int (they are small) or a Float (trailing zeros).
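
To make that concrete, here is one hedged reading of the “give up on large UInt64s” option chosen above (illustrative only, not what Base does; `scramble` stands for any good 64-bit bit mixer, and the BigInt case would follow the same pattern):

```julia
# One possible normalization; purely illustrative.
scramble(u::UInt64) = hash(u)      # placeholder for a good 64-bit bit mixer

normalize(x::Int64)  = reinterpret(UInt64, x)                       # stays linear
normalize(x::UInt64) = x <= UInt64(typemax(Int64)) ? x : scramble(x)

function normalize(x::Float64)
    if isinteger(x)
        # integer-valued floats must agree with the matching Int64 / UInt64
        typemin(Int64) <= x <= typemax(Int64) && return normalize(Int64(x))
        0 <= x <= typemax(UInt64)             && return normalize(UInt64(x))
    end
    return scramble(reinterpret(UInt64, x))                         # the "garbage" branch
end
```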

I just read the concern about hashing sparse arrays; I think that can be dealt with reasonably easily (although it might slow down hashing of dense arrays, if they are really sparse).
Simply take the indices that Matt mentioned, but if the value there is zero, scan forwards or backwards (in a dense array) to the first non-zero value and use that for the hash; the scanning should probably be limited so that it does not go as far as the last index actually used for the hash.
That would mean that, in the sparse case, no scanning at all is needed (unless you have some zeros actually stored).
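
As a rough sketch of that idea (illustrative only; `probes` stands for whatever hypothetical set of sampled indices the “hash a few elements” approach ends up using):

```julia
# Probe a few indices; on a zero, scan forward (backwards omitted for
# brevity) to the first non-zero entry, never past the last probed index.
function probe_hash(A::AbstractVector, probes::AbstractVector{Int}, h::UInt)
    lastprobe = maximum(probes)
    for i in probes
        x = A[i]
        j = i
        while iszero(x) && j < lastprobe
            j += 1
            x = A[j]
        end
        h = hash(x, h)
    end
    return hash(length(A), h)    # mix in the length as well
end
```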