The current approach to array hashing and O(1) range hashing is somewhat brittle and complex.
The following might be a clean approach for array hashing that should allow O(log(N)) range hashing for integer ranges.
First, how does Julia compute hashes of numbers? The number is normalized to a 64-bit pattern that preserves isequal-ity; this normalization is mostly linear for small Int64 / UInt64 values. The normalized pattern is then mixed into the h parameter (the hash state).
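As a small sanity check of the isequal requirement (this only illustrates the invariant, not the actual normalization code):

```julia
# isequal values must hash identically, so they must normalize to the same 64-bit pattern
@assert hash(1) == hash(1.0) == hash(UInt64(1))
```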
In order to allow fast range hashing without trying to detect “ranges” by arithmetic on user-defined types, we would need to change a few things.
First: if h === 0, then hash(x, h) = normalize(x). In order to hash a vector of length N, we compute hash(x[i], 0) for all elements and then compute the weighted sum sum(w[i] * hash(x[i], 0) for i=1:N). This is then mixed with the length of the vector and the incoming state. The sum is computed in UInt64, i.e. in the integers mod 2^64.
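A minimal sketch of what the vector case could look like; the constant P, the helper names, and the final mixing are placeholders (the current two-argument hash stands in for the proposed normalize), not a proposal for the actual constants:

```julia
# Placeholder weight base of the form 2*odd + 1; the concrete constant is an open choice.
const P = 2 * 0x9e3779b97f4a7c15 + 1

# Stand-in for the proposed hash(x, 0) == normalize(x).
hash0(x) = hash(x, zero(UInt64))

function weighted_vector_hash(v, h::UInt64)
    s = zero(UInt64)
    w = one(UInt64)
    for x in v
        s += w * hash0(x)    # weighted sum of per-element hashes, wrapping mod 2^64
        w *= P               # w[i] = P^(i-1)
    end
    # finally mix the weighted sum and the length into the incoming state
    return hash(s, hash(UInt64(length(v)), h))
end
```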
We hence need weights w[i] such that we can quickly compute sum(w[i] * (start + (i-1)*step) for i=1:N).
An obvious approach would be a geometric series w[i] = p^(i-1) with p = 2*odd + 1. Then we get

sum(w[i] * (start + (i-1)*step) for i=1:N) = start * sum(p^(i-1) for i=1:N) + step * sum(p^(i-1)*(i-1) for i=1:N)
We can compute both terms in O(log N) time via power-by-squaring and the geometric series (multiplications are also in UInt64; since p - 1 = 2*odd, the odd factor has a (pre-computed) multiplicative inverse in UInt64, and we just need to handle the parity separately in order to compute the sum of the geometric series).
Disadvantage: Hash collisions now appear on affine subspaces instead of pseudo-randomly. The affine subspaces are probably pretty random, though. Also, this does not work for Float ranges.
Maybe-advantage: The hashing might be faster and can be done in parallel (the internal mixing for vectors needs one addition and two multiplications per element, one for updating the weight and one for combining the weight with the element; multiple threads can hash parts of the vector and pay a cheap O(log N) overhead to jump ahead via power-by-squaring).
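A hypothetical sketch of the parallel variant (chunk layout, helper names, and the use of Threads.@threads are purely illustrative): each thread computes its chunk with local weights 1, p, p^2, …, and the partial result is shifted into place by p^offset via power-by-squaring:

```julia
using Base.Threads

# Per-chunk weighted sum with weights 1, p, p^2, ... relative to the chunk start.
function chunk_weighted_sum(v, lo::Int, hi::Int, p::UInt64)
    s = zero(UInt64)
    w = one(UInt64)
    for i in lo:hi
        s += w * hash(v[i], zero(UInt64))
        w *= p
    end
    return s
end

# Each chunk's partial sum is multiplied by p^(lo-1) (power-by-squaring via `^`),
# so the chunks can be hashed independently and summed at the end.
function parallel_weighted_sum(v, p::UInt64; nchunks::Int = nthreads())
    bounds = round.(Int, range(0, length(v); length = nchunks + 1))
    partial = zeros(UInt64, nchunks)
    @threads for c in 1:nchunks
        lo, hi = bounds[c] + 1, bounds[c + 1]
        partial[c] = p^UInt64(lo - 1) * chunk_weighted_sum(v, lo, hi, p)
    end
    return sum(partial)
end
```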
Definite advantage: The admittedly pathological example https://github.com/JuliaLang/julia/issues/26034 does the right thing.
Number theorists are invited to propose their favorite alternate ring / weight function; computer scientists are invited to post a link to a paper that explores this better (nothing new under the sun; I bet that this approach was already analyzed to death in the 70s).
PS. Is this the right forum? I guessed that this is too unfinished to post on the issue tracker.
Also, something might need to be modified for negative ranges.