About the low quality of Base.hash()

I was implementing a 1d noise function where hashes of Ints were needed, when I found Base.hash() is of surprisingly low quality. I wonder whether this should be an issue or not.

Consider the following example:

using Plots
x = -100:100
plot(x, hash.(x), label="hash(x)")

yields


Notice that it exhibits regular patterns for positive x, and the patterns are substantially inconsistent with negative x, which indicates bad hash quality.
Glancing at the implementation at base/hashing.jl, the hash_64_64() function only utilizes additions, shifts and exclusive-ors, so no wonder it has such bad quality.
Empirically, we should introduce multiplications. Replacing it with a hash derived from O’Neill, M. E. 2014. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. gives a better result:

using Plots
function myhash(x::UInt64)
       x = 0xd55e8028f0ba2e7d * (x + UInt64(1))
       x ⊻= x >>> ((x >>> 59) % UInt8 + UInt8(6))
       x = 0xd55e8028f0ba2e7d * (x + UInt64(1))
       x ⊻ (x >>> ((x >>> 59) % UInt8 + UInt8(6)))
end
myhash(x::Int64) = myhash(reinterpret(UInt64, x))
x = -100:100
plot(x, myhash.(x), label="myhash(x)")

1 Like

Sorry my last thread was carelessly deleted by myself.

hash is not a cryptographic hashing function, it’s mostly used as a default fallback for Dict and Set and the like. For this purpose, more randomness can be a negative thing. Hence, the quality you observe may not necessarily be the most important factor for its implementation.

8 Likes

I see. But does Julia provide fast random integer hashing? SHA and CRC32 are too slow for my purpose.

I’m not sure that you’ll find exactly what you’re looking for in the stdlib. There is fast random number generation in the Random stdlib (using Xoshiro), but probably nothing purpose made as a 1D noise function.

Depending in your domain, there may be a package that does this?

1 Like

Thanks. I’ll implement it myself.

1 Like

see also use rapidhash by adienes · Pull Request #57509 · JuliaLang/julia · GitHub

4 Likes

This is utter nonsense. For example, chacha20 is a perfectly fine block cypher using only additions, shifts and xors.

There is no inherent quality, security, or performance issue with restricting to these operations. It might be a questionable choice for julia specifically – julia is typically run on CPUs that have good multiplication circuitry anyways; might as well use the cheap multiply if you don’t need to run on micro-controllers / ASIC.

Whether it is a good idea to use the existing fast multiply circuitry is something up to debate, in hashing, cryptography, random number generation. It certainly looses flexibility when porting to microcontrollers / fpgas / asics, and its not entirely clear whether the multiply buys you anything compared to a well-designed addition/shift/xor algorithm. But it is certainly much easier for non-experts to produce something non-terrible when you allow them multiply! (I am not an expert on that level. Don’t ask me to design crypto primitives.)

I also question the validity of your plot. Your plot simply looks at the most significant digits, but julia hash’s design goals are concentrated on giving best quality on the least significant digits! (for use in Dict / Set)

Further, if you want to look at the literature, julia’s hash is afaiu based on some murmur version.

6 Likes

This is stated a bit harshly but the basic point is correct: you can’t judge the quality of a hash function by the simplicity of the primitives it uses—even the most secure hashes are built from simple basic operations.

If anything, our hash should probably be worse. It likely does more work than is necessary for its purpose which is to be just good enough to minimize hash collisions. Python somewhat famously uses the identity function for hashing small positive integers. I’m not saying we should do that, but it’s just a case where a “bad” hash function can make sense for a variety of reasons.

14 Likes