About the low quality of Base.hash()

noteas · April 4, 2025, 1:55pm

I was implementing a 1d noise function where hashes of Ints were needed, when I found Base.hash() is of surprisingly low quality. I wonder whether this should be an issue or not.

Consider the following example:

using Plots
x = -100:100
plot(x, hash.(x), label="hash(x)")

yields

Notice that it exhibits regular patterns for positive x, and the patterns are substantially inconsistent with negative x, which indicates bad hash quality.
Glancing at the implementation at base/hashing.jl, the hash_64_64() function only utilizes additions, shifts and exclusive-ors, so no wonder it has such bad quality.
Empirically, we should introduce multiplications. Replacing it with a hash derived from O’Neill, M. E. 2014. PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation. gives a better result:

using Plots
function myhash(x::UInt64)
       x = 0xd55e8028f0ba2e7d * (x + UInt64(1))
       x ⊻= x >>> ((x >>> 59) % UInt8 + UInt8(6))
       x = 0xd55e8028f0ba2e7d * (x + UInt64(1))
       x ⊻ (x >>> ((x >>> 59) % UInt8 + UInt8(6)))
end
myhash(x::Int64) = myhash(reinterpret(UInt64, x))
x = -100:100
plot(x, myhash.(x), label="myhash(x)")

noteas · April 4, 2025, 1:56pm

Sorry my last thread was carelessly deleted by myself.

Sukera · April 4, 2025, 1:59pm

hash is not a cryptographic hashing function, it’s mostly used as a default fallback for Dict and Set and the like. For this purpose, more randomness can be a negative thing. Hence, the quality you observe may not necessarily be the most important factor for its implementation.

noteas · April 4, 2025, 2:02pm

I see. But does Julia provide fast random integer hashing? SHA and CRC32 are too slow for my purpose.

Sukera · April 4, 2025, 2:04pm

I’m not sure that you’ll find exactly what you’re looking for in the stdlib. There is fast random number generation in the Random stdlib (using Xoshiro), but probably nothing purpose made as a 1D noise function.

Depending in your domain, there may be a package that does this?

noteas · April 4, 2025, 2:15pm

Thanks. I’ll implement it myself.

Oscar_Smith · April 4, 2025, 2:26pm

see also use rapidhash by adienes · Pull Request #57509 · JuliaLang/julia · GitHub

foobar_lv2 · April 4, 2025, 4:47pm

This is utter nonsense. For example, chacha20 is a perfectly fine block cypher using only additions, shifts and xors.

There is no inherent quality, security, or performance issue with restricting to these operations. It might be a questionable choice for julia specifically – julia is typically run on CPUs that have good multiplication circuitry anyways; might as well use the cheap multiply if you don’t need to run on micro-controllers / ASIC.

Whether it is a good idea to use the existing fast multiply circuitry is something up to debate, in hashing, cryptography, random number generation. It certainly looses flexibility when porting to microcontrollers / fpgas / asics, and its not entirely clear whether the multiply buys you anything compared to a well-designed addition/shift/xor algorithm. But it is certainly much easier for non-experts to produce something non-terrible when you allow them multiply! (I am not an expert on that level. Don’t ask me to design crypto primitives.)

I also question the validity of your plot. Your plot simply looks at the most significant digits, but julia hash’s design goals are concentrated on giving best quality on the least significant digits! (for use in Dict / Set)

Further, if you want to look at the literature, julia’s hash is afaiu based on some murmur version.

StefanKarpinski · April 4, 2025, 10:04pm

This is stated a bit harshly but the basic point is correct: you can’t judge the quality of a hash function by the simplicity of the primitives it uses—even the most secure hashes are built from simple basic operations.

If anything, our hash should probably be worse. It likely does more work than is necessary for its purpose which is to be just good enough to minimize hash collisions. Python somewhat famously uses the identity function for hashing small positive integers. I’m not saying we should do that, but it’s just a case where a “bad” hash function can make sense for a variety of reasons.

Topic		Replies	Views
Use of hash code in integer hashing Internals & Design	3	692	June 2, 2022
Overloading Base.hash is 2x slower than an identical function Performance	1	344	June 17, 2021
Fast hash for custom type General Usage question , performance , hash	10	881	September 17, 2021
Hash(1) change between 1.5 and 1.6 General Usage	7	513	February 28, 2022
Is this expected `hash`ing behaviour?```julia> a = [rand(0:1, 200) for i in 1: General Usage	1	248	February 21, 2021

About the low quality of Base.hash()

Related topics