There is some discussion over at github about switching out murmur3 for a different hash function.
If we open such a can of worms, then we might also address structural pecularities of our hash construction. I’ll start this off with the following structural collision that I haven’t seen discussed before, for your special amusement @StefanKarpinski :
The underlying issue is that julia is a very special unicorn, with respect to hashing.
It is built on Base.hash_uint, which is appropriately modeled as a random bijection (and is the murmur3 mixing function in a trenchcoat).
From that, Base uses a haphazard mix (no pun intended) of compression functions hash(x::T, h0::UInt) = ..., which often is something linear a la hash(x, h) = hash(x) - 3h or something like that.
This is somewhat advantageous: The superscalar CPU will compute all the different hash_uint invocations in parallel since there is no data dependency!
It is also somewhat risky: The linearity and simplicity of the chaining can expose tons of structural collisions or relations.
It is pretty hard to get all 3 of: Trivial extendibility, hidden superscalar parallelism and no structural relations.
So I’d wait until Stefan chimes in about how intended that hidden superscalar parallelism was in his design, and we’d need some extensive benchmarking to see how critical it is to current performance.
Generally, there is a field that deals with that kind of issues: Cryptographic engineering. Out of primitives (e.g. murmur3 mixing function, or rapidhash’s compression function) you build all the surrounding stuff, in a way that ensures (and proves!) the following property:
If the underlying primitive is ideal, then the resulting construction is also ideal.
Then it’s the job of cryptographers to come up with good primitives; or for us, to plug in something non-cryptographic.
(proving that a primitive is good is beyond us – we can’t even prove P != NP. But relative security proofs are often quite simple, and failure of generating such a proof often points to a real flaw)
PS. If you’re at it, x=>(y=>z) and (x=>y)=>z is another structural collision.
Fundamentally, the built-in hash function is not intended to be anything close to cryptographically secure, so looking at the cryptography literature may lead one astray here?
Regarding structural collisions, I would say that fundamentally Julia is optimized for containers with mostly concrete types, including Dict and Set types with a concrete type. So a Dict where the key could either be (x,y,(z,)) or (x,(y,z,)) is already deeply suboptimal in Julia even if there is no hash collision, so I’m not sure if it’s worth it to try to remove hash collisions in such cases if that sacrifices performance for the more important concrete-key case.
(IdDict is probably a better choice if you have heterogeneous key types, and won’t have such structural collisions.)
I don’t see how this is related to which hash method is being used at all. The trouble is exactly what @nsajko fixes in the two linked PRs — it’s very beneficial if hash functions do more than just hash the components in order.
It doesn’t matter how anemic or cryptographically secure the hash function is. If hash(p::Pair, h) is defined as simply hash(p.second, hash(p.first, h)), you’re going to see collisions with unrelated structures and other associativities.