Structural hash collisions

There is some discussion over at GitHub about switching out murmur3 for a different hash function.

If we open such a can of worms, then we might also address structural peculiarities of our hash construction. I’ll start this off with the following structural collision that I haven’t seen discussed before, for your special amusement @StefanKarpinski :

julia> x,y,z = rand(3);h0=rand(UInt); hash((x,y,(z,)), h0) == hash((x,(y,z,)),h0)
true

This is just one of a whole family of related structural collisions.

Is this a problem or just a fun little peculiarity? Should it move to GitHub?

edit: on github


Of course. AFAIK a hash collision is not a bug, but getting rid of simple ones like this one is highly desirable.

BTW, this doesn’t seem related to the question of switching the basic hash function; it concerns the separate methods of hash.

EDIT: I think the issue should be easy to fix, I’m on it.


The underlying issue is that Julia is a very special unicorn with respect to hashing.

It is built on Base.hash_uint, which is appropriately modeled as a random bijection (and is the murmur3 mixing function in a trenchcoat).

From that, Base uses a haphazard mix (no pun intended) of compression functions hash(x::T, h0::UInt) = ..., which are often something linear, a la hash(x, h) = hash(x) - 3h.

This is somewhat advantageous: The superscalar CPU will compute all the different hash_uint invocations in parallel since there is no data dependency!

It is also somewhat risky: The linearity and simplicity of the chaining can expose tons of structural collisions or relations.
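To make the risk concrete, here is a toy Python model of that chaining scheme (this is a sketch, not Julia’s actual implementation: the splitmix64 finalizer stands in for Base.hash_uint, the compression step is the hash(x, h) = hash(x) - 3h form mentioned above, and tuples simply fold the chain over their elements). It reproduces the tuple collision from the opening post, because flattening the nesting doesn’t change the chain:

```python
MASK = (1 << 64) - 1

def mix(x):
    # splitmix64-style finalizer, a stand-in for Base.hash_uint
    x &= MASK
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK
    return x ^ (x >> 31)

def toy_hash(v, h):
    # compression: hash(x, h) = mix(x) - 3h; tuples just fold the chain
    if isinstance(v, tuple):
        for item in v:
            h = toy_hash(item, h)
        return h
    return (mix(v) - 3 * h) & MASK

x, y, z, h0 = 11, 22, 33, 0xDEADBEEF
print(toy_hash((x, y, (z,)), h0) == toy_hash((x, (y, z)), h0))  # True
```

The collision is independent of how strong mix is: both nestings unroll to the exact same sequence of compression steps.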

It is pretty hard to get all three of: trivial extendibility, hidden superscalar parallelism, and no structural relations.

So I’d wait until Stefan chimes in about how intended that hidden superscalar parallelism was in his design, and we’d need some extensive benchmarking to see how critical it is to current performance.


Generally, there is a field that deals with this kind of issue: cryptographic engineering. Out of primitives (e.g. the murmur3 mixing function, or rapidhash’s compression function) you build all the surrounding stuff, in a way that ensures (and proves!) the following property:

If the underlying primitive is ideal, then the resulting construction is also ideal.

Then it’s the job of cryptographers to come up with good primitives; or for us, to plug in something non-cryptographic.

(proving that a primitive is good is beyond us – we can’t even prove P != NP. But relative security proofs are often quite simple, and failure of generating such a proof often points to a real flaw)


PS: While you’re at it, x=>(y=>z) vs (x=>y)=>z is another structural collision.


Fundamentally, the built-in hash function is not intended to be anything close to cryptographically secure, so looking at the cryptography literature may lead one astray here?

Regarding structural collisions, I would say that fundamentally Julia is optimized for containers with mostly concrete types, including Dict and Set types with a concrete type. So a Dict where the key could either be (x,y,(z,)) or (x,(y,z,)) is already deeply suboptimal in Julia even if there is no hash collision, so I’m not sure if it’s worth it to try to remove hash collisions in such cases if that sacrifices performance for the more important concrete-key case.

(IdDict is probably a better choice if you have heterogeneous key types, and won’t have such structural collisions.)


I don’t see how this is related to which hash method is being used at all. The trouble is exactly what @nsajko fixes in the two linked PRs — it’s very beneficial if hash functions do more than just hash the components in order.

It doesn’t matter how anemic or cryptographically secure the hash function is. If hash(p::Pair, h) is defined as simply hash(p.second, hash(p.first, h)), you’re going to see collisions with unrelated structures and other associativities.
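That associativity collision can be seen in a small Python model (a sketch only: 2-tuples stand in for Pair, splitmix64 stands in for Base.hash_uint, and the per-element compression is the linear hash(x, h) = mix(x) - 3h form; the per-type salt PAIR_SEED is a made-up constant for illustration, not anything in Base):

```python
MASK = (1 << 64) - 1
PAIR_SEED = 0x2C1B3C6DC5A3A3F5  # arbitrary per-type salt, invented for this sketch

def mix(x):
    # splitmix64 finalizer, stand-in for Base.hash_uint
    x &= MASK
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK
    return x ^ (x >> 31)

def leaf(x, h):
    return (mix(x) - 3 * h) & MASK  # the linear compression step

def naive(p, h):
    # hash(p::Pair, h) = hash(p.second, hash(p.first, h)), modeled on 2-tuples
    for part in p:
        h = naive(part, h) if isinstance(part, tuple) else leaf(part, h)
    return h

def salted(p, h):
    # same chain, but fold a per-type salt into h first, through the mixer
    h = mix(h ^ PAIR_SEED)
    for part in p:
        h = salted(part, h) if isinstance(part, tuple) else leaf(part, h)
    return h

x, y, z, h0 = 5, 6, 7, 1234
print(naive((x, (y, z)), h0) == naive(((x, y), z), h0))    # True: associativity collision
print(salted((x, (y, z)), h0) == salted(((x, y), z), h0))  # False for these inputs
```

In the naive version both nestings unroll to the same chain of leaf steps; the salted version injects the nesting structure into the state, so the two associativities diverge.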


That is correct, and also true for the tuple hash.


The way we chain hashes often leads to a situation where

hash(items, h0) = sum((-3)^(length(items) - idx) * hash(items[idx]) for idx = 1:length(items)) + (-3)^length(items) * h0

That is, the nonlinear murmur mixing function is applied pointwise, and the output hash is a linear combination, with coefficients from a polynomial.

That is kinda fishy, and I’m not sure how intended that is.
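As a sanity check, this expansion can be verified in a toy Python model (again a sketch: splitmix64 stands in for Base.hash_uint, and the compression step is assumed to be hash(x, h) = hash(x) - 3h). Chaining over a flat tuple agrees with the closed-form polynomial combination, where the item at index idx picks up coefficient (-3)^(n - idx) and the seed picks up (-3)^n:

```python
MASK = (1 << 64) - 1

def mix(x):
    # splitmix64 finalizer, stand-in for Base.hash_uint
    x &= MASK
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK
    return x ^ (x >> 31)

def chained(items, h0):
    h = h0
    for x in items:
        h = (mix(x) - 3 * h) & MASK  # hash(x, h) = hash(x) - 3h
    return h

def closed_form(items, h0):
    # linear combination mod 2^64, coefficients from powers of -3
    n = len(items)
    acc = pow(-3, n, 1 << 64) * h0
    for idx, x in enumerate(items, start=1):
        acc += pow(-3, n - idx, 1 << 64) * mix(x)
    return acc & MASK

items, h0 = (10, 20, 30, 40), 0xABCDEF
print(chained(items, h0) == closed_form(items, h0))  # True
```

So the only nonlinearity sits inside the pointwise mix calls; everything connecting them is a fixed linear map.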

But then, the compression function is defined per item type – and the default

hash(@nospecialize(x), h::UInt) = hash_uint(3h - objectid(x))

does break with that convention (the nonlinearity is applied to h). That’s what I meant by “haphazard”.