Why do hash functions use xor to combine multiple values?

Continuing the discussion from Dictionary lookup errors when the struct contains empty vectors:

Usually in these hash functions one sees xor being used to combine the results of hashing multiple values. I usually use xor as well, but essentially out of cargo cult. So I wanted to ask: why exactly is it xor and not +? At the surface level (and with wrapping arithmetic) they seem very similar; after all, a single-bit add is just an xor that also sets a carry/overflow flag. So does xor originate in systems where integer overflow is bad, is it just mathematical convenience, or is there something inherently “safer” about xor?
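To make the “single-bit add is xor plus carry” remark concrete, here is a tiny check (my own illustration, not from the linked thread):

for a in (0, 1), b in (0, 1)
    # a one-bit add is the xor of the bits (the sum) plus the and of the bits
    # (the carry), shifted up by one position
    @assert a + b == xor(a, b) + 2 * (a & b)
end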

1 Like

Because xor-ing a random value with something else yields another random value. Xor-ing multiple pseudorandom values together can make the result appear even more random.
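One way to see this (an illustration of the point, not part of the original reply): xor with any fixed mask is a bijection, it is its own inverse, so it cannot make a uniformly distributed value any less uniform.

c = 0xdead_beef_dead_beef    # an arbitrary fixed mask
x = 0x0123_4567_89ab_cdef    # stand-in for a (pseudo)random hash value
xor(xor(x, c), c) == x       # true: xor-ing with c is invertible, so it maps
                             # a uniform input to a uniform output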

1 Like

And summing them would cause a “regression to the mean” kind of thing due to the central limit theorem, making the result less random? Is that a valid point of view?

I have also seen the following nested pattern somewhere, can anyone comment on the difference with xor?

struct MyStruct
    a
    b
end

Base.hash(x::MyStruct, h::UInt) = hash(x.a, hash(x.b, h))
2 Likes

I would say that in Julia, with its 2-arg hash, yes, using the second arg is indeed preferable over xor (or +).
You can see the history of Julia’s 2-arg hash in RFC: new approach to efficiently hashing 1, 1.0, big(1), the same. by StefanKarpinski · Pull Request #6624 · JuliaLang/julia · GitHub
When I suggested xor I ironically forgot the whole point of Julia’s 2-arg hash system: to make combining them well defined.

The general property you want from a hash function is that things that are similar or that go together in some sense should hash far apart, unless they are equal.
This is a property that xor gives, because it combines the values in a way that doesn’t preserve any meaning of the bits relative to each other.
In contrast, + preserves some meaning, in that you get a carry whenever both bits at a lower position are 1.

3 Likes

But collisions are not necessarily bad for hashes, right?

In any case, both + (with modular arithmetic, i.e. the overflow behavior we have in Julia) and xor form groups over UInts.
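A quick sanity check of that claim (my own, using Julia’s wrapping arithmetic):

x = 0x0123_4567_89ab_cdef
x + zero(UInt64) == x && xor(x, zero(UInt64)) == x   # 0 is the identity for both
x + (-x) == zero(UInt64)                             # additive inverses exist (wrapping negation)
xor(x, x) == zero(UInt64)                            # every element is its own xor-inverse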

I think that using xor is a historical tradition because in the olden days it was faster than + (you can do it bitwise, which is relevant when the CPU can’t fit the whole hash into a register). Dunno if that is the case still.

To do hashing “properly”, one would have to think about the properties of the hash function and experiment with it, but in practice this is rarely done for user-defined types unless there is a problem that needs to be addressed.

It’s generally good to avoid collisions with hashes - e.g. if you just used reduce(xor, transcode(UInt8, "foo")), you’d get the same result as with reduce(xor, transcode(UInt8, "oof")):

julia> reduce(xor, transcode(UInt8, "foo"))
0x66

julia> reduce(xor, transcode(UInt8, "oof"))
0x66

This is because xor is commutative (the order of the inputs doesn’t matter). The same is of course true of +, so the usual fix is to make the combination non-symmetric, e.g. with hash(a)*3 + hash(b). See this StackOverflow answer for more information.
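For instance, applying that kind of order-sensitive combination to the byte example above (the combine helper is only for illustration):

combine(a, b) = 3a + b   # not commutative, so byte order now matters
reduce(combine, transcode(UInt8, "foo")) == reduce(combine, transcode(UInt8, "oof"))   # false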

5 Likes

Is this really good practice? Wouldn’t it yield identical hashes for any structs (even of different types) with the same values for a and b?

I would think the way to hash a struct should be like this:

struct MyStruct{T,U}
    a::T
    b::U
end

function Base.hash(x::MyStruct, h::UInt)
    seed = hash((:MyStruct, :a, :b))  # ideally computed once, e.g. as a constant
    hash(x.a, hash(x.b, hash(seed, h)))
end
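With a seed like that, two different struct types holding the same field values should no longer collide. A hypothetical comparison (OtherStruct is made up just for this example):

struct OtherStruct{T,U}
    a::T
    b::U
end

function Base.hash(x::OtherStruct, h::UInt)
    seed = hash((:OtherStruct, :a, :b))
    hash(x.a, hash(x.b, hash(seed, h)))
end

hash(MyStruct(1, 2)) == hash(OtherStruct(1, 2))   # almost certainly false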

But perhaps some sort of hashing of the type is already taking place in the one-argument form.

The one-arg form falls back to hash(x, zero(UInt)), which in turn falls back to hash(x, h::UInt) = hash_uint(3h - objectid(x)). objectid in turn does take the type into account:

julia> struct Foo end

julia> struct Bar end

julia> hash(Foo())
0x9e2c647635155a8c

julia> hash(Bar())
0x5669dd55187f2041

Note that there are some specializations for e.g. integers in hash, such that hash(1) === hash(0x1).

The common idiom I see and like here is to also hash the type itself.
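A minimal sketch of what that idiom might look like (the Point type here is hypothetical, just to show the shape):

struct Point
    x::Float64
    y::Float64
end

# Mix the type in first, then fold the fields through the 2-arg hash.
Base.hash(p::Point, h::UInt) = hash(p.x, hash(p.y, hash(Point, h)))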

6 Likes

xor is useful when you want to exploit its symmetry. I used xor (see here) when writing the hash function for a type that represents an undirected graph edge:

julia> using GraphTypes

julia> hash(Edge(1, 2)) == hash(Edge(2, 1))
true
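A self-contained sketch of how such an order-insensitive hash can be written with xor (a hypothetical edge type, not GraphTypes’ actual implementation):

struct UndirectedEdge
    u::Int
    v::Int
end

# Edges compare equal regardless of endpoint order, so the hash must be
# order-insensitive as well; xor-ing the endpoint hashes gives exactly that.
Base.:(==)(a::UndirectedEdge, b::UndirectedEdge) =
    (a.u == b.u && a.v == b.v) || (a.u == b.v && a.v == b.u)

Base.hash(e::UndirectedEdge, h::UInt) = hash(xor(hash(e.u), hash(e.v)), h)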
2 Likes

Another reason:
xor is more easily implemented in hardware (far fewer transistors, i.e. less ‘area’) and sometimes has lower latency (because no carry-propagation circuitry is needed). Therefore it was often used in older designs and very high-throughput designs and… inertia does the rest.

1 Like

6 posts were split to a new topic: Internals of types and how they hash