Why do hash functions use xor to combine multiple values?

abraemer · February 9, 2024, 7:42am

Continuing the discussion from Dictionary lookup errors when the struct contains empty vectors:

Dictionary lookup errors when the struct contains empty vectors

Base.hash(n::Node, h::UInt) = sum(hash, children(n); init=h) 
(I am a little dubious of using sum in hash and would consider reduce(xor, but that is another discussion)

Usually in these hash function one sees xor being used to combine the results of multiple values. I myself usually also use xor but out of cargo cult essentially. So I wanted to ask: Why exactly is it xor and not +? At the surface level (and with wrapping arithmetic) they seem very similar. After all a single bit add is just a xor that sets a carry/overflow flag. So is the origin of xor in systems where integer overflow is bad or just mathematical convenience or is there something inherently “safer” in xor?

Tarny_GG_Channie · February 9, 2024, 8:02am

Because xor-ing a random value with something else yields another random value. Xor-ing multiple pseudorandom random values together can make it appear even more random.

abraemer · February 9, 2024, 8:04am

And summing them would cause a “regression to the mean” kind of thing due to central limit theorem, so make it less random? Is that a valid point of view?

gdalle · February 9, 2024, 8:05am

I have also seen the following nested pattern somewhere, can anyone comment on the difference with xor?

struct MyStruct
    a
    b
end

Base.hash(x::MyStruct, h::UInt) = hash(x.a, hash(x.b, h))

oxinabox · February 9, 2024, 9:47am

I would say in julia with its 2 arg hash that that yes, using the second arg is indeed preferable over xor (or +).
You can see the history of julia’s 2 arg hash in RFC: new approach to efficiently hashing 1, 1.0, big(1), the same. by StefanKarpinski · Pull Request #6624 · JuliaLang/julia · GitHub
when i suggested xor I ironically forgot the whole point of julia’s 2 arg hash system, to make combinging them well defined.

The general property you want from a hashing function is for things that are similar or that go together in some sense should hash far apart, unless they are equal.
This is a property that xor gives cos it is combining them in a way that doesn’t preserve any meaning of the bits relative to each other.
In contrast + preserves some meaning in that you get a carry if both bits at a lower position are 1.

Tamas_Papp · February 9, 2024, 9:58am

But avoiding collisions is not necessarily bad for hashes, right?

In any case, both + (with modulo arithmetic, ie the overflow behavior we have in Julia) and xor form groups over UInts.

I think that using xor is a historical tradition because in the olden days it was faster than + (you can do it bitwise, which is relevant when the CPU can’t fit the whole hash into a register). Dunno if that is the case still.

To do hashing “properly”, one would have to think about the properties of the hash function and experiment with it, but in practice is rarely done for user-defined types unless there is a problem that needs to be addressed.

Sukera · February 9, 2024, 3:24pm

It’s generally good to avoid collisions with hashes - e.g. if you’d just use reduce(xor, transcode(UInt8, "foo")) you’d get the same as reduce(xor, transcode(UInt8, "oof")):

julia> reduce(xor, transcode(UInt8, "foo"))
0x66

julia> reduce(xor, transcode(UInt8, "oof"))
0x66

This is because xor is symmetric. The same is of course true with +, so the usual fix is to make this non-symmetric, e.g. with hash(a)*3 + hash(b). See this StackOverflow answer for more information.

mnemnion · February 9, 2024, 4:09pm

gdalle:

I have also seen the following nested pattern somewhere, can anyone comment on the difference with xor?
struct MyStruct
    a
    b
end

Base.hash(x::MyStruct, h::UInt) = hash(x.a, hash(x.b, h))

Is this really good practice? Wouldn’t it yield identical hash identities for any struct with the same bit values for a and b?

I would think the way to hash a struct should be like this:

struct MyStruct 
    a::T
    b::U
end 

function Base.hash(x::MyStruct, h::UInt)
     seed = hash((:MyStruct, a:, b:), 0)  # compile-time constant 
     hash(x.a, hash(x.b, hash(seed, h)))
end

But perhaps some sort of hashing of the type is already taking place in the one-argument form.

Sukera · February 9, 2024, 4:23pm

The one-arg form falls back to hash(x, zero(UInt)), which in turn falls back to hash(x, h::UInt) = hash_uint(3h - objectid(x)). objectid in turn does take the type into account:

julia> struct Foo end

julia> struct Bar end

julia> hash(Foo())
0x9e2c647635155a8c

julia> hash(Bar())
0x5669dd55187f2041

Note that there are some specializations for e.g. integers in hash, such that hash(1) === hash(0x1).

mbauman · February 9, 2024, 4:33pm

The common idiom I see and like here is to also hash the type itself. E.g.,

github.com

JuliaLang/julia/blob/95233619745ef84c96af82299b162aa008a7debd/base/docs/utils.jl#L96-L96


      
          hash(t::T, h::UInt) where {T<:Union{HTML,Text}} = hash(T, hash(t.content, h))

CameronBieganek · February 9, 2024, 4:57pm

xor is useful when you want to exploit its symmetry. I used xor (see here) when writing the hash function for a type that represents an undirected graph edge:

julia> using GraphTypes

julia> hash(Edge(1, 2)) == hash(Edge(2, 1))
true

Dan · February 9, 2024, 5:15pm

Another reason:
xor is more easily implemented in hardware (much fewer transistors or ‘area’) and sometimes has smaller latency (because no carry calculation circuitry needed). Therefore it was used often in older designs and very high throughput designs and… inertia does the rest.

mbauman · February 9, 2024, 5:58pm

6 posts were split to a new topic: Internals of types and how they hash

Topic		Replies	Views
Why does my hash function return a different value to the built in hash function? General Usage	7	642	March 16, 2022
Internals of types and how they hash Internals & Design	10	412	February 9, 2024
Use of hash code in integer hashing Internals & Design	3	694	June 2, 2022
Allow 3-argument sum in Julia? General Usage	16	1211	May 4, 2022
Why must the second argument of `hash` come from another hash function? General Usage question , hash	5	205	January 15, 2025

Why do hash functions use xor to combine multiple values?

Related topics