# Why do hash functions use xor to combine multiple values?

Continuing the discussion from Dictionary lookup errors when the struct contains empty vectors:

Usually in these hash function one sees `xor` being used to combine the results of multiple values. I myself usually also use `xor` but out of cargo cult essentially. So I wanted to ask: Why exactly is it `xor` and not `+`? At the surface level (and with wrapping arithmetic) they seem very similar. After all a single bit add is just a `xor` that sets a carry/overflow flag. So is the origin of `xor` in systems where integer overflow is bad or just mathematical convenience or is there something inherently âsaferâ in `xor`?

1 Like

Because xor-ing a random value with something else yields another random value. Xor-ing multiple pseudorandom random values together can make it appear even more random.

1 Like

And summing them would cause a âregression to the meanâ kind of thing due to central limit theorem, so make it less random? Is that a valid point of view?

I have also seen the following nested pattern somewhere, can anyone comment on the difference with `xor`?

``````struct MyStruct
a
b
end

Base.hash(x::MyStruct, h::UInt) = hash(x.a, hash(x.b, h))
``````
2 Likes

I would say in julia with its 2 arg hash that that yes, using the second arg is indeed preferable over `xor` (or `+`).
You can see the history of juliaâs 2 arg hash in RFC: new approach to efficiently hashing 1, 1.0, big(1), the same. by StefanKarpinski Âˇ Pull Request #6624 Âˇ JuliaLang/julia Âˇ GitHub
when i suggested xor I ironically forgot the whole point of juliaâs 2 arg hash system, to make combinging them well defined.

The general property you want from a hashing function is for things that are similar or that go together in some sense should hash far apart, unless they are equal.
This is a property that `xor` gives cos it is combining them in a way that doesnât preserve any meaning of the bits relative to each other.
In contrast + preserves some meaning in that you get a carry if both bits at a lower position are 1.

3 Likes

But avoiding collisions is not necessarily bad for hashes, right?

In any case, both `+` (with modulo arithmetic, ie the overflow behavior we have in Julia) and `xor` form groups over `UInt`s.

I think that using `xor` is a historical tradition because in the olden days it was faster than `+` (you can do it bitwise, which is relevant when the CPU canât fit the whole hash into a register). Dunno if that is the case still.

To do hashing âproperlyâ, one would have to think about the properties of the hash function and experiment with it, but in practice is rarely done for user-defined types unless there is a problem that needs to be addressed.

Itâs generally good to avoid collisions with hashes - e.g. if youâd just use `reduce(xor, transcode(UInt8, "foo"))` youâd get the same as `reduce(xor, transcode(UInt8, "oof"))`:

``````julia> reduce(xor, transcode(UInt8, "foo"))
0x66

julia> reduce(xor, transcode(UInt8, "oof"))
0x66
``````

This is because `xor` is symmetric. The same is of course true with `+`, so the usual fix is to make this non-symmetric, e.g. with `hash(a)*3 + hash(b)`. See this StackOverflow answer for more information.

5 Likes

Is this really good practice? Wouldnât it yield identical hash identities for any struct with the same bit values for `a` and `b`?

I would think the way to hash a struct should be like this:

``````struct MyStruct
a::T
b::U
end

function Base.hash(x::MyStruct, h::UInt)
seed = hash((:MyStruct, a:, b:), 0)  # compile-time constant
hash(x.a, hash(x.b, hash(seed, h)))
end
``````

But perhaps some sort of hashing of the type is already taking place in the one-argument form.

The one-arg form falls back to `hash(x, zero(UInt))`, which in turn falls back to `hash(x, h::UInt) = hash_uint(3h - objectid(x))`. `objectid` in turn does take the type into account:

``````julia> struct Foo end

julia> struct Bar end

julia> hash(Foo())
0x9e2c647635155a8c

julia> hash(Bar())
0x5669dd55187f2041
``````

Note that there are some specializations for e.g. integers in `hash`, such that `hash(1) === hash(0x1)`.

The common idiom I see and like here is to also hash the type itself. E.g.,

6 Likes

`xor` is useful when you want to exploit its symmetry. I used `xor` (see here) when writing the `hash` function for a type that represents an undirected graph edge:

``````julia> using GraphTypes

julia> hash(Edge(1, 2)) == hash(Edge(2, 1))
true
``````
2 Likes

Another reason:
xor is more easily implemented in hardware (much fewer transistors or âareaâ) and sometimes has smaller latency (because no carry calculation circuitry needed). Therefore it was used often in older designs and very high throughput designs andâŚ inertia does the rest.

1 Like

6 posts were split to a new topic: Internals of types and how they hash