Usually in these hash function one sees xor being used to combine the results of multiple values. I myself usually also use xor but out of cargo cult essentially. So I wanted to ask: Why exactly is it xor and not +? At the surface level (and with wrapping arithmetic) they seem very similar. After all a single bit add is just a xor that sets a carry/overflow flag. So is the origin of xor in systems where integer overflow is bad or just mathematical convenience or is there something inherently âsaferâ in xor?
Because xor-ing a random value with something else yields another random value. Xor-ing multiple pseudorandom random values together can make it appear even more random.
And summing them would cause a âregression to the meanâ kind of thing due to central limit theorem, so make it less random? Is that a valid point of view?
The general property you want from a hashing function is for things that are similar or that go together in some sense should hash far apart, unless they are equal.
This is a property that xor gives cos it is combining them in a way that doesnât preserve any meaning of the bits relative to each other.
In contrast + preserves some meaning in that you get a carry if both bits at a lower position are 1.
But avoiding collisions is not necessarily bad for hashes, right?
In any case, both + (with modulo arithmetic, ie the overflow behavior we have in Julia) and xor form groups over UInts.
I think that using xor is a historical tradition because in the olden days it was faster than + (you can do it bitwise, which is relevant when the CPU canât fit the whole hash into a register). Dunno if that is the case still.
To do hashing âproperlyâ, one would have to think about the properties of the hash function and experiment with it, but in practice is rarely done for user-defined types unless there is a problem that needs to be addressed.
Itâs generally good to avoid collisions with hashes - e.g. if youâd just use reduce(xor, transcode(UInt8, "foo")) youâd get the same as reduce(xor, transcode(UInt8, "oof")):
This is because xor is symmetric. The same is of course true with +, so the usual fix is to make this non-symmetric, e.g. with hash(a)*3 + hash(b). See this StackOverflow answer for more information.
The one-arg form falls back to hash(x, zero(UInt)), which in turn falls back to hash(x, h::UInt) = hash_uint(3h - objectid(x)). objectid in turn does take the type into account:
julia> struct Foo end
julia> struct Bar end
julia> hash(Foo())
0x9e2c647635155a8c
julia> hash(Bar())
0x5669dd55187f2041
Note that there are some specializations for e.g. integers in hash, such that hash(1) === hash(0x1).
xor is useful when you want to exploit its symmetry. I used xor(see here) when writing the hash function for a type that represents an undirected graph edge:
julia> using GraphTypes
julia> hash(Edge(1, 2)) == hash(Edge(2, 1))
true
Another reason:
xor is more easily implemented in hardware (much fewer transistors or âareaâ) and sometimes has smaller latency (because no carry calculation circuitry needed). Therefore it was used often in older designs and very high throughput designs and⌠inertia does the rest.