Now, I know that the Julia hash function is not particularly safe against collisions and is designed more for speed, but I was still a bit surprised to run into a collision with length-four vectors. Is this to be expected? Does the fact that all elements of a and b are powers of two, and that b is a permutation of a, play a role?
More generally, are there any easy ways to get a more collision-resistant hash function? I have tried the SHA stdlib for this in the past, but I found using SHA to hash Julia objects quite awkward. I’d appreciate any advice!
Looking into the source code, the only differences between hash(::AbstractArray) (for short length) and hash(::Tuple) are that
- they use different seeds (0x7e2d6fb6448beb77 vs 0x77cfa1eef01bca90 for 64-bit Julia)
- hash(::AbstractArray) also takes the axes into account
- hash(::Tuple) traverses the elements in reverse order, while hash(::AbstractArray) uses forward order
Simplified source code
function my_hash(a::Vector)
    h = 0x7e2d6fb6448beb77     # seed for AbstractArray
    h = hash((1,), h)          # first index of each axis
    h = hash((length(a),), h)  # last index of each axis
    for x in a                 # elements in forward order
        h = hash(x, h)
    end
    return h
end
function my_hash(t::Tuple)
    # (This is actually more lines of code than the recursive real version,
    # but it makes it easier to contrast with the Vector version.)
    h = 0x77cfa1eef01bca90     # seed for Tuple
    for x in reverse(t)        # elements in reverse order
        h = hash(x, h)
    end
    return h
end
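As a quick sanity check of these simplified versions, you can call them on a small vector and the corresponding tuple (purely illustrative; the real Base.hash mixes in a bit more state, so the values won't match Base exactly):

v = [0x01, 0x02, 0x04, 0x08]
my_hash(v)          # vector path: array seed, axes, forward traversal
my_hash(Tuple(v))   # tuple path: different seed, reverse traversal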
note that this specific MWE will no longer collide in 1.13 (since hash values will change)
Does the fact that all elements of a and b are powers of two, and that b is a permutation of a play a role?
yes, probably. although the new algorithm continues to be designed for speed and is still not collision-resistant (in the cryptographic sense)
depending on your needs, you may be able to use objectid as a hash function, but it won’t satisfy the same properties w.r.t. a correspondence to == as hash does
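for example, with two freshly allocated but equal arrays:

a = [1, 2, 3]
b = [1, 2, 3]
a == b                       # true
hash(a) == hash(b)           # true: x == y implies hash(x) == hash(y)
objectid(a) == objectid(b)   # false: distinct mutable objects get distinct objectids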
Thanks for all your replies.
I think the collision is related to this line in Base:
# hashing.jl, line 87
hash(x::UInt64, h::UInt) = hash_uint64(x) - 3h
Changing the multiplier from 3 to e.g. 5 seems to resolve the collision (but probably makes other arrays collide instead):
function demohash(a::Vector{<:UInt}, h::UInt=zero(UInt); n)
    # Roughly corresponds to what happens in Base.hash, modulo hash seeds and array axes
    for x in a
        h = Base.hash_uint64(x) - n*h  # Base uses n == 3
    end
    return h
end
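For a length-4 vector this fold unrolls into a small polynomial in the multiplier, which is one way to see why the choice of n matters (sketch for 64-bit Julia, using the internal Base.hash_uint64):

a = UInt64[0x1, 0x2, 0x4, 0x8]   # arbitrary powers of two
hu = Base.hash_uint64
unrolled = hu(a[4]) - 3*hu(a[3]) + 9*hu(a[2]) - 27*hu(a[1])
unrolled == demohash(a; n = 3)   # true: the loop is just this polynomial mod 2^64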
depending on your needs, you may be able to use objectid as a hash function, but it won’t satisfy the same properties w.r.t. a correspondence to == as hash does
Thanks for this. I don’t think objectid will work for my current purpose (I am pretty sure I need the correspondence with ==), but I will keep it in the back of my head for the future.
ah yeah. well luckily that’s also addressed in 1.13. it will become hash(x - 3h) so folding chains will mix better (note that the linear part moves inside the hash call)
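roughly, the difference looks like this (just a sketch of the idea on 64-bit Julia, using hash_uint64 as a stand-in for whatever mixer 1.13 actually uses):

fold_old(x::UInt64, h::UInt) = Base.hash_uint64(x) - 3h   # pre-1.13: mix x, then combine linearly
fold_new(x::UInt64, h::UInt) = Base.hash_uint64(x - 3h)   # 1.13-style: combine first, then mix everything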
The cryptographic hashes all work on byte streams, so you need to serialize an object into a stream of raw bytes before you can hash it.
For example, you could use the Serialization stdlib for this:
import Serialization, SHA

function myhash(x)
    buf = IOBuffer()
    Serialization.serialize(buf, x)                 # turn the object into raw bytes
    reinterpret(UInt64, SHA.sha256(take!(buf)))[1]  # hash the bytes, keep the first 64 bits
end
(Of course, this won’t be particularly fast, but it should be cryptographically strong for a 64-bit hash. You can use UInt128 to have more bits, or even take all 256 bits. Serialization is also not guaranteed to give the same byte stream across Julia versions, so this doesn’t give a version-stable hash if you need that.)
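For instance, a 128-bit variant is the same code with a different reinterpret target (the name is just for illustration):

import Serialization, SHA

function myhash128(x)
    buf = IOBuffer()
    Serialization.serialize(buf, x)
    # sha256 yields 32 bytes; keep the first 16 of them as a UInt128
    reinterpret(UInt128, SHA.sha256(take!(buf)))[1]
end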
Thanks for this suggestion! This is basically what I also used (I think I originally got this from a similar code snippet you posted in a different thread). This worked well for me if the arrays to be hashed are large, but unfortunately I often have to hash a large number of small arrays. For context: I am working on a wrapper for the graph isomorphism tool nauty. For small graphs, the call to myhash can take more time than computing the isomorphism class.
Thinking about this a bit more, I also tried this for vectors of bit types:
hashobj(x) = objectid(Tuple(x))
But this looks kind of dangerous and doesn’t really lead to a meaningful speedup compared to the SHA based myhash except for very small vectors:
I guess the best solution for me would be to use the SHA based myhash for large vectors, and maybe use something similar to the 1.13 hash implementation for small vectors? (For my very specific graph isomorphism use case, I may also be able to use the hashing tools shipped with the new version of nauty.)
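In code, the split I'm imagining is roughly this (names and the cutoff are made up and would need benchmarking; myhash is the SHA-based function from above):

function sizedhash(a::AbstractVector)
    if length(a) >= 64
        return myhash(a)   # SHA + Serialization: slower per call, but strong
    else
        # cheap non-cryptographic fold for small inputs, in the spirit of Base.hash
        h = zero(UInt)
        for x in a
            h = hash(x, h)
        end
        return h
    end
end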
what about KangarooTwelve.jl (https://github.com/tecosaur/KangarooTwelve.jl)? it’s possibly faster than the SHA + serialization approach and only requires that you can convert your data to a Vector{<:Unsigned} (which it looks like it already is?)