Now, I know that the Julia hash function is not particularly safe against collisions and is designed more for speed, but I was still a bit surprised to run into a collision with length-four vectors. Is this to be expected? Does the fact that all elements of a and b are powers of two, and that b is a permutation of a, play a role?
More generally, are there any easy ways to get a more collision-resistant hash function? I have tried the SHA stdlib for this in the past, but I found using SHA to hash Julia objects quite awkward. I’d appreciate any advice!
Looking into the source code, the only differences between hash(::AbstractArray) (for short length) and hash(::Tuple) are that
- they use different seeds (0x7e2d6fb6448beb77 vs 0x77cfa1eef01bca90 for 64-bit Julia)
- hash(::AbstractArray) also takes the axes into account
- hash(::Tuple) traverses the elements in reverse order, while hash(::AbstractArray) uses forward order
Simplified source code
function my_hash(a::Vector)
    h = 0x7e2d6fb6448beb77     # seed for AbstractArray
    h = hash((1,), h)          # first index of each axis
    h = hash((length(a),), h)  # last index of each axis
    for x in a                 # elements in forward order
        h = hash(x, h)
    end
    return h
end
function my_hash(t::Tuple)
    # (This is actually more lines of code than the recursive real version,
    # but it makes it easier to contrast with the Vector version.)
    h = 0x77cfa1eef01bca90     # seed for Tuple
    for x in reverse(t)        # elements in reverse order
        h = hash(x, h)
    end
    return h
end
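As a quick sanity check of these simplified versions, you can call them on a small vector and the corresponding tuple (purely illustrative; the real Base.hash mixes in a bit more state, so the values won't match Base exactly):

v = [0x01, 0x02, 0x04, 0x08]
my_hash(v)          # vector path: array seed, axes, forward traversal
my_hash(Tuple(v))   # tuple path: different seed, reverse traversal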
note that this specific MWE will no longer collide in 1.13 (since hash values will change)
Does the fact that all elements of a and b are powers of two, and that b is a permutation of a play a role?
yes, probably. although the new algorithm continues to be designed for speed and is still not collision-resistant (in the cryptographic sense)
depending on your needs, you may be able to use objectid as a hash function, but it won’t satisfy the same properties w.r.t. a correspondence to == as hash does
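for example, with two freshly allocated but equal arrays:

a = [1, 2, 3]
b = [1, 2, 3]
a == b                       # true
hash(a) == hash(b)           # true: x == y implies hash(x) == hash(y)
objectid(a) == objectid(b)   # false: distinct mutable objects get distinct objectids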
Thanks for all your replies.
I think the collision is related to this line in Base:
# hashing.jl, line 87
hash(x::UInt64, h::UInt) = hash_uint64(x) - 3h
Changing the multiplier from 3 to e.g. 5 seems to resolve the collision (but probably makes other arrays collide instead):
function demohash(a::Vector{<:UInt}, h::UInt=zero(UInt); n)
    # Roughly corresponds to what happens in Base.hash, modulo hash seeds and array axes
    for x in a
        h = Base.hash_uint64(x) - n*h  # Base uses n == 3
    end
    return h
end
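For a length-4 vector this fold unrolls into a small polynomial in the multiplier, which is one way to see why the choice of n matters (sketch for 64-bit Julia, using the internal Base.hash_uint64):

a = UInt64[0x1, 0x2, 0x4, 0x8]   # arbitrary powers of two
hu = Base.hash_uint64
unrolled = hu(a[4]) - 3*hu(a[3]) + 9*hu(a[2]) - 27*hu(a[1])
unrolled == demohash(a; n = 3)   # true: the loop is just this polynomial mod 2^64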
depending on your needs, you may be able to use objectid as a hash function, but it won’t satisfy the same properties w.r.t. a correspondence to == as hash does
Thanks for this. I don’t think objectid will work for my current purpose (I am pretty sure I need the correspondence with ==), but I will keep it in the back of my head for the future.
ah yeah. well luckily that’s also addressed in 1.13. it will become hash(x - 3h) so folding chains will mix better (note that the linear part moves inside the hash call)
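roughly, the difference looks like this (just a sketch of the idea on 64-bit Julia, using hash_uint64 as a stand-in for whatever mixer 1.13 actually uses):

fold_old(x::UInt64, h::UInt) = Base.hash_uint64(x) - 3h   # pre-1.13: mix x, then combine linearly
fold_new(x::UInt64, h::UInt) = Base.hash_uint64(x - 3h)   # 1.13-style: combine first, then mix everything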
The cryptographic hashes all work on byte streams, so you need to serialize an object into a stream of raw bytes before you can hash it.
For example, you could use the Serialization stdlib for this:
import Serialization, SHA

function myhash(x)
    buf = IOBuffer()
    Serialization.serialize(buf, x)                 # turn the object into raw bytes
    reinterpret(UInt64, SHA.sha256(take!(buf)))[1]  # hash the bytes, keep the first 64 bits
end
(Of course, this won’t be particularly fast, but it should be cryptographically strong for a 64-bit hash. You can use UInt128 to have more bits, or even take all 256 bits. Serialization is also not guaranteed to give the same byte stream across Julia versions, so this doesn’t give a version-stable hash if you need that.)
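For instance, a 128-bit variant is the same code with a different reinterpret target (the name is just for illustration):

import Serialization, SHA

function myhash128(x)
    buf = IOBuffer()
    Serialization.serialize(buf, x)
    # sha256 yields 32 bytes; keep the first 16 of them as a UInt128
    reinterpret(UInt128, SHA.sha256(take!(buf)))[1]
end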
Thanks for this suggestion! This is basically what I also used (I think I originally got this from a similar code snippet you posted in a different thread). This worked well for me if the arrays to be hashed are large, but unfortunately I often have to hash a large number of small arrays. For context: I am working on a wrapper for the graph isomorphism tool nauty. For small graphs, the call to myhash can take more time than computing the isomorphism class.
Thinking about this a bit more, I also tried this for vectors of bit types:
hashobj(x) = objectid(Tuple(x))
But this looks kind of dangerous and doesn’t really lead to a meaningful speedup compared to the SHA based myhash except for very small vectors:
I guess the best solution for me would be to use the SHA based myhash for large vectors, and maybe use something similar to the 1.13 hash implementation for small vectors? (For my very specific graph isomorphism use case, I may also be able to use the hashing tools shipped with the new version of nauty.)
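In code, the split I'm imagining is roughly this (names and the cutoff are made up and would need benchmarking; myhash is the SHA-based function from above):

function sizedhash(a::AbstractVector)
    if length(a) >= 64
        return myhash(a)   # SHA + Serialization: slower per call, but strong
    else
        # cheap non-cryptographic fold for small inputs, in the spirit of Base.hash
        h = zero(UInt)
        for x in a
            h = hash(x, h)
        end
        return h
    end
end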
what about KangarooTwelve.jl (https://github.com/tecosaur/KangarooTwelve.jl)? it’s possibly faster than the SHA + serialization approach and only requires that you can convert your data to a Vector{<:Unsigned} (which it looks like it already is?)