Custom hash function in dict is very slow?

Why is using a custom hash function in a Dict so slow?

using BenchmarkTools
import Base: hash, isequal
rtuple() = (trunc(Int,rand()*10_000),trunc(Int,rand()*10_000)) #make random tuple in [0,10_000)

function Base.hash(t::Tuple{Int,Int})
    a = t[1] % UInt64
    b = t[2] % UInt64
    return ifelse(t[1] < t[2], (b << 32) | a, (a << 32) | b) % UInt
end
function Base.isequal(a::Tuple{Int,Int},b::Tuple{Int,Int})
    return min(a[1],a[2]) == min(b[1],b[2]) && max(a[1],a[2]) == max(b[1],b[2]) 
end
function addkeys(d,l)
    for t in l
        d[t] = 1
    end
end
function bench()
    l = [rtuple() for i in 1:100_000]
    @btime hash.($l)
    a = hash.(l)

    display(length(unique(a))) #not many collisions
    d = Dict{Tuple{Int,Int},Int}()
    
    @btime addkeys($d,$l)

end

bench()

output:

  37.160 μs (2 allocations: 781.33 KiB)
99906
  3.634 s (0 allocations: 0 bytes)

with the default hash function

  399.578 μs (2 allocations: 781.33 KiB)
99960
  1.445 ms (0 allocations: 0 bytes)

So, you can see that this hash function itself is very fast (about 10x faster than the default hash) and does not appear to produce many collisions. However, it is roughly 2500x slower than the default hash when used in a Dict.

I have tried defining a custom type, as suggested in this Discourse thread, but it does not change the results. I also don’t think that is needed here, since hash and isequal do dispatch correctly for Tuple{Int,Int}: Base only defines more general methods.

Thanks

Edit: I should mention that this is not my commutative algorithm; it comes from this Stack Overflow post.

Gauthier Soleilhac found on Slack that there are actually many collisions with this function.

That raises the question: shouldn’t it be possible to use a commutative hash of Int tuples without a slowdown of several orders of magnitude?

I have tried defining a custom type, as suggested in this Discourse thread, but it does not change the results. I also don’t think that is needed here, since hash and isequal do dispatch correctly for Tuple{Int,Int}: Base only defines more general methods.

The issue is not that dispatch is wrong; the issue is that some other unsuspecting code may test (a,b) == (c, d) for some tuple of integers. Your redefinition of Base.isequal now changes the behavior of this code. This is called type piracy.
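The piracy can be avoided by wrapping the pair in a type you own, as the Discourse thread mentioned in the question suggests. A minimal sketch (the UPair name and these method definitions are mine, not from the thread):

```julia
# Hypothetical wrapper type: defining hash/isequal on a struct we own
# is not type piracy, so other code using plain tuples is unaffected.
struct UPair
    a::Int
    b::Int
end

# Two-argument hash, as Dict expects; deferring to Base's tuple hash
# keeps the bits well mixed.
Base.hash(p::UPair, h::UInt) = hash(minmax(p.a, p.b), h)
Base.isequal(p::UPair, q::UPair) = minmax(p.a, p.b) == minmax(q.a, q.b)
Base.:(==)(p::UPair, q::UPair) = isequal(p, q)

d = Dict{UPair,Int}()
d[UPair(3, 7)] = 1
d[UPair(7, 3)] = 2   # same unordered pair, overwrites the first entry
length(d)            # 1
```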

… collisions …

The issue is that Julia's hash has a different contract than C++'s hash: in C++, the hash function is supposed to minimize collisions, and nothing else. So e.g. hash(x::Int)::Int = x would be a reasonable hash in C++. This means that you cannot simply truncate the hash provided by an unknown user-provided type; you need to apply a random function (e.g. another hash function) on top of it. IMO this is bloody awful API design.

In Julia, the contract of hash is that it should (1) minimize collisions, and (2) be reasonably random.

This is important, because Dict addresses its hashtable by just truncating to the lower bits of your user-provided hash value (how many bits are taken depends on the size of the hashtable). Since your hashtable is pretty small, only the lower bits of min(t[1], t[2]) will be used, and max(t[1], t[2]) will not go into the actual addressing.
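To see this concretely, here is a small sketch using the packing from the original post (renamed commhash to avoid redefining Base.hash): every tuple with the same minimum lands in the same slot of a small table, because only the low bits are used for addressing.

```julia
# Same packing as the hash in the question: min(t) goes in the low 32 bits,
# max(t) in the high 32 bits.
function commhash(t::Tuple{Int,Int})
    a = t[1] % UInt64
    b = t[2] % UInt64
    return ifelse(t[1] < t[2], (b << 32) | a, (a << 32) | b) % UInt
end

# A table of 2^16 slots only looks at the low 16 bits -- i.e. only at min(t).
mask = UInt(2^16 - 1)
slot(t) = commhash(t) & mask

slot((42, 1_000)) == slot((42, 9_999))   # true: max(t) never reaches the addressing
```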


The issue is not that dispatch is wrong; the issue is that some other unsuspecting code may test (a,b) == (c, d) for some tuple of integers. Your redefinition of Base.isequal now changes the behavior of this code. This is called type piracy.

Right. That makes sense.

In Julia, the contract of hash is that it should (1) minimize collisions, and (2) be reasonably random.

Ah I see. Thanks!
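For completeness, a piracy-free way to get a commutative lookup while keeping Julia's well-mixed default tuple hash is to normalize the key with minmax instead of overriding hash at all. A sketch (the setuns!/getuns helper names are hypothetical, not from the thread):

```julia
# Normalize the key so (3, 7) and (7, 3) map to the same stored tuple;
# the default tuple hash then handles mixing and collisions.
d = Dict{Tuple{Int,Int},Int}()
setuns!(d, t, v) = (d[minmax(t[1], t[2])] = v)
getuns(d, t) = d[minmax(t[1], t[2])]

setuns!(d, (3, 7), 1)
setuns!(d, (7, 3), 2)   # same unordered pair, overwrites the first entry
getuns(d, (3, 7))       # 2
length(d)               # 1
```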