Why is using a custom hash function in a Dict so slow?
using BenchmarkTools
import Base: hash, isequal
rtuple() = (trunc(Int, rand() * 10_000), trunc(Int, rand() * 10_000)) # random tuple with entries in [0, 10_000)

# order-insensitive hash: pack both values into one UInt64, smaller value in the low 32 bits
function Base.hash(t::Tuple{Int,Int})
    a = t[1] % UInt64
    b = t[2] % UInt64
    return ifelse(t[1] < t[2], (b << 32) | a, (a << 32) | b) % UInt
end

# order-insensitive equality, consistent with the hash above
function Base.isequal(a::Tuple{Int,Int}, b::Tuple{Int,Int})
    return min(a[1], a[2]) == min(b[1], b[2]) && max(a[1], a[2]) == max(b[1], b[2])
end

function addkeys(d, l)
    for t in l
        d[t] = 1
    end
end

function bench()
    l = [rtuple() for i in 1:100_000]
    @btime hash.($l)
    a = hash.(l)
    display(length(unique(a))) # not many collisions
    d = Dict{Tuple{Int,Int},Int}()
    @btime addkeys($d, $l)
end
bench()
Output with the custom hash function:
37.160 μs (2 allocations: 781.33 KiB)
99906
3.634 s (0 allocations: 0 bytes)

Output with the default hash function:
399.578 μs (2 allocations: 781.33 KiB)
99960
1.445 ms (0 allocations: 0 bytes)
So the custom hash function itself is fast (about 10x faster than the default hash) and produces barely any extra collisions (99906 vs. 99960 unique hashes for 100,000 keys). However, when the same tuples are inserted as Dict keys, it is roughly 2500x slower than the default hash (3.6 s vs. 1.4 ms).
I have tried defining a custom type, as suggested in this discourse thread, but it does not change the results. I also don't think that is needed here, since hash and isequal already dispatch to the Tuple{Int,Int} methods above (Base only defines more general methods for tuples).
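As a quick REPL sketch (not part of the benchmark above), the dispatch claim can be checked with @which and by testing the hash for symmetry:

using InteractiveUtils # for @which when run from a script (loaded automatically in the REPL)

@which hash((1, 2))            # should point at the Tuple{Int,Int} method defined above
@which isequal((1, 2), (2, 1)) # likewise for isequal
hash((3, 5)) == hash((5, 3))   # true with the custom hash; the default tuple hash is order-sensitive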
Thanks
Edit: I should mention that the commutative hashing scheme is not mine; it comes from this stackoverflow post.
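For reference, a minimal sketch of what the commutative scheme gives you: with the hash and isequal methods above, (a, b) and (b, a) behave as the same Dict key.

d = Dict{Tuple{Int,Int},Int}()
d[(3, 5)] = 1
d[(5, 3)]         # returns 1, since (5, 3) hashes and compares equal to (3, 5)
haskey(d, (5, 3)) # true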