Really Fast Hash Set and Map

Does Julia have fast generic implementations of hashed data structures, that is, hash tables and hash sets?

Is the built-in Dict() as fast as its counterparts in languages such as C++, Rust and D?

Are there alternatives, or examples of collections I could modify, to give Julia such a collection?

I’m currently interested in a hash table/set with open addressing.
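To be concrete, I mean something like the toy linear-probing set below (fixed capacity, no resizing; a sketch only, and the names are made up, not an existing package):

# Toy open-addressing (linear-probing) integer set: keys live in one flat
# array; a collision steps to the next slot instead of chaining.
# Uses 0 as the "empty" marker; fixed capacity, no resizing.
struct ProbeSet
    slots::Vector{Int}
end

ProbeSet(capacity::Int) = ProbeSet(zeros(Int, capacity))

function Base.in(key::Int, s::ProbeSet)
    n = length(s.slots)
    i = Int(mod(hash(key), n)) + 1   # initial slot (1-based)
    while s.slots[i] != 0            # probe until an empty slot
        s.slots[i] == key && return true
        i = mod(i, n) + 1            # wrap-around linear probe
    end
    return false
end

function Base.push!(s::ProbeSet, key::Int)
    n = length(s.slots)
    i = Int(mod(hash(key), n)) + 1
    while s.slots[i] != 0 && s.slots[i] != key
        i = mod(i, n) + 1            # assumes the table never fills up
    end
    s.slots[i] = key
    return s
end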

How are you measuring this? Can you share some benchmark results?

julia> using BenchmarkTools

julia> dict = Dict((1:10^5) .=> rand(10^5));

julia> @benchmark dict[100]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     23.364 ns (0.00% GC)
  median time:      23.941 ns (0.00% GC)
  mean time:        30.599 ns (17.99% GC)
  maximum time:     40.852 μs (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> dict = Dict([randstring(10) for i ∈ 1:10^5] .=> rand(10^5));

julia> typeof(ans)
Dict{String,Float64}

julia> @benchmark dict[$(first(keys(dict)))]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     39.205 ns (0.00% GC)
  median time:      39.570 ns (0.00% GC)
  mean time:        46.894 ns (13.25% GC)
  maximum time:     46.967 μs (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     992

That seems pretty damn fast; I’m really doubtful that there are any implementations out there much faster than that, except possibly object ID dicts (which are available in Julia, by the way). This is also on a rather slow laptop CPU. Granted, this doesn’t necessarily guarantee that every object has efficient hashing; perhaps there are some common types with poor hash implementations?
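For reference, an identity-keyed dict looks like this (a sketch only; it is spelled ObjectIdDict on the older Julia used in this thread):

od = IdDict{Int,Float64}()   # hashes by objectid instead of the generic hash()
for i in 1:10^5
    od[i] = rand()           # same key => value mapping as the Dict above
end
od[100]                      # lookup by identity rather than by value equality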

What does dict[100] do?

Ahh, dict[100] looks up the value of the element with key 100. I could have guessed that…

How can I check which hash function is used here?

The fastest hash-table lookups with open addressing in C++, Rust and D, using an FNV hasher, are around 3-10 ns.
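For reference, this is the kind of hasher I mean; a minimal 64-bit FNV-1a sketch in Julia:

# 64-bit FNV-1a over a byte sequence: xor each byte into the state,
# then multiply by the FNV prime (wrapping UInt64 arithmetic).
function fnv1a(bytes)
    h = 0xcbf29ce484222325                        # FNV-1a 64-bit offset basis
    for b in bytes
        h = (h ⊻ UInt64(b)) * 0x00000100000001b3  # xor, then FNV prime
    end
    return h
end

fnv1a(codeunits("hello"))   # hash the bytes of a string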

It really depends on what you hash - there are different implementations for different types. You can check them here and here. There may be other reasons besides performance for why each implementation was chosen - @StefanKarpinski may be able to shed some light on that, based on the git blame for those files.
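You can also locate the method for a particular key type straight from the REPL (standard introspection, nothing Dict-specific):

@which hash(1)       # shows which method handles hashing an Int
@which hash("abc")   # and which handles a String
@edit hash(1)        # opens that implementation in your editor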

I have used quite large Dicts and ObjectIdDicts, and they were reasonably fast for the task (I have not benchmarked other hash tables to compare), with the exception of rehashing an ObjectIdDict, which takes hours once it reaches a certain size. sizehint! is your friend here if you know the approximate size of your Dict.
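For example (the target size here is made up):

d = Dict{Int,Float64}()
sizehint!(d, 10^6)   # reserve capacity up front so growth doesn't trigger rehashing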

Hm, note also

julia> A = rand(5);

julia> using BenchmarkTools

julia> @benchmark A[$1]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     20.913 ns (0.00% GC)
  median time:      21.449 ns (0.00% GC)
  mean time:        28.138 ns (19.46% GC)
  maximum time:     40.808 μs (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     996

I seem to remember at some point understanding the reason this is slow, but right now I can’t recall what it was, and it does strike me as slow. Perhaps I’m just wrong and that’s a reasonable number.

On the bright side, it seems like hashing alone is really damn fast.

yep, hashing alone is fast:

julia> @benchmark hash(rand(Int))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.000 ns (0.00% GC)
  median time:      9.000 ns (0.00% GC)
  mean time:        9.426 ns (0.00% GC)
  maximum time:     387.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Note, though, that this includes the overhead of calling rand(Int). With a static number we get this (not really a representative benchmark):

julia> x = rand(Int)
3888178585692376031

julia> @benchmark hash($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.000 ns (0.00% GC)
  median time:      4.000 ns (0.00% GC)
  mean time:        3.910 ns (0.00% GC)
  maximum time:     27.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

dict and A should also be interpolated when benchmarking:

julia> dict = Dict((1:10^5) .=> rand(10^5));

julia> @btime dict[100];
  19.629 ns (1 allocation: 16 bytes)

julia> @btime $dict[100];
  5.420 ns (0 allocations: 0 bytes)

julia> A=rand(5);

julia> @btime A[1];
  14.997 ns (1 allocation: 16 bytes)

julia> @btime $A[1];
  1.303 ns (0 allocations: 0 bytes)

Cool. Can somebody please explain what interpolated means in this context? I only know about numerical interpolation and string interpolation in Julia. I am guessing it has to do with scope and local vs. global variables.

Further, what’s the difference between @btime and the normal @time in this context?

Also, how can I check which hashing algorithm is used by default for different types?

@time literally just puts a timer around your code, so you might be including times other than the execution of your code, especially compile time. You’ve probably noticed that @time reports much faster times after you run it on the same function a couple of times. @btime ensures that it only measures the actual execution of the code.
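A quick illustration (timings omitted, since they vary by machine):

using BenchmarkTools

f(x) = x + 1

@time f(1)    # first call: includes JIT-compiling f
@time f(1)    # second call: mostly just execution (plus timer overhead)
@btime f(1)   # many samples; reports the minimum, with compilation excluded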

In this context an interpolated value means a value that was computed before it goes into the benchmark. For example, @btime x[$(rand(1:3))] executes rand before benchmarking, stores the value, and uses that value as input to the benchmark.

Also, the @edit macro is extremely useful. Try, for example, @edit rand().

In this context it’s expression interpolation, a metaprogramming concept. Here $A basically says “instead of using a reference to the variable A, just replace it with its value”.

julia> A = 2
2

julia> :(A + 2)
:(A + 2)

julia> :($A + 2)
:(2 + 2)

I think the BenchmarkTools.jl README gives a good first impression: https://github.com/JuliaCI/BenchmarkTools.jl

Yes, and in particular it is used by @btime to allow you to benchmark expressions involving global variables without paying the usual performance price in Julia for working with globals.

Without interpolation, an expression like @btime dict[3] includes the time to look up the type of dict and determine (dynamically) the correct getindex method to call for getindex(dict, 3) (= dict[3]), because dict is a non-constant global variable. This is typically not what you want, because real performance-sensitive code in Julia normally occurs in functions and acts on local variables whose types are known to the compiler. By doing @btime $dict[3], you interpolate the value of dict into the expression, allowing the @btime macro to time it with the type known at compile time.
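To make that concrete, interpolating gives you the same timing you would see inside a function, where the compiler knows the types of locals (a sketch; the helper name is made up):

using BenchmarkTools

dict = Dict((1:10^5) .=> rand(10^5))

lookup(d, k) = d[k]   # hot code normally lives in functions like this

@btime $dict[100]          # interpolated global: type fixed when the benchmark compiles
@btime lookup($dict, 100)  # equivalent: local variable with known type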

Check the links to GitHub I posted before; you should find the implementations there. As far as I’m aware, there’s no formal description of which algorithm is used precisely, or why.

There is still a need to substantiate this.

My impression is that it was merely a user error in benchmarking.

It’s worth looking into thoroughly though.