Really Fast Hash Set and Map

Does Julia have fast generic implementations of hashed data structures, that is, hash tables and hash sets?

Is the built-in Dict() as fast as its counterparts in languages such as C++, Rust and D?

Are there alternatives, or examples of collections I could modify, to give Julia such a collection?

I’m currently interested in a hash table/set with open addressing.
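To be concrete, I mean something like the toy linear-probing set below (fixed capacity, no resizing; a sketch only, and the names are made up, not an existing package):

# Toy open-addressing (linear-probing) integer set: keys live in one flat
# array; a collision steps to the next slot instead of chaining.
# Uses 0 as the "empty" marker; fixed capacity, no resizing.
struct ProbeSet
    slots::Vector{Int}
end

ProbeSet(capacity::Int) = ProbeSet(zeros(Int, capacity))

function Base.in(key::Int, s::ProbeSet)
    n = length(s.slots)
    i = Int(mod(hash(key), n)) + 1   # initial slot (1-based)
    while s.slots[i] != 0            # probe until an empty slot
        s.slots[i] == key && return true
        i = mod(i, n) + 1            # wrap-around linear probe
    end
    return false
end

function Base.push!(s::ProbeSet, key::Int)
    n = length(s.slots)
    i = Int(mod(hash(key), n)) + 1
    while s.slots[i] != 0 && s.slots[i] != key
        i = mod(i, n) + 1            # assumes the table never fills up
    end
    s.slots[i] = key
    return s
end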

How are you measuring this? Can you share some benchmark results?

julia> using BenchmarkTools

julia> dict = Dict((1:10^5) .=> rand(10^5));

julia> @benchmark dict[100]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     23.364 ns (0.00% GC)
  median time:      23.941 ns (0.00% GC)
  mean time:        30.599 ns (17.99% GC)
  maximum time:     40.852 μs (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> dict = Dict([randstring(10) for i ∈ 1:10^5] .=> rand(10^5));

julia> typeof(ans)
Dict{String,Float64}

julia> @benchmark dict[$(first(keys(dict)))]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     39.205 ns (0.00% GC)
  median time:      39.570 ns (0.00% GC)
  mean time:        46.894 ns (13.25% GC)
  maximum time:     46.967 μs (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     992

That seems pretty damn fast; I’m really doubtful that there are any implementations out there much faster than that, except possibly object ID dicts (which are available in Julia, by the way). This is also on a rather slow laptop CPU. Granted, this doesn’t necessarily guarantee that every object has efficient hashing; perhaps there are some common types with poor hash implementations?
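For reference, an identity-keyed dict looks like this (a sketch only; it is spelled ObjectIdDict on the older Julia used in this thread):

od = IdDict{Int,Float64}()   # hashes by objectid instead of the generic hash()
for i in 1:10^5
    od[i] = rand()           # same key => value mapping as the Dict above
end
od[100]                      # lookup by identity rather than by value equality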

What does dict[100] do?

Ahh, dict[100] looks up the value of the element with key 100. I could have guessed that…

How can I check which hash function is used here?

The fastest hash-table lookups with open addressing in C++, Rust and D, using an FNV hasher, are around 3-10 ns.
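For reference, this is the kind of hasher I mean; a minimal 64-bit FNV-1a sketch in Julia:

# 64-bit FNV-1a over a byte sequence: xor each byte into the state,
# then multiply by the FNV prime (wrapping UInt64 arithmetic).
function fnv1a(bytes)
    h = 0xcbf29ce484222325                        # FNV-1a 64-bit offset basis
    for b in bytes
        h = (h ⊻ UInt64(b)) * 0x00000100000001b3  # xor, then FNV prime
    end
    return h
end

fnv1a(codeunits("hello"))   # hash the bytes of a string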

It really depends on what you hash - there are different implementations for different types. You can check them here and here. There may be other reasons besides performance for why each implementation was chosen - @StefanKarpinski may be able to shed some light on that, based on the git blame for those files.
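You can also locate the method for a particular key type straight from the REPL (standard introspection, nothing Dict-specific):

@which hash(1)       # shows which method handles hashing an Int
@which hash("abc")   # and which handles a String
@edit hash(1)        # opens that implementation in your editor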

I have used quite large Dicts and ObjectIdDicts, and they were reasonably fast for the task (I have not benchmarked other hash tables to compare), with the exception of rehashing an ObjectIdDict, which takes hours once it reaches a certain size. sizehint! is your friend here if you know the approximate size of your Dict.
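For example (the target size here is made up):

d = Dict{Int,Float64}()
sizehint!(d, 10^6)   # reserve capacity up front so growth doesn't trigger rehashing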

Hm, note also

julia> A = rand(5);

julia> using BenchmarkTools

julia> @benchmark A[$1]
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     20.913 ns (0.00% GC)
  median time:      21.449 ns (0.00% GC)
  mean time:        28.138 ns (19.46% GC)
  maximum time:     40.808 μs (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     996

I seem to remember at some point understanding the reason this is slow, but right now I can’t recall what it was, and it does strike me as slow. Perhaps I’m just wrong and that’s a reasonable number.

On the bright side, it seems like hashing alone is really damn fast.

yep, hashing alone is fast:

julia> @benchmark hash(rand(Int))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.000 ns (0.00% GC)
  median time:      9.000 ns (0.00% GC)
  mean time:        9.426 ns (0.00% GC)
  maximum time:     387.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Note, though, that this includes the overhead of calling rand(Int). With a static number we get this (not really a representative benchmark):

julia> x = rand(Int)
3888178585692376031

julia> @benchmark hash($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.000 ns (0.00% GC)
  median time:      4.000 ns (0.00% GC)
  mean time:        3.910 ns (0.00% GC)
  maximum time:     27.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

dict and A should also be interpolated when benchmarking:

julia> dict = Dict((1:10^5) .=> rand(10^5));

julia> @btime dict[100];
  19.629 ns (1 allocation: 16 bytes)

julia> @btime $dict[100];
  5.420 ns (0 allocations: 0 bytes)

julia> A=rand(5);

julia> @btime A[1];
  14.997 ns (1 allocation: 16 bytes)

julia> @btime $A[1];
  1.303 ns (0 allocations: 0 bytes)

Cool. Can somebody please explain what interpolated means in this context? I only know about numerical interpolation and string interpolation in Julia. I am guessing it has to do with scope and local vs. global variables.

Further, what’s the difference between @btime and the normal @time in this context?

Also, how can I check which hashing algorithm is used by default for different types?

@time literally just puts a timer around your code, so you might be including times other than the execution of your code, especially compile time. You’ve probably noticed that @time reports much faster times after you run it on the same function a couple of times. @btime ensures that it only measures the actual execution of the code.
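A quick illustration (timings omitted, since they vary by machine):

using BenchmarkTools

f(x) = x + 1

@time f(1)    # first call: includes JIT-compiling f
@time f(1)    # second call: mostly just execution (plus timer overhead)
@btime f(1)   # many samples; reports the minimum, with compilation excluded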

In this context an interpolated value means a value that was computed before it goes into the benchmark. For example, @btime x[$(rand(1:3))] executes rand before benchmarking, stores the value, and uses that value as input to the benchmark.

Also, the @edit macro is extremely useful. Try, for example, @edit rand().

In this context it’s expression interpolation, a metaprogramming concept. Here $A basically says “instead of using a reference to the variable A, just replace it with its value”.

julia> A = 2
2

julia> :(A + 2)
:(A + 2)

julia> :($A + 2)
:(2 + 2)

I think the BenchmarkTools.jl README gives a good first impression: https://github.com/JuliaCI/BenchmarkTools.jl

Yes, and in particular it is used by @btime to allow you to benchmark expressions involving global variables without paying the usual performance price in Julia for working with globals.

Without interpolation, an expression like @btime dict[3] includes the time to look up the type of dict and determine (dynamically) the correct getindex method to call for getindex(dict, 3) (= dict[3]), because dict is a non-constant global variable. This is typically not what you want, because real performance-sensitive code in Julia normally occurs in functions and acts on local variables whose types are known to the compiler. By doing @btime $dict[3], you interpolate the value of dict into the expression, allowing the @btime macro to time it with the type known at compile time.
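To make that concrete, interpolating gives you the same timing you would see inside a function, where the compiler knows the types of locals (a sketch; the helper name is made up):

using BenchmarkTools

dict = Dict((1:10^5) .=> rand(10^5))

lookup(d, k) = d[k]   # hot code normally lives in functions like this

@btime $dict[100]          # interpolated global: type fixed when the benchmark compiles
@btime lookup($dict, 100)  # equivalent: local variable with known type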

Check the links to GitHub I posted before; you should find the implementations there. As far as I’m aware, there’s no formal description of which algorithm is used precisely, or why.

There is still a need to substantiate this.

My impression is that it was merely a user error in benchmarking.

It’s worth looking into thoroughly though.