Small <: AbstractDict implementation

question

#1

Is there a package implementing a mutable <: AbstractDict optimized for a small number of elements?

In my application, I would insert 1 element with

get!(() -> new_record(), dict, key)

in about 70–90% of cases, occasionally 2–3, and rarely more (there is no theoretical upper bound, and/or calculating it is impractical).

I thought that a Vector{Pair}, either ordered or unordered, would be an optimal implementation. Just checking before I start a package.


#2

What’s the motivation for this? Why not use a regular Dict? Performance (lookup, or insertion)? Memory usage?

If you’re after maximum performance, one idea is to base the type around a Vector{Pair} like you mention, but hardcode the first 1, 2, or 3 elements (own fields in the type), which you test before you start iterating over the vector.


#3

Mostly performance and memory usage. That said, Dict is very well written so I may just use it.


#4

This is something I’ve also thought would be occasionally handy - I have places where very small dictionaries (1-5 entries) are created in loops, and was curious to know how much overhead Dict had but never had the time to explore it.


#5

I think the performance of Dict is pretty good… and by paying close attention to how it works, the performance can be excellent. Consider this example:

struct MyKey a::Int end
struct MyValue a::Int end

new_record() = MyValue(rand(Int))

d = Dict{MyKey,MyValue}()

A first benchmark yields:

julia> @btime get!(new_record, $d, MyKey(4))
  38.907 ns (1 allocation: 16 bytes)
MyValue(3327136004207544748)

Which is quite disappointing. The first optimization to do is to speed up the hash code:

Base.hash(x::MyKey) = x.a % UInt

New benchmark:

julia> @btime get!(new_record, $d, MyKey(4))
  8.354 ns (0 allocations: 0 bytes)
MyValue(7805519516943331301)

Better, but not great (24 clock cycles on my machine). Now, there’s a trick we can do to achieve much better lookup speeds, by using the more efficient get instead of get!, and wrapping new_record in its own function:

julia> add_new_record(k) = d[k] = new_record();

julia> get_record!(h, k) = get(() -> add_new_record(k), h, k);

julia> @btime get_record!($d, MyKey(4))
  2.253 ns (0 allocations: 0 bytes)
MyValue(7805519516943331301)

2.25 ns, or 6.5 clock cycles… boosted a bit by @btime (warm caches and branch predictors), but still. This approach is a bit hacky and unflexible however, since it binds the dictionary d to get_record (I didn’t find a way to avoid that, while keeping the same performance). If we don’t mind using the internals of Dict (unfortunately, I often find that I have to), a more flexible approach is to define:

@inline function fast_get!(default::Base.Callable, h::Dict{K,V}, key::K) where {K,V}
    index = Base.ht_keyindex(h, key)
    index < 0 ? get!(default, h, key) : @inbounds h.vals[index]::V
end

Which results in the same fast lookups:

julia> @btime fast_get!(new_record, $d, MyKey(4))
  2.257 ns (0 allocations: 0 bytes)

With this performance, I doubt that there’s a need to try to optimize it further. With some exceptions… memory usage being one of them, or if it’s expensive/difficult to calculate hash codes.


#6

I cannot reproduce.

julia> function dict_test(keys)
       h = Dict{eltype(keys), Float64}()
       s = 0.0
       for k in keys
       s += get!(rand, h, k)
       end
       s
       end
dict_test (generic function with 1 method)

julia> function dict_test2(keys)
       h = Dict{eltype(keys), Float64}()
       s = 0.0
       for k in keys
       s += fast_get!(rand, h, k)
       end
       s
       end
julia> keys = rand(1:10_000, 10_000);
julia> @btime dict_test($keys)
  400.073 μs (24 allocations: 364.61 KiB)
4994.545654939051

julia> @btime dict_test2($keys)
  476.200 μs (24 allocations: 364.61 KiB)
5000.061961199275

However, something useful if you need to juggle keys between a lot of dictionaries:

struct Hash_carrier{T}
    h::Int32
    payload::T
 end
@inline Hash_carrier(payload) = Hash_carrier(Base.hashindex(payload, 1<<32) %Int32,  payload)
Base.indexed_iterate(z::Hash_carrier, i, s=nothing) = (getfield(z,i),nothing)
@inline hashindex(key::Hash_carrier, sz) = (((key.h %Int) & (sz-1)) + 1)

Take care to avoid unneeded structure padding. I actually had

struct Hash_carrier{T}
    h::Int32
    cid::Int32
    payload::T
 end

where payload is 8-byte aligned bitstype. Hence, carrying the hash did not cost me any memory at all (otherwise these juicy 4 bytes would have been wasted as padding anyway).

This speeds up rehashing, and is nice if your algorithm requires a lot of filtering between dictionaries and vectors.

It would be super nice if there was a way to get the packing automatically, e.g.

julia> tt = Hash_carrier(0%Int32, (1%Int32, 1, 2, 3));
julia> sizeof(tt) #can be done in 32 bytes
40
julia> sizeof((0%Int32, (1%Int32, 1, 2, 3)))
40
julia> sizeof((0%Int32, 1%Int32, 1, 2, 3))
32

#7

That set of keys results in relatively few lookups where the key already exists. The fast_get! method above was highly optimized for that case (which I assumed to be the case if the dictionary has only 1-3 entries, but doesn’t have to be of course).

Moreover, the default hash code for Int is not very efficient. If you want to work efficiently with Dicts you need to provide a fast hash code. Using your code above, this will show a huge difference:

julia> Base.hash(x::Int) = x % UInt;

julia> keys = rand(1:100, 10_000);

julia> @btime dict_test($keys);
  80.403 μs (10 allocations: 6.58 KiB)

julia> @btime dict_test2($keys);
  22.715 μs (10 allocations: 6.58 KiB)

If you don’t wish to change the hash code globally for Int, you can wrap it in another type. (I wish one could pass a custom hash function to Dict.)


#8

I see. That is relevant if hashing is almost free.

However, it is only due to inlining decisions: Fix that by @eval Base @inline function get!(default::Callable, h::Dict{K,V}, key::K) where V where K and @eval Base @inline function ht_keyindex2!(h::Dict{K,V}, key) where V where K.

The Base code is not really slower, and ht_keyindex2! is the correct approach (hash+probe only once).

I once again wish it was possible to do call-site @inline / @noinline annotations that override callee annotations.


#9

I think this can often be done, since for immutable types, a field can be added to store the hash (if there’s no existing field that can be used), just like in your Hash_carrier type.

Btw, I didn’t get the reason for overloading the internal hashindex instead of hash for Hash_carrier?

Yes, I reckoned that it was inlining that made the difference, since the code is almost identical. But copying and pasting all that code from Dict into my project and prepending it with @inline (which I think is what you mean, or am I misunderstanding?) is not a fix to me. It’s a major hack, that would break the next time Dict is refactored. I’d rather copy all of dict.jl then, to be on the safe side. (And yes! Overriding @inline from the caller is something I’ve long desired as well.)

It also seems like inlining everything can reduce performance. I haven’t looked into why, perhaps code locality, or some compiler effect, but fast_get! becomes quite a bit slower when get! is inlined (even though get! is never called). MWE below.

d = Dict{Int, Float64}()

Base.hash(x::Int) = x % UInt

@inline function fast_get!(default::Base.Callable, h::Dict{K,V}, key::K) where {K,V}
    index = Base.ht_keyindex(h, key)
    index < 0 ? get!(default, h, key) : @inbounds h.vals[index]::V
end

Timing:

julia> @btime get!(rand, $d, 4);
  8.017 ns (0 allocations: 0 bytes)

julia> @btime fast_get!(rand, $d, 4);
  2.466 ns (0 allocations: 0 bytes)

Now inline the relevant methods from dict.jl:

@eval Base begin
@inline function ht_keyindex2!(h::Dict{K,V}, key) where V where K
    ... 47 lines ...
end

@inline function get!(default::Callable, h::Dict{K,V}, key::K) where V where K
    ... 17 lines ...
end
end

Timing again:

julia> @btime get!(rand, $d, 4);
  6.376 ns (0 allocations: 0 bytes)

julia> @btime fast_get!(rand, $d, 4);
  6.699 ns (0 allocations: 0 bytes)

The dict_test2 benchmark from above also suffers, but not as much (I see a ~20 % reduction in performance).


#10

No real reason. Overloading hash is probably the cleaner choice. Since I was dealing with the internals anyway, this was somehow more clear to me.

True. The “fix” would be to define a inline_get! that does the job, by copy-pasting from Base, which breaks the next time Dict is refactored. The @eval Base is an even dirtier way to quickly do this on the REPL, without having to deal with the different namespace (copy-pasted code from Base often needs foo->Base.foo qualifiers). But then, probing twice to work around inline heuristics and lacking call-site control over inlining is painful.

The heuristic really looks incorrect to me: ht_keyindex and ht_keyindex2!, and get and get! should have the same inlining properties.