Performance of Dictionaries.jl vs Base.Dict

ParadaCarleton · December 22, 2022, 4:58pm

Lots of places; Dictionaries.jl’s packages are just much faster than Base Dicts to iterate over. There’s a reason most package designers use them rather than Dicts.

StefanKarpinski · December 23, 2022, 9:53pm

I would be very interested in seeing a direct benchmark showing that Dict is five times slower than Dictionaries.

stevengj · December 24, 2022, 1:58am

Using my benchmark above, it seems that Dictionary is slower than Dict for getindex (mul1! benchmark), but faster for iterating over pairs(dict) (mul2! benchmark).

In particular, this code:

Code

function mul1!(y, D, x)
    for i = 1:length(y)
        s = zero(eltype(y))
        @inbounds for j = 1:length(x)
            s += D[CartesianIndex((i,j))] * x[j]
        end
        y[i] = s
    end
    return y
end

function mul2!(y, D, x)
    y .= 0
    for (k,v) in pairs(D)
        i,j = Tuple(k)
        @inbounds y[i] += v * x[j]
    end
    return y
end

using Dictionaries, BenchmarkTools
A = rand(1000,1000)
x = rand(1000)
y = similar(x)
D = Dict(pairs(A))
D2 = Dictionary(D)

gives:

julia> @btime mul1!($y, $D, $x); @btime mul1!($y, $D2, $x);
  67.968 ms (0 allocations: 0 bytes)
  110.422 ms (0 allocations: 0 bytes)

julia> @btime mul2!($y, $D, $x); @btime mul2!($y, $D2, $x);
  9.907 ms (0 allocations: 0 bytes)
  1.897 ms (0 allocations: 0 bytes)

If you look at the way Dictionary is implemented, it has separate contiguous arrays of the keys and values in the dictionary, which makes it fast to iterate, at the expense of an additional indirection during getindex (via a separate array of indices into the contiguous data).

ParadaCarleton · December 24, 2022, 2:09am

cc @andyferris

jlapeyre · December 24, 2022, 5:59am

Yes. I think Andy writes about this in several places.
Like in Getting Started in the README.

The three main difference to Dict are that it preserves the order of elements, it iterates much faster, and it iterates values rather than key-value pairs.

His JuliaCon talk is very clear.

If you do a lot of insertion and deletion and not much iterating over keys and values, then Dict might be faster. If you do a lot of iterating, and not so much insertion and deletion, then Dictionary

StefanKarpinski · December 25, 2022, 5:05pm

Ah yes, that’s a known trade off for ordered dicts—iteration gets much faster but other operations get a bit slower. Describing that as “Dictionaries are five times faster than Dict” is a bit misleading.

jar1 · December 25, 2022, 6:41pm

Dictonaries.jl also has the tokens API which helps with performance.

Oscar_Smith · December 25, 2022, 6:45pm

yeah, we really need to add the token API to base.

uniment · December 25, 2022, 8:24pm

For a less-gifted individual like me, can you offer a basic example of how to use tokens and how they help?

Oscar_Smith · December 25, 2022, 8:29pm

The idea of a token is that a lot of the time people will write code like the following

D = get_random(dict)
if !(a in keys(D))
    D[a] = 1
end

This looks decent, but requires doing a duplicate lookup in D. Using a token will save a call to hash(a) and may save some lookup costs.

kristoffer.carlsson · December 25, 2022, 8:44pm

There is get! for that pattern. It’s more when you want to update an existing index D[a] += 1 where I don’t think there is a good way to do it now without double lookup.

Oscar_Smith · December 25, 2022, 8:57pm

get! can be used for my case, but since it isn’t lazy it sometimes isn’t a good option.

StefanKarpinski · December 25, 2022, 11:46pm

There’s a lazy method where the value is computed by a function call.

fingolfin · December 26, 2022, 12:01am

AFAIK there is no great way to se a Base.Dict to implement a “lazy accumulator”, like this:

   if haskey(dict, key)
    dict[key] += 1
  else
    dict[key] = 1
  end

This needs three lookups. We can reduce this to two via get!:

    dict[key] = 1 + get!(dict, key, 0)

But I am not aware of way to get it down to 1 (if there is one, I’d love to learn about it).

ToucheSir · December 26, 2022, 12:05am

Add modify! function for lookup/update/insert/delete in one go by tkf · Pull Request #33758 · JuliaLang/julia · GitHub enables this, and to my understanding is a reason why a token-based API doesn’t currently exist in Base. Maybe the current thread will reignite efforts on landing one of the two approaches.

Dan · December 26, 2022, 1:06am

Neither was I… but now I think the following incrementkey! does it:

using OrderedCollections

incrementkey!(d, k) = ( mergewith!(+, d, LittleDict((k,), (1,))) ;nothing)
classic_incrementkey!(d, k) = ( d[k] = 1 + get!(d, k, 0) ;nothing)

A little test:

julia> using BenchmarkTools

julia> d = Dict(4=>5, 3=>0, 1=>2);

julia> @btime incrementkey!($d, 3)
  10.763 ns (0 allocations: 0 bytes)

julia> @btime classic_incrementkey!($d, 3)
  14.533 ns (0 allocations: 0 bytes)

jlapeyre · December 26, 2022, 3:10pm

The last time I looked into it (a long time ago), you have to go behind the API for Dict. Whereas (as discussed above) avoiding looking up twice is part of the API for Dictionary. I wrote some things towards giving Dict and Dictionary a common interface here DictTools.jl. In one place, it avoids computing the hash twice like this

@inline function update!(dict::AbstractDictionary{T, V}, _key::T, func::F, default) where {F, T, V}
    (hasval, token) = gettoken(dict, _key)
    if hasval
        settokenvalue!(dict, token, func(gettokenvalue(dict, token)))
    else
        insert!(dict, _key, default)
    end
end


# Stolen from StatsBase. This is faster than naive method
@inline function update!(dict::Dict{T, V}, _key::T, func::F, default) where {F, T, V}
    index = Base.ht_keyindex2!(dict, _key)
    if index > 0
        @inbounds dict.vals[index] = func(dict.vals[index])
    else
        @inbounds Base._setindex!(dict, default, _key, -index)
    end
end

After using DictTools.jl, I think that in general trying to use a common interface is too much trouble.

But one of the main reasons that Andy wrote Dictionaries is to provide an associative array with an AbstractArray-like interface. I have found this useful. It is much easier to write a generic function that will take either a Vector or a Dictionary.

I first started DictTools for a couple of things: Provide a count map for Dict that accepts an iterator, rather than just a AbstractVector. Provide a count map for Dictionary. It’s natural to try to abstract part of this away with a common interface to the two implementations of dictionaries. And then to support Vector as well.

StefanKarpinski · December 28, 2022, 12:51am

Fwiw, adding a token API to Dict would be a reasonable thing. Someone just need to make a proposal/implementation. Could be based on Dictionaries.

andyferris · December 30, 2022, 7:08am

Agreed - there’s a lot of AbstractDict API work that could happen in Base without breaking changes.

mbauman · December 30, 2022, 3:36pm

A post was split to a new topic: Dictionary.jl’s token API