Lots of places; Dictionaries.jl’s packages are just much faster than Base Dict
s to iterate over. There’s a reason most package designers use them rather than Dict
s.
I would be very interested in seeing a direct benchmark showing that Dict is five times slower than Dictionaries.
Using my benchmark above, it seems that Dictionary
is slower than Dict
for getindex
(mul1!
benchmark), but faster for iterating over pairs(dict)
(mul2!
benchmark).
In particular, this code:
Code
function mul1!(y, D, x)
for i = 1:length(y)
s = zero(eltype(y))
@inbounds for j = 1:length(x)
s += D[CartesianIndex((i,j))] * x[j]
end
y[i] = s
end
return y
end
function mul2!(y, D, x)
y .= 0
for (k,v) in pairs(D)
i,j = Tuple(k)
@inbounds y[i] += v * x[j]
end
return y
end
using Dictionaries, BenchmarkTools
A = rand(1000,1000)
x = rand(1000)
y = similar(x)
D = Dict(pairs(A))
D2 = Dictionary(D)
gives:
julia> @btime mul1!($y, $D, $x); @btime mul1!($y, $D2, $x);
67.968 ms (0 allocations: 0 bytes)
110.422 ms (0 allocations: 0 bytes)
julia> @btime mul2!($y, $D, $x); @btime mul2!($y, $D2, $x);
9.907 ms (0 allocations: 0 bytes)
1.897 ms (0 allocations: 0 bytes)
If you look at the way Dictionary
is implemented, it has separate contiguous arrays of the keys and values in the dictionary, which makes it fast to iterate, at the expense of an additional indirection during getindex
(via a separate array of indices into the contiguous data).
cc @andyferris
Yes. I think Andy writes about this in several places.
Like in Getting Started in the README.
The three main difference to
Dict
are that it preserves the order of elements, it iterates much faster, and it iterates values rather than key-value pairs.
His JuliaCon talk is very clear.
If you do a lot of insertion and deletion and not much iterating over keys and values, then Dict
might be faster. If you do a lot of iterating, and not so much insertion and deletion, then Dictionary
Ah yes, that’s a known trade off for ordered dicts—iteration gets much faster but other operations get a bit slower. Describing that as “Dictionaries are five times faster than Dict” is a bit misleading.
Dictonaries.jl also has the tokens API which helps with performance.
yeah, we really need to add the token API to base.
For a less-gifted individual like me, can you offer a basic example of how to use tokens and how they help?
The idea of a token is that a lot of the time people will write code like the following
D = get_random(dict)
if !(a in keys(D))
D[a] = 1
end
This looks decent, but requires doing a duplicate lookup in D
. Using a token will save a call to hash(a)
and may save some lookup costs.
There is get!
for that pattern. It’s more when you want to update an existing index D[a] += 1
where I don’t think there is a good way to do it now without double lookup.
get!
can be used for my case, but since it isn’t lazy it sometimes isn’t a good option.
There’s a lazy method where the value is computed by a function call.
AFAIK there is no great way to se a Base.Dict to implement a “lazy accumulator”, like this:
if haskey(dict, key)
dict[key] += 1
else
dict[key] = 1
end
This needs three lookups. We can reduce this to two via get!
:
dict[key] = 1 + get!(dict, key, 0)
But I am not aware of way to get it down to 1 (if there is one, I’d love to learn about it).
Add modify! function for lookup/update/insert/delete in one go by tkf · Pull Request #33758 · JuliaLang/julia · GitHub enables this, and to my understanding is a reason why a token-based API doesn’t currently exist in Base. Maybe the current thread will reignite efforts on landing one of the two approaches.
Neither was I… but now I think the following incrementkey!
does it:
using OrderedCollections
incrementkey!(d, k) = ( mergewith!(+, d, LittleDict((k,), (1,))) ;nothing)
classic_incrementkey!(d, k) = ( d[k] = 1 + get!(d, k, 0) ;nothing)
A little test:
julia> using BenchmarkTools
julia> d = Dict(4=>5, 3=>0, 1=>2);
julia> @btime incrementkey!($d, 3)
10.763 ns (0 allocations: 0 bytes)
julia> @btime classic_incrementkey!($d, 3)
14.533 ns (0 allocations: 0 bytes)
The last time I looked into it (a long time ago), you have to go behind the API for Dict
. Whereas (as discussed above) avoiding looking up twice is part of the API for Dictionary
. I wrote some things towards giving Dict
and Dictionary
a common interface here DictTools.jl. In one place, it avoids computing the hash twice like this
@inline function update!(dict::AbstractDictionary{T, V}, _key::T, func::F, default) where {F, T, V}
(hasval, token) = gettoken(dict, _key)
if hasval
settokenvalue!(dict, token, func(gettokenvalue(dict, token)))
else
insert!(dict, _key, default)
end
end
# Stolen from StatsBase. This is faster than naive method
@inline function update!(dict::Dict{T, V}, _key::T, func::F, default) where {F, T, V}
index = Base.ht_keyindex2!(dict, _key)
if index > 0
@inbounds dict.vals[index] = func(dict.vals[index])
else
@inbounds Base._setindex!(dict, default, _key, -index)
end
end
After using DictTools.jl
, I think that in general trying to use a common interface is too much trouble.
But one of the main reasons that Andy wrote Dictionaries
is to provide an associative array with an AbstractArray
-like interface. I have found this useful. It is much easier to write a generic function that will take either a Vector
or a Dictionary
.
I first started DictTools
for a couple of things: Provide a count map for Dict
that accepts an iterator, rather than just a AbstractVector
. Provide a count map for Dictionary
. It’s natural to try to abstract part of this away with a common interface to the two implementations of dictionaries. And then to support Vector
as well.
Fwiw, adding a token API to Dict would be a reasonable thing. Someone just need to make a proposal/implementation. Could be based on Dictionaries.
Agreed - there’s a lot of AbstractDict
API work that could happen in Base
without breaking changes.
A post was split to a new topic: Dictionary.jl’s token API