Performance issues when working with dict

Hello,

I have some performance issues with dictionnaries, and I’m unsure about how to approach this problem. I profiled my code the other day, and I found that it spent a fair amount of time (that I wasn’t expecting) on the “getindex” function when fetching values with Dict.

The dictionary have around 100 keys (of custom immutable struct), each pointing to an array of custom immutable struct. The size of these arrays vary between themselves, and during run time, with a maximum of around 1000 elements.

I don’t really know how to approach the problem. Are dict intrinsically slow, and I should use another data structure ? Or is there maybe some implementation issues on my end ? Some workaround ?

Thanks for your help

Welcome! A few more details are needed here: what are the type parameters of your dict? Is it a Dict{Any,Any} or a Dict{YourCustomImmutableStruct, #= something =#}? Have you customized hashing for your immutable struct? What are its fields?

1 Like

Thanks for your reply, it is a Dict{order, Array{route,1}}, with order and routes being immutable struct. The dict is created with empty arrays, that are filled during run time. The order objects have two fields: an id, and an array (less than 10 elements) of other simple immutable struct.

I don’t understand the part about customized hashing, so I guess not.

Are order and route concrete types, i.e. isconcretetype(order)==true? If not that might be part of the problem.

Concerning the hashing, see Collections and Data Structures · The Julia Language and

help?> hash
search: hash hasmethod haskey hasfield hasproperty skipchars Threads MathConstants searchsortedlast

  hash(x[, h::UInt])

  Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y). The optional second argument h is a hash code to be mixed with the result.

  New types should implement the 2-argument form, typically by calling the 2-argument hash method recursively in order to mix hashes of the contents with each other (and with h). Typically,
  any type that implements hash should also implement its own == (hence isequal) to guarantee the property mentioned above. Types supporting subtraction (operator -) should also implement
  widen, which is required to hash values inside heterogeneous arrays.

If you implement custom hashing, consider using:
https://github.com/andrewcooke/AutoHashEquals.jl

2 Likes

Thanks for your answers!

So I had already implemented custom methods for == and isequal with my structures, but not for hash. I did it, and the performances improved significantly.

4 Likes

I’m facing a similar problem where I create a Dict{MyType, Float64} and fill it during runtime. Then I access this Dict many times in a loop using MyType.
When I run the profiler in this loop, I see that most of the time is spent in getindex.

Before, my type definition was (and I had a custom hash definiton)

struct MyType{S}
    x::Vector{S}
    y::S
end

I followed your suggestion and made it into a concrete type and used AutoHashEquals.jl so that

@auto_hash_equals struct MyType
    x::Vector{Int64}
    x::Int64
end

but performance has barely changed and the output from the profiler looks basically the same.

I know it is very difficult to troubleshoot without a MWE, but is there any other reason this could be happening?

According to your description, MyType in Dict{MyType, Float64} is incompletely specified and will result in a performance penalty. Try Dict{MyType{Float64}, Float64} (or whatever type you’re using for S). For maximum performance, you’ll also likely need to define Base.hash(::MyType,::UInt) as discussed in the earlier replies.

I think the issue you’re seeing is that hashing MyType (with auto_hash_equals too) involves hashing each of its individual fields. That, in turn, means hashing the Vector{Int64} field, and hashing a vector requires iterating over the vector and hashing all of the elements (for very long vectors it doesn’t hash all the elements, but it still does a lot of hashing). That’s a lot of work that has to be done every time you check any key in the dictionary.

I haven’t found a really satisfying answer to this. For keys you don’t plan to mutate, I’ve found it convenient to make the hash a field of the type (constructed automatically in the inner constructor) and then do Base.hash(x::MyType, y::UInt) = x.hash

2 Likes

Thank you for your answer! My Vector{Int64} is actually just a 3-element vector, so I think I might as well just separate it into individual elements and see if performance improves.

However, if you have the time, could you elaborate a little bit on your partial solution? What do you mean exactly by making the hash a field of the type?

Consider using StaticArrays and then the SVector{3,Int64} type instead of Vector{Int64}. SVectors provide really exceptional performance for small fixed-size arrays and won’t require you to re-implement any array logic or math with your custom type. If you don’t need it to have linear algebraic behavior, you could also just use a NTuple{3,Int64} to avoid the dependency on StaticArrays while achieving this performance. Either of these should save you from needing to implement your own hashing or other stuff.

1 Like

Oh, the idea was basically to add a hash::UInt field to the struct itself. I then compute that value (by hashing the other fields as usual) inside the inner constructor, and then I make Base.hash(x::MyType, seed::UInt) = hash(x.hash, seed). Something like:

julia> struct MyType
         x::NTuple{3, Int}
         hash::UInt
         
         MyType(x) = new(x, hash(x))
       end

julia> Base.hash(m::MyType, seed::UInt) = hash(m.hash, seed)

The key being that you now only call hash(x) once at construction time, rather than every time you hash the struct.

This only works if you actually avoid mutating the contents of the struct after construction time and if you also define == to be consistent with hash, so I’m not convinced it’s a great idea.

1 Like

(OP here) what I did to solve my problem was to define custom functions for ==, isequal and hash. The getindex was still a bit slow, but it wasn’t the bottleneck anymore. Since my structures are immutable, I can just give them a unique ID, as a field of the structure. For a structure called loc (used as keys of my dict) I just added this piece of code:

Base.:(==)(a::loc,b::loc) = (a.id == b.id)
Base.isequal(a::loc,b::loc) = isequal(a.id, b.id)
Base.hash(a::loc, h::UInt) = hash(a.id, hash(:loc,h))

I hope it works for you.