Performance issues when working with dict

Deoch · January 13, 2022, 4:49pm

Hello,

I have some performance issues with dictionnaries, and I’m unsure about how to approach this problem. I profiled my code the other day, and I found that it spent a fair amount of time (that I wasn’t expecting) on the “getindex” function when fetching values with Dict.

The dictionary have around 100 keys (of custom immutable struct), each pointing to an array of custom immutable struct. The size of these arrays vary between themselves, and during run time, with a maximum of around 1000 elements.

I don’t really know how to approach the problem. Are dict intrinsically slow, and I should use another data structure ? Or is there maybe some implementation issues on my end ? Some workaround ?

Thanks for your help

mbauman · January 13, 2022, 4:52pm

Welcome! A few more details are needed here: what are the type parameters of your dict? Is it a Dict{Any,Any} or a Dict{YourCustomImmutableStruct, #= something =#}? Have you customized hashing for your immutable struct? What are its fields?

Deoch · January 13, 2022, 5:02pm

Thanks for your reply, it is a Dict{order, Array{route,1}}, with order and routes being immutable struct. The dict is created with empty arrays, that are filled during run time. The order objects have two fields: an id, and an array (less than 10 elements) of other simple immutable struct.

I don’t understand the part about customized hashing, so I guess not.

mauro3 · January 14, 2022, 7:32am

Are order and route concrete types, i.e. isconcretetype(order)==true? If not that might be part of the problem.

Concerning the hashing, see Collections and Data Structures · The Julia Language and

help?> hash
search: hash hasmethod haskey hasfield hasproperty skipchars Threads MathConstants searchsortedlast

  hash(x[, h::UInt])

  Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y). The optional second argument h is a hash code to be mixed with the result.

  New types should implement the 2-argument form, typically by calling the 2-argument hash method recursively in order to mix hashes of the contents with each other (and with h). Typically,
  any type that implements hash should also implement its own == (hence isequal) to guarantee the property mentioned above. Types supporting subtraction (operator -) should also implement
  widen, which is required to hash values inside heterogeneous arrays.

If you implement custom hashing, consider using:
https://github.com/andrewcooke/AutoHashEquals.jl

Deoch · January 14, 2022, 3:43pm

Thanks for your answers!

So I had already implemented custom methods for == and isequal with my structures, but not for hash. I did it, and the performances improved significantly.

cosmia · November 11, 2022, 12:19am

I’m facing a similar problem where I create a Dict{MyType, Float64} and fill it during runtime. Then I access this Dict many times in a loop using MyType.
When I run the profiler in this loop, I see that most of the time is spent in getindex.

Before, my type definition was (and I had a custom hash definiton)

struct MyType{S}
    x::Vector{S}
    y::S
end

I followed your suggestion and made it into a concrete type and used AutoHashEquals.jl so that

@auto_hash_equals struct MyType
    x::Vector{Int64}
    x::Int64
end

but performance has barely changed and the output from the profiler looks basically the same.

I know it is very difficult to troubleshoot without a MWE, but is there any other reason this could be happening?

mikmoore · November 11, 2022, 3:44pm

According to your description, MyType in Dict{MyType, Float64} is incompletely specified and will result in a performance penalty. Try Dict{MyType{Float64}, Float64} (or whatever type you’re using for S). For maximum performance, you’ll also likely need to define Base.hash(::MyType,::UInt) as discussed in the earlier replies.

rdeits · November 11, 2022, 4:39pm

I think the issue you’re seeing is that hashing MyType (with auto_hash_equals too) involves hashing each of its individual fields. That, in turn, means hashing the Vector{Int64} field, and hashing a vector requires iterating over the vector and hashing all of the elements (for very long vectors it doesn’t hash all the elements, but it still does a lot of hashing). That’s a lot of work that has to be done every time you check any key in the dictionary.

I haven’t found a really satisfying answer to this. For keys you don’t plan to mutate, I’ve found it convenient to make the hash a field of the type (constructed automatically in the inner constructor) and then do Base.hash(x::MyType, y::UInt) = x.hash

cosmia · November 11, 2022, 8:34pm

Thank you for your answer! My Vector{Int64} is actually just a 3-element vector, so I think I might as well just separate it into individual elements and see if performance improves.

However, if you have the time, could you elaborate a little bit on your partial solution? What do you mean exactly by making the hash a field of the type?

mikmoore · November 11, 2022, 8:43pm

Consider using StaticArrays and then the SVector{3,Int64} type instead of Vector{Int64}. SVectors provide really exceptional performance for small fixed-size arrays and won’t require you to re-implement any array logic or math with your custom type. If you don’t need it to have linear algebraic behavior, you could also just use a NTuple{3,Int64} to avoid the dependency on StaticArrays while achieving this performance. Either of these should save you from needing to implement your own hashing or other stuff.

rdeits · November 15, 2022, 6:00pm

Oh, the idea was basically to add a hash::UInt field to the struct itself. I then compute that value (by hashing the other fields as usual) inside the inner constructor, and then I make Base.hash(x::MyType, seed::UInt) = hash(x.hash, seed). Something like:

julia> struct MyType
         x::NTuple{3, Int}
         hash::UInt
         
         MyType(x) = new(x, hash(x))
       end

julia> Base.hash(m::MyType, seed::UInt) = hash(m.hash, seed)

The key being that you now only call hash(x) once at construction time, rather than every time you hash the struct.

This only works if you actually avoid mutating the contents of the struct after construction time and if you also define == to be consistent with hash, so I’m not convinced it’s a great idea.

Deoch · November 16, 2022, 2:05pm

(OP here) what I did to solve my problem was to define custom functions for ==, isequal and hash. The getindex was still a bit slow, but it wasn’t the bottleneck anymore. Since my structures are immutable, I can just give them a unique ID, as a field of the structure. For a structure called loc (used as keys of my dict) I just added this piece of code:

Base.:(==)(a::loc,b::loc) = (a.id == b.id)
Base.isequal(a::loc,b::loc) = isequal(a.id, b.id)
Base.hash(a::loc, h::UInt) = hash(a.id, hash(:loc,h))

I hope it works for you.

Topic		Replies	Views
Poor time performance on Dict? Performance	26	19014	March 12, 2018
Clarification about memory management of immutable and mutable struct Internals & Design data_structures	12	5080	November 14, 2019
Dictionary with custom hash function General Usage question , dictionary , hash	7	2481	November 23, 2023
Small <: AbstractDict implementation General Usage question	9	1103	February 13, 2019
Performance of Dict Performance	1	939	November 12, 2019

Performance issues when working with dict

Related topics