Dictionary construction is quite slow?

christmas_jones · June 28, 2022, 9:40pm

Just comparing python to julia for dictionary construction, and it seems very slow. In my particular use case it is much worse even than the MWE below (where my keys are more complicated).
So my question is, what is the best way to set-up (and update) a dictionary in Julia? Note, in general I do not know the keys ahead of time in my use case of interest (I know their type however), and I construct basically in the manner below:

using Dates

rands = rand(10^7)
d = Dict{Int, Float64}()

tic = now()
for (i, r) in enumerate(rands)
    d[i] = r
end
toc = now()

print(toc-tic)

I find this takes around 4s on my machine (Macbook Pro M1 chip).

Compare to Python which is less than 1s on the same machine.

from time import time
import numpy as np

rands = np.random.rand(10**7)
d = {}
tic = time()
for i, r in enumerate(rands):
    d[i] = r
toc = time()

print(toc-tic)

I’m using the ‘experimental’ V1.8 for Mac M1 for the above particular times, so perhaps this is a factor, but, I seem to get similar results on the Intel V1.7, so I don’t know if that is something to consider…

Any help appreciated!

Oscar_Smith · June 28, 2022, 9:42pm

stop timing in the global score.

christmas_jones · June 28, 2022, 9:45pm

Hi!

Can you elaborate on that?

The post was just an example, but in my actual code when i’m not doing this naive timing method, it is unambiguously very slow, e.g. even if I just time it on my stopwatch and find the python code takes e.g. one min and the other takes over 10 mins, where all it is doing is creating and updating dicts.

Thanks for the quick response!

Oscar_Smith · June 28, 2022, 9:57pm

I recommend reading Performance Tips · The Julia Language

function f(d,rands)
    for (i, r) in enumerate(rands)
        d[i] = r
    end
end
rands = rand(10^7)
@time  f(Dict{Int, Float64}(), rands);
0.857354 seconds (72 allocations: 541.170 MiB, 1.51% gc time)

christmas_jones · June 28, 2022, 10:06pm

Wow, thanks for that, this has been unexpected to me!

Does this mean generally one should put dictionary updates/construction inside specific functions? Even if the update is already occurring inside some larger function?

Thanks for the link, I will certainly take a look.

Oscar_Smith · June 28, 2022, 10:10pm

In general, you should do any computation that takes time in a function.

christmas_jones · June 28, 2022, 10:13pm

Thanks a lot for the help. This is great to know! I gave you the ‘solution’. Have a nice day!

stillyslalom · June 28, 2022, 10:23pm

When iteratively constructing a large data structure of known final size, you can often speed things up a bit more with sizehint!, which allows Julia to allocate a sufficiently large data structure from the outset rather than repeatedly re-allocating more memory as the structure grows:

julia> @time  f(Dict{Int, Float64}(), rands)
  1.085096 seconds (72 allocations: 541.170 MiB)

julia> @time  f(sizehint!(Dict{Int, Float64}(), length(rands)), rands)
  0.953690 seconds (8 allocations: 272.001 MiB)

christmas_jones · June 28, 2022, 10:28pm

Thanks also for the tip! Didn’t know about this either

stillyslalom · June 28, 2022, 10:33pm

One more tip: Julia will take care of the right-sizing for you if you fill the Dict with an iterator.

julia> g(v) = Dict(i => r for (i, r) in enumerate(v))
g (generic function with 1 method)

julia> @time g(rands)
  0.754952 seconds (7 allocations: 272.001 MiB)

christmas_jones · June 28, 2022, 10:40pm

Thanks all! You guys are the best! I managed to improve my code performance by using functions more throughout my code, and using the sizehint

dylanxyz · June 28, 2022, 10:40pm

Even if the known-size is not the real final size, but is close to it, using sizehint! would still work?

mikmoore · June 28, 2022, 10:53pm

To clarify: no.

Yes, anything performance sensitive should be inside a function. No, it doesn’t have to be nested further inside another function call (exception: function barriers can be useful in certain situations to minimize disruptions from uncertain types).

As mentioned, the issue with your initial benchmark was that you were running naked code in the REPL. This is fine, but it’s unlikely to achieve the same performance it could inside a function. A function scope allows the compiler to assume things it couldn’t safely do in a generic setting.

christmas_jones · June 28, 2022, 11:03pm

That’s good to know. In my use case I had a main() (julia) function called from python via pyjulia, and I found that inside that main() it was beneficial to put certain components inside functions. I suppose in this case the main() was acting in effect like naked code, since it’s called at the highest level.

mikmoore · June 29, 2022, 3:05pm

The code being within your main function is all that should be strictly necessary – although writing more small-ish functions (rather than fewer mega-functions) is typically a good policy in Julia anyway. It’s likely that your additional compartmentalization served as functions barriers to enable better optimization within them.

One of the biggest advantages of a function scope is the ability to reason precisely about variable types. For this to actually happen, however, it is important that your code be as type stable as possible. Unstable code requires extra machinery to handle uncertain types. Function barriers mitigate this by consolidating the uncertainty to a single point so that it only needs to be handled once.

The fact that your performance improves with compartmentalization suggests that your main function might contain type instability. Julia is a dynamic language so that’s totally okay. If you do, however, manage to eliminate those instabilities (that I have only hypothesized to exist) you may squeeze out a little extra performance. Although I offer no promises that you aren’t already 95% of the way there.

Consider calling @code_warntype main(<typical arguments>) and looking for ::Any annotations or other parts highlighted in red. Those indicate places where variables have uncertain types (::Union types appear in yellow but are rarely cause for concern). Adjusting your code so that the compiler can fully infer those variable types is important in performance-sensitive code segments. The Cthulhu package offers support for interactively navigating more complicated functions with similar information to @code_warntype. If you have some relatively small functions that you’re struggle to make stable, consider opening another post and people here can help.

christmas_jones · June 29, 2022, 4:16pm

Thanks for the detailed comment! It is appreciated, i’m learning a lot here!

Yes this could have been the reason (though I can’t say for sure as i’m not knowledgeable enough about julia compilation); it seems when Python passes certain objects via pyjulia they always come through as ‘Any’ types. In my case, a python dictionary came as Dict[Any, Any], despite the fact the python dictionary was a dict of N integer tuples to complex. This is understandable since python does not have fixed (or as rigorous) types, and I could not figure out how to get the pyjulia call to pass a dict except of that generic type.

So perhaps by putting in functions inside main() it allowed the compiler to specialize more given that generic input. Maybe I could have got away with converting the input type up front in main() within Julia to a more specific dictionary type.

Thanks again!

Topic		Replies	Views
Poor time performance on Dict? Performance	26	19086	March 12, 2018
Performance of Dictionaries.jl vs Base.Dict Internals & Design package , performance , dictionary , dictionaries	32	3220	December 13, 2024
Performance of Dict Performance	1	956	November 12, 2019
Code seems slow New to Julia	6	639	October 30, 2018
Best performance using dictionaries, functions and modules Performance question	8	868	June 5, 2020

Dictionary construction is quite slow?

Related topics