I am facing a performance issue that I find hard to understand, so I am hoping I can gain some insights here. I have written code to solve an economic problem using fixed point iteration, also known as policy function iteration. The problem is defined on the Cartesian product of a number of grids.
The issue is the following: if I increase the upper bound of two of the grids, namely kgrid and bgrid, by a factor of 2, the speed of the algorithm drops by roughly a factor of 2. I keep everything else equal, including the number of grid points. I tested the programme and got the following results:
If the upper bound is 100, the time until convergence is 877.157062 seconds.
If the upper bound is 200, the time until convergence is 1537.049566 seconds.
I have included the code in this post. I have tried profiling the code, but it did not make things clearer for me. Any suggestions to alleviate this problem would be much appreciated!
I am using Julia version 1.10.4,
Interpolations version 0.15.1,
NLsolve version 4.5.1.
That’s a lot of code and a very long runtime, so I won’t have time to dig into it, but
(1) this looks a lot like Matlab, with many non-Julian and potentially performance-reducing idioms (collect all over the place, slicing arrays without taking views, creating lots of arrays even in tight loops) and
(2) as a result when running this with just 100 iterations I get
29.862649 seconds (698.93 M allocations: 31.719 GiB, 12.22% gc time)
At that level of allocations it’s not really helpful to reason about the performance behaviour of multithreaded code. My advice is to read the Performance tips and make sure they’re all adhered to, get rid of the threading, and work on getting your hot loop allocation-free (or at least as close to that as possible; it might not be entirely achievable with NLsolve) and as fast as possible, then re-introduce threading.
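To make point (1) concrete, here is a minimal sketch of the pattern I mean; `policy`, `grid`, `buf` and the two functions are hypothetical stand-ins, not names from your code:

```julia
# Allocating style: copies a column and collects the grid on every iteration,
# and the right-hand side of the assignment creates yet another temporary.
function iterate_policy_allocating(policy, grid)
    out = similar(policy)
    for j in axes(policy, 2)
        col = policy[:, j]          # slice: copies the column each time
        x = collect(grid)           # collect: a fresh array each time
        out[:, j] = col .* x        # allocates a temporary result vector
    end
    return out
end

# Allocation-free style: views instead of slices, one preallocated buffer,
# and fused in-place broadcasts.
function iterate_policy_inplace!(out, buf, policy, grid)
    for j in axes(policy, 2)
        col = @view policy[:, j]    # view: no copy
        buf .= col .* grid          # fused broadcast into the reused buffer
        out[:, j] .= buf            # in-place write, no temporaries
    end
    return out
end

# usage: out = similar(policy); buf = zeros(size(policy, 1));
#        iterate_policy_inplace!(out, buf, policy, grid)
```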
Thank you for your suggestions. I have read the Performance tips and tried profiling allocations, but I am no expert. I tried to preallocate as many arrays as possible and to use in-place updates as well. I was able to reduce allocations to about a quarter of what they were, but this is as far as I get.
Do you perhaps have any other suggestions for decreasing the number of allocations? pfi_noslice.jl (7.0 KB)
This looks a lot better, for me the difference from old to new is:
88.116497 seconds (1.11 G allocations: 50.229 GiB, 4.38% gc time, 1.33% compilation time: 60% of which was recompilation)
24.218082 seconds (65.35 M allocations: 3.360 GiB, 1.24% gc time, 4.28% compilation time: 59% of which was recompilation)
(for 100 iterations). A couple of simple changes get me to:
19.605032 seconds (39.38 M allocations: 1.833 GiB, 0.80% gc time, 3.45% compilation time)
namely (a combined sketch follows the list):
pass FOC as an argument to function policy_function_iterate_simultaneous
Preallocate the X₀ outside the loop (X₀ = zeros(3)) and update it inside X₀ .= (c₀[i, j, z1, z2, z3], k₀[i, j, z1, z2, z3], b₀[i, j, z1, z2, z3])
Add a view to the slice of PI in the F! function: @views(PI[z,:])
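Put together, a stripped-down sketch of what I mean (array names, shapes, and the FOC! signature are hypothetical, not your actual code):

```julia
using NLsolve

# Rough sketch combining the three changes above on a simplified stand-in
# for the real policy_function_iterate_simultaneous.
function policy_function_iterate_simultaneous!(c₀, k₀, b₀, FOC!, PI)
    X₀ = zeros(3)                          # preallocated once, outside the loops
    for z3 in axes(c₀, 5), z2 in axes(c₀, 4), z1 in axes(c₀, 3),
        j in axes(c₀, 2), i in axes(c₀, 1)
        # update the initial guess in place instead of building a new vector
        X₀ .= (c₀[i, j, z1, z2, z3], k₀[i, j, z1, z2, z3], b₀[i, j, z1, z2, z3])
        # FOC! is passed in as an argument and closed over the current indices;
        # the closure and nlsolve itself will still allocate a little per call
        sol = nlsolve((F, X) -> FOC!(F, X, i, j, z1, z2, z3, PI), X₀)
        c₀[i, j, z1, z2, z3], k₀[i, j, z1, z2, z3], b₀[i, j, z1, z2, z3] = sol.zero
    end
    return c₀, k₀, b₀
end

# and inside FOC! / F!, take a view of the transition-matrix row rather than
# slicing it:  π_z = @views PI[z, :]
```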
Beyond that there’s nothing that jumps out from a cursory look, but my gut feeling is that 40M allocations and 1.8 GiB is still a lot, unless it all comes from the calls to nlsolve and is unavoidable?
So I would try to quite carefully benchmark policy_function_iterate_simultaneous to see how much it allocates and where these allocations come from - there’s likely a better way to do the updating of the FOC but I’d have to think about it more closely.
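As a rough baseline for what is unavoidable, you could benchmark a single nlsolve call on a toy system of the same size as one FOC solve (this is a stand-in, not your actual FOC):

```julia
using BenchmarkTools, NLsolve

# Toy 3-equation system standing in for one FOC evaluation
function toy_foc!(F, X)
    F[1] = X[1] + X[2] - 1.0
    F[2] = X[2] - X[3]^2
    F[3] = X[1] + X[2] + X[3] - 1.5
    return F
end

X₀ = [0.5, 0.5, 0.5]
@btime nlsolve(toy_foc!, $X₀)
# Multiply the reported allocations by the number of grid points visited per
# sweep; if that already accounts for most of the ~40M, the rest is probably
# NLsolve overhead rather than something in your own loops.
```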