I have the following scenario. The function I am evaluating uses multi-threading and takes 5 to 6 s to finish. But sometimes, doing exactly the same thing, it takes two to three times as long. The machine is not running anything else. I am using @threads, and I initially suspected that one of the threads could have stalled for a while, so I tried using @spawn to allow for an uneven distribution of the tasks. The distribution became uneven, but the slowdowns did not go away. Any insights?
For example:
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
4.728705 seconds (754.30 k allocations: 639.686 MiB, 1.85% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
8.343879 seconds (754.29 k allocations: 639.686 MiB, 1.69% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
14.456259 seconds (754.30 k allocations: 639.686 MiB, 0.34% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
11.053437 seconds (754.30 k allocations: 639.686 MiB, 1.57% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
4.808746 seconds (754.30 k allocations: 639.686 MiB, 3.78% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])
If anyone wants to inspect this further, the example can be reproduced with:
julia> ] add CellListMap
julia> using CellListMap
julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
The parallel loop is currently structured as:
@threads for threadid in 1:nthreads()
    for i in splitter(threadid, n_cells_with_real_particles)
        cellᵢ = cl.cells[cl.cell_indices_real[i]]
        output_threaded[threadid] =
            inner_loop!(f, box, cellᵢ, cl, output_threaded[threadid], threadid)
        show_progress && next!(p)
    end
end
Thus, each thread is handed its whole share of the work up front, which is why I suspected that if one thread gets stalled for some time, everything slows down. So I changed that to:
@sync Threads.@spawn for i in 1:n_cells_with_real_particles
    threadid = Threads.threadid()
    cellᵢ = cl.cells[cl.cell_indices_real[i]]
    output_threaded[threadid] =
        inner_loop!(f, box, cellᵢ, cl, output_threaded[threadid], threadid)
end
But the performance got much worse (30 s). I am probably doing something wrong there (or the work done per iteration is too small to be worth spawning), although the distribution of the work across threads did become uneven, which I expected would solve the “stalled” thread problem, if that was the cause.
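One thing worth checking in the @spawn version above: as written, Threads.@spawn wraps the entire for loop, so it creates a single task that runs every iteration, which would be consistent with a roughly serial run time. Below is a minimal sketch of a chunked alternative, where each task gets a contiguous block of indices and its own local accumulator instead of indexing by threadid(). The names work, chunked_sum and ntasks are hypothetical stand-ins (work plays the role of inner_loop!), not part of CellListMap, and the chunk count is only a guess at a reasonable default.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for the per-cell work done by inner_loop!
work(i) = sum(sin(i * k) for k in 1:1_000)

function chunked_sum(n; ntasks = 4 * nthreads())
    # Split 1:n into contiguous chunks; more chunks than threads lets the
    # scheduler rebalance if one task is slower than the others.
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do chunk
        @spawn begin
            acc = 0.0  # task-local accumulator, independent of threadid()
            for i in chunk
                acc += work(i)
            end
            acc
        end
    end
    # Wait for every task and reduce the partial results
    return sum(fetch, tasks)
end
```

Calling chunked_sum(n) should agree with sum(work, 1:n) up to floating-point rounding. Whether this pattern removes the timing fluctuations is not something the sketch can show; it only avoids the single-task loop and the dependence on threadid().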