Hello,
I have a question about parallelization in Julia.
I am working on a Dell dual-socket Xeon Gold 5220R workstation (2x24 cores, 2x48 threads) running Windows 11.
I was initially on Windows 10, but as this old post ( @threads uses only half the number of nthreads() - #13 by Chris_Green ) indicates, Windows 10 uses “Processor Groups” (Processor Groups - Win32 apps | Microsoft Learn), which (roughly speaking) prevents Julia from using both sockets. Even though Julia sees all cores/threads (Threads.nthreads() does return 96 = 2x48), the code only runs on 24 cores/48 threads. As mentioned in that thread, this “Processor Group” mechanism seems to have been abolished (finally…) in Windows 11, in favour of a “Primary Group” that spans both sockets.
So I switched to Windows 11 and indeed, my Julia code now runs on all 48 cores/96 threads. However, the performance is poor and varies a lot from run to run.
To say a little more about my code: it is basically a Monte Carlo simulation in which a particle moves through a system of compartments. I have a walk(particle_id) function that moves the particle between the different compartments and returns the list (a Vector{Vector{Float64}}) of the compartments the particle has entered.
The parallelization of my code consists of running this same walk(particle_id) function for many particles, so I have the following loop:
particle_path_list = Vector{Vector{Vector{Float64}}}(undef, number_of_particle)
@time Threads.@threads for particle_id in 1:number_of_particle
    particle_path = walk(particle_id)
    particle_path_list[particle_id] = particle_path
end
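For reference, here is a stripped-down, self-contained version of the same pattern that can be run on its own. The walk here is just a dummy placeholder (random entries, arbitrary length); the real one does the actual Monte Carlo moves and compartment bookkeeping.

# Dummy walk: returns a list of "compartment entries" for one particle.
# (Placeholder only; the real walk implements the Monte Carlo moves.)
function walk(particle_id)
    path = Vector{Vector{Float64}}()
    for _ in 1:1000                      # arbitrary number of compartment changes
        push!(path, rand(3))             # one entry per compartment visited
    end
    return path
end

number_of_particle = 1000
particle_path_list = Vector{Vector{Vector{Float64}}}(undef, number_of_particle)
@time Threads.@threads for particle_id in 1:number_of_particle
    particle_path_list[particle_id] = walk(particle_id)
end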
On my laptop (Intel Core i7-12800H), it takes about 8 s to simulate number_of_particle = 1000, about 14 s for number_of_particle = 2000, and so on, and the times stay consistent from one run to the next. On my workstation, the results vary a lot from one run to another and can go up to 2 min for 1000 particles.
Typically on my workstation, with Threads.nthreads() = 96:
106.523241 seconds (145.48 M allocations: 8.471 GiB, 14.77% gc time, 3.52% compilation time)
105.975535 seconds (145.58 M allocations: 8.476 GiB, 15.56% gc time, 0.89% compilation time)
The problem seems to remain the same even when I manually switch to, say, 46 threads. The same code on my laptop, with Threads.nthreads() = 20:
8.115906 seconds (146.15 M allocations: 8.509 GiB, 57.08% gc time, 6.72% compilation time)
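(For completeness: the thread counts above are set when Julia is launched, roughly like this, and then checked inside the session.)

# Launch with an explicit thread count (or set JULIA_NUM_THREADS beforehand):
#   julia -t 46
# Then, inside Julia:
Threads.nthreads()   # reports the number of threads the session was started with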
I think the problem comes from how Windows assigns the tasks to the different threads. What do you think? What could I do to fix this? Is Threads.@threads really the best way to parallelize in my case?
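One alternative I was wondering about is spawning one task per thread, each handling a contiguous block of particles, on the assumption that fewer, longer-lived tasks might play more nicely with the scheduler and the GC. This is only a sketch (the chunking arithmetic is mine and untested on the real code):

# Sketch: split the particles into nthreads() contiguous chunks and spawn one
# task per chunk; each task writes to disjoint indices, so no locking is needed.
function run_chunked(number_of_particle)
    particle_path_list = Vector{Vector{Vector{Float64}}}(undef, number_of_particle)
    chunk_size = cld(number_of_particle, Threads.nthreads())
    tasks = map(Iterators.partition(1:number_of_particle, chunk_size)) do chunk
        Threads.@spawn for particle_id in chunk
            particle_path_list[particle_id] = walk(particle_id)
        end
    end
    foreach(wait, tasks)                 # wait for all chunks to finish
    return particle_path_list
end

Would something like this be expected to behave any differently on a dual-socket Windows machine, or is there a better pattern?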
Thanks!