Performance problem on dual socket Xeon under Windows 11

Hello,

I have a question about parallelization in Julia.

I am working on a Dell dual-socket Xeon Gold 5220R workstation (2x24 cores, 2x48 threads) running Windows 11.

I was initially on Windows 10, but as this older post ( @threads uses only half the number of nthreads() - #13 by Chris_Green ) explains, Windows 10 uses “Processor Groups” (Processor Groups - Win32 apps | Microsoft Learn), which (roughly speaking) prevent Julia from using both sockets. Even though Julia sees all cores/threads (Threads.nthreads() does return 96 = 2x48), the code only runs on 24 cores/48 threads. As mentioned in that thread, this “Processor Group” mechanism seems to have been abolished (finally…) in Windows 11, in favour of a “Primary Group” that spans both sockets.

So I switched to Windows 11 and indeed, my Julia codes now run on the 48 cores/96 threads. However, the performance is not at all good and is quite random.

To say a little more about my code: it is basically a Monte Carlo simulation in which a particle moves through a system of compartments. I have a walk(particle_id) function that moves the particle between the different compartments and returns the list (a Vector{Vector{Float64}}) of the compartments the particle has entered.
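
Schematically, walk has roughly this shape (illustrative only: the real compartment logic is more involved):

function walk(particle_id)
    # particle_id would seed/parameterize the real walk; unused in this sketch
    path = Vector{Vector{Float64}}()    # compartments entered, in order
    pos = rand(3)                       # illustrative initial position
    for _ in 1:1000                     # illustrative number of steps
        pos = pos .+ 0.01 .* randn(3)   # move: allocates a new vector each step
        push!(path, pos)                # record the compartment entry
    end
    return path
end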

The parallelization of my code consists of running this same walk(particle_id) function for many particles. I thus have the following loop:

particle_path_list = Vector{Vector{Vector{Float64}}}(undef, number_of_particle)

@time Threads.@threads for particle_id in 1:number_of_particle
    particle_path = walk(particle_id)
    particle_path_list[particle_id] = particle_path
end

On my laptop (Intel Core i7-12800H), it takes about 8 s to simulate number_of_particle = 1000, 14 s for number_of_particle = 2000, and so on, and the times remain consistent from one run to the next. On my workstation, the results vary a lot from one run to another and can reach 2 min for 1000 particles.

Typically on my workstation, with Threads.nthreads() = 96:

106.523241 seconds (145.48 M allocations: 8.471 GiB, 14.77% gc time, 3.52% compilation time)

105.975535 seconds (145.58 M allocations: 8.476 GiB, 15.56% gc time, 0.89% compilation time)

Note that the problem seems to remain the same when I manually reduce the count, to 46 threads for example. The same code on my laptop, with Threads.nthreads() = 20:

8.115906 seconds (146.15 M allocations: 8.509 GiB, 57.08% gc time, 6.72% compilation time)

I think the problem comes from the way Windows assigns the tasks to the different threads. What do you think? What could I do to fix this problem? Is Threads.@threads really the best way to parallelize in my case?
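
In case it helps diagnose: since Julia 1.8 the GC pauses can be logged with GC.enable_logging, so I could check whether the pauses themselves dominate. Something like:

GC.enable_logging(true)    # print one line per GC pause (Julia ≥ 1.8)
@time Threads.@threads for particle_id in 1:number_of_particle
    particle_path_list[particle_id] = walk(particle_id)
end
GC.enable_logging(false)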

Thanks!


Side comment: I’ve made some initial steps towards Windows support over in Windows support by carstenbauer · Pull Request #29 · carstenbauer/ThreadPinning.jl · GitHub. If, by any chance, you’re interested in this kind of thing, it would be great to get your help over there, in particular because you have a dual-socket Windows system (which I don’t and thus can’t test on).
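
For context: on Linux, the package already lets you pin Julia threads explicitly. A minimal sketch (see the package docs for the full list of pinning strategies):

using ThreadPinning
pinthreads(:cores)   # pin each Julia thread to a separate physical core
threadinfo()         # print the resulting thread-to-core mapping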


Several potential problems:

  • Too many allocations and heavy pressure on the GC.
  • It is usually advised to initialize the different pieces of data with the same thread pattern that will be used during the computation (first-touch: the data then ends up on the RAM banks closest to the cores that will work on it).

Is it possible to preallocate all your data before the computation by imposing maximum sizes for the different dimensions? Something like:

particle_path_list = Vector{Vector{Vector{Float64}}}(undef, number_of_particle)

@time Threads.@threads for particle_id in 1:number_of_particle
    particle_path_list[particle_id] = [zeros(MAX1) for _ in 1:MAX2]
end

@time Threads.@threads for particle_id in 1:number_of_particle
    walk!(particle_path_list[particle_id])
end
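
(Indexed assignment is used here rather than push!, because push!-ing to a vector shared between threads is not thread-safe.)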

Probably irrelevant here, since the timings “vary a lot” on the workstation, but your laptop CPU might also have faster caches than the Xeon.


Update:
I installed Ubuntu 22.04 in dual boot on my machine to check whether the problem really came from Windows, and I get the same result under Ubuntu. Apparently, the computation time is minimal when I manually set the number of threads to about 20, and then increases as I add more threads. So I blamed Windows a bit too quickly :sweat_smile:


I tried commenting out the line “particle_path_list[particle_id] = particle_path” (so I no longer write the results to the vector), but it doesn’t change anything. I will keep this advice in mind though, thanks!

That seems consistent, because particle_path is a vector of vectors that is actually created (and allocated) inside the walk function.

My suggestion was a bit different, and maybe cumbersome (or impossible) to implement: replacing the walk function with a new walk! function that does not allocate would show whether GC pressure is the problem.
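
As a minimal sketch of what I mean (assuming each inner vector stores a position of length 3, i.e. MAX1 = 3, and that the buffers were preallocated as above):

function walk!(path::Vector{Vector{Float64}})
    pos = zeros(3)                      # one small allocation per call
    for step in eachindex(path)
        for k in eachindex(pos)
            pos[k] += 0.01 * randn()    # scalar randn avoids a temporary array
        end
        copyto!(path[step], pos)        # fill the preallocated slot in place
    end
    return path
end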