Huge performance fluctuations in parallel benchmark: insights?

lmiq · September 9, 2021, 6:58pm

I have the following scenario. The function I am evaluating uses multi-threading, and it takes 5 to 6s to finish. But sometimes, to do exactly the same thing, it takes 2, 3 times that time. The machine is not running anything else. I am using @threads, and I suspected initially that one of the threads could have stalled for a while, such that I tried using @spawn to allow for a uneven distribution of the tasks. The distribution became uneven, but this behavior was not suppressed. Any insights?

For example:

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  4.728705 seconds (754.30 k allocations: 639.686 MiB, 1.85% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  8.343879 seconds (754.29 k allocations: 639.686 MiB, 1.69% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
 14.456259 seconds (754.30 k allocations: 639.686 MiB, 0.34% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
 11.053437 seconds (754.30 k allocations: 639.686 MiB, 1.57% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  4.808746 seconds (754.30 k allocations: 639.686 MiB, 3.78% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

If someone has any will or reason to inspect this further, to reproduce the example, do:

julia> add CellListMap

julia> using CellListMap

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)

The parallel loop, now, is structured as:

    @threads for threadid in 1:nthreads() 
        for i in splitter(threadid, n_cells_with_real_particles)
            cellᵢ = cl.cells[cl.cell_indices_real[i]]
            output_threaded[threadid] = 
                inner_loop!(f, box, cellᵢ, cl, output_threaded[threadid], threadid) 
            show_progress && next!(p)
        end
    end

thus I’m threading all the work to each thread at once, reason why I suspected that if one thread gets stalled for some time, everything slows down. But I changed that to:

    @sync Threads.@spawn for i in 1:n_cells_with_real_particles
        threadid = Threads.threadid()
        cellᵢ = cl.cells[cl.cell_indices_real[i]]
        output_threaded[threadid] = 
            inner_loop!(f, box, cellᵢ, cl, output_threaded[threadid], threadid) 
    end

But the performance got much worse (30s). Probably I’m not doing something right there (or the work performed by each thread is too fast), although the distribution of the tasks on each thread became uneven (which I was expecting that could solve the “stalled” thread problem, if that was the case).

gbaraldi · September 9, 2021, 7:26pm

Checking something like htop can you see if any threads stall? Also if you call GC.gc() between runs, do you see the same behaviour?

lmiq · September 9, 2021, 7:42pm

No, for both questions. I can’t see any pattern related to GC, no.

One thing I noticed how is that the computer has 16 physical cores, and allows 32 threads to run. The threads are run into any of the 32 available threads. I don’t know if that has any performance implications.

julia> GC.gc()

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
 12.337934 seconds (754.29 k allocations: 639.686 MiB, 1.05% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> GC.gc()

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  4.566190 seconds (754.29 k allocations: 639.686 MiB, 0.97% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> GC.gc()

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
 10.245200 seconds (754.29 k allocations: 639.686 MiB, 0.82% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> GC.gc()

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  9.696066 seconds (754.29 k allocations: 639.686 MiB, 1.31% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

julia> GC.gc()

julia> @time CellListMap.florpi(N=2_000_000,cd=false,parallel=true)
  4.630294 seconds (754.29 k allocations: 639.686 MiB, 0.96% gc time)
(false, [3.389763410292043e-5, 8.430492327590355e-6, -2.056429147826521e-5, 1.165175566831777e-5, -2.8093953887987037e-6])

Henrique_Becker · September 9, 2021, 9:52pm

Some advice from someone a little paranoid about this kinda of noise.

Physical cores and “virtual cores” are often distinct. A machine can have 16 physical cores but each physical core have two “virtual” cores. This is relatively common. In such cases, these two virtual cores inside the same physical will share at least the first levels of cache, that is never shared between physical cores. If you are using linux, what is the output of lscpu?
I had problems with a server that used “bulldozer” technology, this is a CPU technology that allows for two virtual cores inside each physical core and they also share the floating-point part of the CPU, in other words, if my computation was floating point intensive (like solving a linear model with simplex) then it was better to use the number of physical cores instead of virtual cores, because two threads fighting for memory and floating-point operations were less effective than just one not having to fight with a sibling thread.

lmiq · September 9, 2021, 11:33pm

As far as I understand, I have 16 physical cores (32 virtual). I’m running with julia -t 8. This is what lspcu says:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2999.851
BogoMIPS:              5205.20
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
leandro@p001:~%

This is how the time depends on the problem size (number of particles):

10000 0.004079152
50000 0.021795701
100000 0.04603988
200000 0.11528972
300000 0.20175428
400000 0.28964232
500000 0.42249202
600000 0.56320058
700000 0.77138363
800000 0.96019080
900000 1.16920708
1000000 1.3880650
1500000 2.7459242
2000000 12.991955
3000000 25.495138

As one can see, about 5-6s would be the expected time for 2M particles (and 13s for the 3M particles). However, I often get this outlier long running times for these two cases (not for the others, which rarely fall out of the curve).

tkf · September 10, 2021, 12:04am

lmiq:

    @sync Threads.@spawn for i in 1:n_cells_with_real_particles
        threadid = Threads.threadid()
        cellᵢ = cl.cells[cl.cell_indices_real[i]]
        output_threaded[threadid] = 
            inner_loop!(f, box, cellᵢ, cl, output_threaded[threadid], threadid) 
    end

@sync @spawn expression is a slow version of expression (more or less) and it doesn’t introduce any parallelism to your code.

Also, mixing output_threaded[Threads.threadid()] and Threads.@spawn is very likely wrong. See: FAQ · FLoops

You’d probably want something like this

    @sync for taskid in 1:nthreads() 
        Threads.@spawn for i in splitter(taskid, n_cells_with_real_particles)
            cellᵢ = cl.cells[cl.cell_indices_real[i]]
            output_threaded[taskid] = 
                inner_loop!(f, box, cellᵢ, cl, output_threaded[taskid], taskid) 
            show_progress && next!(p)
        end
    end

lmiq · September 10, 2021, 12:36am

Thanks. The behavior of @spawn is something that admittedly I don’t understand quite right (that is why I’m not using it…).

In that case, isn’t it possible that two tasks write to the same output_threaded[taskid]? What guarantees there that two processes are not writing at the same time to the same output_threaded element?

(as a side comment, I changed the code to that now and recovered the same behavior as I’m getting with @threads, that is, same performance and same performance fluctuations).

And yes, I’m aware of the possible problems of using threadid() and spawn in combination (I’m following all your comments here, thanks, ), another reason I’m not using it.

dlakelan · September 10, 2021, 1:02am

No because each spawn gets a different taskid because they’re spawned inside the outer loop.

tkf · September 10, 2021, 1:07am

Yes, the pattern

@sync for taskid in 1:ntasks
    Threads.@spawn f(taskid)
end

behaves much more nicely than threadid because it ensures each task has a unique taskid. It’s just a plain local variable inside the task. So it won’t change and is not shared with other tasks.

Also, I think determining performance fluctuation would be very hard when there’s any allocation (even if it’s a very tiny one). Calling GC.gc before benchmark doesn’t help because when the GC tries to stop world with respect to other task is non-deterministic.

ImreSamu · September 10, 2021, 5:27am

1.)
Is this a Cloud or a “private server”?

For example, some Cloud providers selling shared vCPU.

in the Hetzner ( https://www.hetzner.com/cloud )
has a

“default” vCPU
- not mentioning, but this is shared with other users;
“Dedicated vCPU Instances: Maximum performance with dedicated vCPUs. Ideal for CPU intensive applications like highly frequented web servers, video encoding, machine learning or research calculations.”

2.)
My second idea is the NUMA increased synchronization costs.

https://github.com/JuliaLang/julia/search?q=NUMA&type=issues

NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

johnh · September 10, 2021, 6:14am

lscpu is a good tool to show your CPU types and cores
Even better use hwloc / lstopo

https://www.open-mpi.org/projects/hwloc/lstopo/
For linux yum install hwloc or apt-get install hwloc

johnh · September 10, 2021, 6:16am

If this is a physical machine which you have access to have you tried disabling hyperthreading in the BIOS?

This article is good too

lmiq · September 10, 2021, 10:57am

I see, but probably there is a missunderstanding. In my case I have an output which has nthreads() copies of the data, not ntasks copies, for threading.

This is a simplified version of your suggestion above:

    output_threaded = [ deepcopy(output) for _ 1:nthreads() ]
    @sync for thread_id in 1:nthreads() 
        Threads.@spawn for i in 1:ntasks
            output_threaded[thread_id] = inner_loop!(...) 
        end
    end

This is wrong, isn’t it? I can have two threads writing to the same output_threaded position in this case, wright?

It seems to me that I would need something like:

    output_threaded = [ deepcopy(output) for _ 1:nthreads() ]
    @sync for i in 1:ntasks
        Threads.@spawn begin
            thread_id = threadid()
            output_threaded[thread_id] = inner_loop!(...) 
        end
    end

but I know that is is not a good pattern for the reasons you pointed above.

(@tkf the parallel code does not allocate. But I will separate it to make things clearer)

Thanks @ImreSamu and @johnh. Those are good advices. It is a private cluster, so I will try these suggestions.

carstenbauer · September 10, 2021, 2:12pm

Note that we have Hwloc.jl, which also downloads the library via Hwloc_jll.jl.

Depending on the details of the machine, just setting JULIA_EXCLUSIVE=1 might do the job (i.e. pin julia threads to physical cpus). Otherwise, numactl and likwid-pin could be options, see this discussion for more information.

dlakelan · September 10, 2021, 2:30pm

lmiq:

    output_threaded = [ deepcopy(output) for _ 1:nthreads() ]
    @sync for thread_id in 1:nthreads() 
        Threads.@spawn for i in 1:ntasks
            output_threaded[thread_id] = inner_loop!(...) 
        end
    end

no that’s not what he said. it’s more like:

    output_threaded = [ deepcopy(output) for _ 1:nthreads() ]
    @sync for thread_id in 1:nthreads()  
        for i in 1:ntasks
        Threads.@spawn
            output_threaded[i] = inner_loop!(give_me_work_for_task(i)) 
        end
    end

lmiq · September 10, 2021, 2:38pm

Yes, I got that. But that does not work in my case, because output_threaded is a vector of length nthreads(), not ntasks.

lmiq · September 10, 2021, 2:42pm

Interesting, this is what I get from topology():

julia> topology()
Machine (31.97 GB)
    Package L#0 P#0 (15.97 GB)
        NUMANode (15.97 GB)
        L3 (20.0 MB)
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#0 P#0
                PU L#0 P#0
                PU L#1 P#16
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#1 P#1
                PU L#2 P#1
                PU L#3 P#17
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#2 P#2
                PU L#4 P#2
                PU L#5 P#18
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#3 P#3
                PU L#6 P#3
                PU L#7 P#19
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#4 P#4
                PU L#8 P#4
                PU L#9 P#20
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#5 P#5
                PU L#10 P#5
                PU L#11 P#21
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#6 P#6
                PU L#12 P#6
                PU L#13 P#22
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#7 P#7
                PU L#14 P#7
                PU L#15 P#23
    Package L#1 P#1 (16.0 GB)
        NUMANode (16.0 GB)
        L3 (20.0 MB)
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#8 P#0
                PU L#16 P#8
                PU L#17 P#24
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#9 P#1
                PU L#18 P#9
                PU L#19 P#25
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#10 P#2
                PU L#20 P#10
                PU L#21 P#26
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#11 P#3
                PU L#22 P#11
                PU L#23 P#27
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#12 P#4
                PU L#24 P#12
                PU L#25 P#28
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#13 P#5
                PU L#26 P#13
                PU L#27 P#29
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#14 P#6
                PU L#28 P#14
                PU L#29 P#30
            L2 (256.0 kB) + L1 (32.0 kB) + Core L#15 P#7
                PU L#30 P#15
                PU L#31 P#31

It seems that in this case it would be reasonable to choose threadas from 0 to 7, which would all share the same memory (I guess this is good in this case, since the threads have to access shared arrays of data). Makes sense?

dlakelan · September 10, 2021, 2:47pm

but ntasks = nthreads

The point of this pattern is to @spawn off one task per available thread, but to give the task a STATIC identifier that doesn’t change as the scheduler moves the task from thread to thread, thereby giving it a fixed place in memory to store its private computational info.

carstenbauer · September 10, 2021, 3:00pm

Yes, I would think so. As you can see, and as is typically the case, the P numbering of the logical processing units (PU) is so that the first 0-N are in different cores and N+1:2N are the “hyperthread units”. So, as I said above, often you should get away with just setting JULIA_EXCLUSIVE=1 which, IIRC, will pin N Julia threads to the the first N cores / processing units.

(Note that I’m writing this down from the top of my head, so it might be good to check again )

lmiq · September 10, 2021, 3:01pm

Uhm… admittedly I don’t understand how those patterns are delivering that, and how that can avoid two tasks to write over the same output concurrently. I’ll have to think about that more deeply…