Decrease in performance using Threads.@threads in Linux

I am attempting to parallelize this code

function pdiff(b1::Array{Basic,1}, b2::Basic)
    dfdx = Array{Basic}(undef,length(b1))
    Threads.@threads for i = 1:length(b1)
        a = Basic()
        ret = ccall((:basic_diff, libsymengine), Int, (Ref{Basic}, Ref{Basic}, Ref{Basic}), a, b1[i], b2)
        dfdx[i] = a
    end
    return dfdx
end

On my laptop, the same Julia version, but runs on Windows and Intel skylake CPU. I saw an O(n^1/2) improvement with the number of threads, but on my desktop, I saw a decrease in performance with adding cores.
this might be helpful
desktop:

julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 44

image

Any help would be greatly appreciated.

You can see the performance here https://github.com/symengine/SymEngine.jl/issues/166#issuecomment-511548272

Do you really have 24 cores? This spec sheet: https://ark.intel.com/content/www/us/en/ark/products/81908/intel-xeon-processor-e5-2680-v3-30m-cache-2-50-ghz.html
Says your cpu only has 12 cores, so 24 threads?

Did you run top or htop while running? Did all the threads hit 100% or close to 100%?

I will stick my neck out here. As @pixel27 says you should look at whether or not hyperthreading is enabled or disabled on your machine. (By you I mean everyone in the audience here).
In HPC the tradiational message is that hyperthreading does not help performance.
If you think about it, in a good code a CPU core should be running at 100% (or near that). Context switching to another task is only going to hinder things.
Put crudely, HT is good when there are bursty workflows - such as me typing right now. In the gaps between the letters being input and rendered the CPU can switch to other tasks.
However Your Mileage May Vary - so as the original poster has done you do benchmarking.
Another tip from HPC - use the NUMA tools to pin processes to CPUs. Do not rely on the OS scheduler to schedule tasks - if the task is moved about you get stalls and cache flushes.

The other thing to look at with a threaded program is how much ‘work’ each thread is getting. You need to send a ‘decent’ amount of work to each thread.
Micro-optimising by threading a very tight inner loop might actually hinder performance.
Again in crude terms you need to give the CPU core a decent amount of work and let it get on with it - and make sure the other CPU cores have their work to do.

2 Likes

Looking at @amcostant code I would say the @threads is in the ‘correct’ place.
However there is a ccall in there - I am absolutely not experienced enough to comment on how Julia interacts with C calls in this area.
Expert to the bridge please.

@pixel27 I saw all the threads hitting close to 100%


I have 2 physical CPU’s in my computer.

@johnh would you suggest I turn off hyperthreading? My laptop has only 2 cores but improves performance when using 4 threads vs 2 threads. Also, I was the one who posted those benchmarking results in the linked thread, sorry if that was not clear. On the topic of NUMA, how would I pin the Julia processes to the CPU with NUMA tools, thanks.

I thought this might help:

(base) adrian@adrian-WS:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64316 MB
node 0 free: 56472 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64506 MB
node 1 free: 55829 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
(base) adrian@adrian-WS:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:            2
CPU MHz:             1200.222
CPU max MHz:         3300.0000
CPU min MHz:         1200.0000
BogoMIPS:            5000.45
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-11,24-35
NUMA node1 CPU(s):   12-23,36-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
(base) adrian@adrian-WS:~$

The CPU history graph is very interesting…Do you know what the trough’s are? Where those caused by sequential runs? Or did they occur during a single run?

If it’s a single run, then it might be some sort of synchronization issue like one of the basic_diff calls is taking significantly longer than the others so the threads mostly go idle waiting for that one call to complete.

Something I would try would be something like this:

function pdiff(threads::Int, b1::Array{Basic,1}, b2::Basic)
    dfdx = Array{Basic}(undef,length(b1))
    Threads.@threads for i = 0:threads-1
        for j = i*threads:max((i+1)*threads, length(b1))
            a = Basic()
            ret = ccall((:basic_diff, libsymengine), Int, (Ref{Basic}, Ref{Basic}, Ref{Basic}), a, b1[j+1], b2)
            dfdx[j+1] = a
       end
    end
    return dfdx
end

I’m not sure if the @threads breaks the array up in this way or not. But this will give each thread a “chunk” of the array to process and any synchronization should happen at the end. The other thing to possibly try is:

function pdiff(threads::Int, b1::Array{Basic,1}, b2::Basic)
    dfdx = Array{Basic}(undef,length(b1))

    local offsets = Channel() do c
       for i in 1:length(b1)
          put!(c, i)
       end
    end    

    Threads.@threads for i = 1:threads
        for j in offsets
            a = Basic()
            ret = ccall((:basic_diff, libsymengine), Int, (Ref{Basic}, Ref{Basic}, Ref{Basic}), a, b1[j], b2)
            dfdx[j] = a
        end
    end
    return dfdx
end

Please note: Both these methods are trying to “out think” whatever @threads normally does to divide up the work which generally should be avoided. Also I haven’t tested these so I might have messed up the logic and/or fat fingered the code.

The first one should work well if the length of b1 is evenly divisible by the number of threads, and each call to basic_diff takes on average the same amount of time.

The second one should work better if the times taken by basic_diff are wildly different and hopefully one of those long ones isn’t at the end of the list. If the durations of basic_diff are wildly different you might also try reversing the order the numbers are placed in the channel, or even randomizing them.

1 Like

@pixel27 The gaps were from multiple runs, not breaks during running the command.
Both of the suggested codes do not work, the first one doesn’t stop running, and the other segfaults.
I tried to see if each core having its own array to work off would help, but it was about 1.8x slower compared to single-core broadcasting. I know I am not returning anything right now but I wanted to see if the total computation time decreased. currently running on 40 threads.

function pdiff(b1::Array{Basic,1}, b2::Basic)
    threads = Threads.nthreads()
    array_size = ceil(Int,length(b1)/threads)
    dfdx = Array{Basic}(undef,length(b1))
    len_b1 = length(b1)
    Threads.@threads for n = 1:threads
        b1_temp = b1[(n-1)*array_size+1 : n*array_size]
        b2_temp = b2
        dfdx_temp = Array{Basic}(undef,length(b1_temp))
        for i in 1:length(b1_temp)
            a = Basic()
            ret = ccall((:basic_diff, libsymengine), Int, (Ref{Basic}, Ref{Basic}, Ref{Basic}), a, b1_temp[i], b2_temp)
            dfdx_temp[i] = a
        end
    end
end

and as expected this preformed identically

function pdiff1(b1::Array{Basic,1}, b2::Basic)
    threads = Threads.nthreads()
    array_size = ceil(Int,length(b1)/threads)
    dfdx = Array{Basic}(undef,length(b1))
    len_b1 = length(b1)
    Threads.@threads for n = 1:threads
        b1_temp = b1[(n-1)*array_size+1 : n*array_size]
        b2_temp = b2
        dfdx_temp = diff.(b1_temp,b2_temp)
    end
end

@amcostant I recently was asked to do a benchmarking exercise - I could not switch off hyperthreaeidng on my laptop in the BIOS. This surprised me as I am more accustomed to working with Xeon cpus on servers.

Regarding CPU pinning I should have made it clear I was referring to processes and not threads.
Pinning threads - ah… yes… I plead the fifth amendment… or would do so if I was a US citizen.

@johnh Im sorry, but I have not worked with NUMA tools before so I was very confused what you meant. If you could explain the command that would help that would be great.
I turned off hyperthreading and saw an increase in time of about 1.4x for single thread broadcasting, and about 1.5x increase in time when using the parallel code at the top of the page using “JULIA_NUM_THREADS=20” so same number of cores but no hyperthreading.
also, the CPU workload looks like this:


ignore the part before all the cores startup, it was from loading Julia.

I noticed an improvement with all of the cores of one socket turned off, even though it had fewer cores it ran faster, both broadcast and parallel. I think there may be an issue with how the memory is stored/shared between the two sockets which is causing delays.

1 Like

Hello @amcostant
To begin with let us look at the physical layour of your server. I would advise everyone to use this very useful utility: https://www.open-mpi.org/projects/hwloc/
Assuming you are using Linux use yoru package manager to install hwloc and hwloc-gui

The run this utility: lstopo
This wil show the layout of CPU cores, the caches , the graphics devices and IO Cards.
Very useful!

2 Likes

Then install this package: numactl
The command numactl --hardware will show the relationship between the NUMA nodes on your system.
In your case it should be a two socket system, therefore there are two NUMA nodes.

(base) adrian@adrian-WS:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64337 MB
node 0 free: 58387 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64485 MB
node 1 free: 56791 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
(base) adrian@adrian-WS:~$

Is there any other information that you would need?

Oh, that’s interesting, so it sounds like whatever synchronization is happening because it has to do it across CPUs is actually hurting performance.

Something to try would be running julia using no more than 24 threads with the taskset command to bind the process to a CPU…I’m assuming you would get comparable performance to when you disabled one of the CPUs.

Something you could try is using using Distributed Processing model to create 48 or so processes and distribute the load that way…Or you could try to create only 2, each with 24 threads, and split the load in half…but you might need to use taskset to bind each process to it’s own CPU…

Sounds like fun…maybe I need to look into building a multi-cpu rig…

I think it might be an issue with either SymEngine or ccall and how it stores the memory as the performance increases with additional threads when running the CPU part of this , https://juliagpu.gitlab.io/CuArrays.jl/tutorials/generated/intro/ . I would not be surprised that ccall or SymEngine is treating the RAM as one block instead of being NUMA aware.

@amcostant I am just introducing some tools which show the logical layout of a system. I think most people love the output of hwloc!
Try running it on a single socket system, such as your laptop. You will only see one NUMA domain.
@pixel27 has beaten me to it with the next step - using taskset to pin processes/threads

I would look at numactl also. You were suspecting that there is a performance loss perhaps due to accessing memory attached to the other socket.
The numactl options can be tried here
localalloc always allocate on the local node
preferred=node preferably allocate on node

And another useful command numastat
Show per-NUMA-node memory statistics for processes and the operating system

My intuition says that you are on the right track by asking what ccall does and how that affects the performance. However the scientist in me says Intuition.jl is not a good package to work with.