Poor performance on cluster multithreading

Thank you very much for the advice!
I might have underestimated the impact of allocations on the performance. I will read through the performance optimization guide and try to avoid allocations as much as I can.

Agreed. Sorry for being imprecise. I meant that even if this is fixed by using randjump to get separate generators under Julia 0.6, there is still, in my experience, a risk of false sharing.

1 Like

Correct, and it’s actually measurable. The different RNGs have to be allocated on each thread, instead of all being allocated on the same thread.

2 Likes

You can solve the problem like this:

using Compat, Compat.Random

# One MersenneTwister per thread.
const twisters = [MersenneTwister() for i ∈ 1:Threads.nthreads()];

# Multithreaded: each thread adds noise to its own matrix using its own RNG.
function tf!(x::Vector{Matrix{Float64}}, N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = twisters[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end
3 Likes

This will actually have false sharing. You need to allocate the RNG from each thread.
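For illustration, a minimal sketch of that approach, assuming Julia 0.6-era syntax and the default static @threads schedule; make_thread_rngs is just an illustrative name. This only changes where the RNG states get allocated, it does not by itself make the streams non-overlapping:

using Compat, Compat.Random

function make_thread_rngs()
    rngs = Vector{MersenneTwister}(Threads.nthreads())  # Julia 0.6 syntax; use `undef` on 0.7+
    Threads.@threads for t in 1:Threads.nthreads()
        # Each thread constructs its own generator, so its state is
        # allocated from that thread's pool rather than all on thread 1.
        rngs[Threads.threadid()] = MersenneTwister()
    end
    return rngs
end

const twisters = make_thread_rngs()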

3 Likes

Actually there are two problems with this code:

  1. MersenneTwister() does not ensure that the PRNGs are non-overlapping (the risk of overlap is very small, but this is the reason why randjump is provided).
  2. As discussed above, initializing all the PRNGs in a single thread risks false sharing. The simplest fixes are either to deepcopy, in each thread, the element (from an array of PRNGs initialized in a single thread) that this thread will use, or to use a “separator” object (one PRNG that is allocated and discarded just to separate the per-thread data), similarly to what I proposed in KissThreading.jl/KissThreading.jl at master · mohamed82008/KissThreading.jl · GitHub; see the sketch after this list. The discarding approach might not be best if you have NUMA issues in your infrastructure.
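A minimal sketch of the first (copy-per-thread) option, assuming the Julia 0.6 randjump API, which returns a vector of jumped, non-overlapping generators; the function name is just illustrative:

using Base.Threads
import Base.Random: randjump   # Julia 0.6 location of randjump

# Non-overlapping generators, all created on the main thread.
const jumped = randjump(MersenneTwister(1234), nthreads())

function add_noise!(x::Vector{Matrix{Float64}})
    @threads for t in 1:nthreads()
        # copy() the jumped generator on the thread that uses it (copy of a
        # MersenneTwister duplicates its state), so the working state is
        # allocated locally and false sharing is avoided.
        rng = copy(jumped[t])
        xt = x[t]
        for j in eachindex(xt)
            xt[j] += randn(rng)
        end
    end
    return x
end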
2 Likes

I see, then I was mistaken. For some reason, though, I also remember that in my case allocations supposedly took up around 5-10%, yet reducing them yielded a huge speedup. It felt as if, when the GC hit, it would go through all the small chunks that each of the 64 threads had allocated separately (going through the small chunks allocated on each loop iteration rather than each thread’s total allocated chunk), and thus take a very long time. I’m not sure that is exactly what was going on, though; I might be completely wrong.

1 Like

This is correct. However, that’s exactly what takes 5% of the time!

Reducing allocation can certainly speed things up by a lot. However, at most 5% of that will come from the GC in this case. It’s correct to say that the GC percentage is not an accurate representation of how much allocation is hurting you (it was never meant to be, and isn’t even called that), but the reason for that is unrelated to what the GC does.

2 Likes

That’s why I said you should just allocate the RNG on the thread you are going to use it on. This issue will be automatically taken care of.

1 Like

I see, thank you for the explanation!

1 Like

Time for an experiment:

using Base.Threads             # for nthreads() and @threads
import Base.Random: randjump   # Julia 0.6 API: returns a vector of jumped generators

function runme(N=10^8)
    nt = nthreads()
    mt0 = MersenneTwister(1234)
    rngs = randjump(mt0, nt)
    av = zeros(1024,nt)
    @threads for it=1:nt
        # rng = rngs[it]          # version 1, global alloc
        rng = copy(rngs[it])      # version 2, threadwise alloc
        a=0.0
        for i=1:N
            a += (rand(rng) - 0.5)
        end
        av[1,it] = a
    end
    av[1,:]
end

(Note that copy on a MersenneTwister is implemented as a deep copy. Striding av is pointless here, but would be helpful if av were written to in the hot loop.)

On my system, with 8 threads on 8 cores:
Version 1: 1.869167 seconds (61 allocations: 122.984 KiB)

Version 2: 0.422015 seconds (101 allocations: 174.734 KiB)

Version 1 is occasionally faster, but I saw good consistency with Version 2.

I am surprised how much this matters, given the large size of the Mersenne Twister state vector.

2 Likes

I’m sure others have mentioned this, but I believe that rand() is not thread-safe and causes problems if multiple threads call it, unless you specify a separate RNG for each thread.

Take a look at this thread for an example of multi-threaded random number generation.
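To make the unsafe pattern concrete, here is a hypothetical minimal sketch (not the example from the linked thread): every thread below mutates the state of the single global RNG behind rand().

using Base.Threads

function unsafe_threaded_sum(N)
    totals = zeros(nthreads())
    @threads for t in 1:nthreads()
        acc = 0.0
        for i in 1:N
            acc += rand()    # all threads share the one global RNG state
        end
        totals[t] = acc
    end
    return sum(totals)
end

Passing a per-thread RNG explicitly, e.g. rand(rngs[threadid()]) with one generator per thread as shown earlier in this thread, avoids the problem.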

1 Like

Now you know how much hardware vendors “cheat” and how slow the hardware actually is for a serial program accessing memory. Even 1% of a >200x slowdown can easily give you a significant global slowdown.

2 Likes

Cool. I updated the comment in the thread I linked and, besides the obvious advantage of actually being correct / independently random, saw roughly a 50% speedup from using TRNG to avoid false sharing.
This was on a Threadripper, which may have some NUMA issues (two separate dies in one physical processor; latency is higher between them[1]). But the end result was about a 20x improvement over single-threaded, which seems really good for 16 physical (32 logical) cores.

Oddly, I see a significant slowdown from merely hoisting the getindex out of the loop (without also copying the RNG):


# tmap! and TRNG (a vector of per-thread RNGs) come from the setup in the thread linked above.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    tmap!(x, x) do x
        id = Threads.threadid()
        n = Threads.nthreads()
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(TRNG[Threads.threadid()])
        end
        x
    end
    nothing
end

vs


function tfc!(x::Vector{Matrix{Float64}},N::Int64)
    tmap!(x, x) do x
        id = Threads.threadid()
        n = Threads.nthreads()
        # rng = TRNG[Threads.threadid()]       # hoisting the getindex alone is the slow variant
        rng = copy(TRNG[Threads.threadid()])   # copying on this thread is fast
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(rng)
        end
        x
    end
    nothing
end

while the version that relies on copy and the version that indexes TRNG inside the loop are about equally fast.
Both are much faster than relying on a regular randjump.

[1] NUMA NUMA: Infinity Fabric Bandwidths - AMD's Future in Servers: New 7000-Series CPUs Launched and EPYC Analysis

1 Like

Coming rather late to the party here, a small plea: when discussing multi-threaded applications, please specify whether you are using all physical cores or hyperthreaded (logical) cores.
Random number discussions aside, I think @DavidBerghaus nails it with the comment that the peak of performance is at 24 cores.
I know I am teaching my grandmother to suck eggs here, and I would actually appreciate this point being corrected. To me, hyperthreading will only be of benefit in a workload where one computation stalls, so that the other computation assigned to that physical core can be ‘switched in’ rapidly. In the past, many HPC-type codes were known not to benefit from hyperthreading.

And as a second aside, you can test the effect of hyperthreading being switched on or off via a few methods:

a) Disable it in the BIOS. That requires a reboot, and you may not be able to do it on a remote cluster that is not under your control.

b) On Linux, you can take individual logical CPUs offline, e.g. every odd-numbered one: echo 0 > /sys/devices/system/cpu/cpuN/online

c) On Linux, use numactl to ‘pin’ processes to particular CPU cores.

2 Likes

Interesting; the situation I described above happened on a server with two 32-core EPYC CPUs. That could be the reason why I saw such a drastic improvement when optimizing away allocations, rather than just the 5% that would be expected from merely getting the GC out of the way.

1 Like

On the hyperthreading question: if, say, I have 128 hardware threads on 64 cores but tell Julia to use only 64 threads, will it automatically distribute the threads over physical cores?

1 Like

Good question, I was wondering about that too

Regarding the number of cores, there is an issue open at the moment regarding Sys.CPU_CORES
https://github.com/JuliaLang/julia/issues/13901

I think Sys.CPU_LOGICAL_CORES is a good idea, and also Sys.CPU_PHYSICAL_CORES.
Maintaining Sys.CPU_CORES as equal to Sys.CPU_LOGICAL_CORES is probably justified, but I can see that leading to “D’oh!” moments in the future, as people may not appreciate that CPU_CORES is the logical, not physical, number.
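For what it is worth, the Hwloc.jl package can already report the physical count, which makes a handy cross-check against Sys.CPU_CORES; this sketch assumes Hwloc.jl is installed, and note that Sys.CPU_CORES was later renamed Sys.CPU_THREADS in Julia 1.0:

using Hwloc

physical = Hwloc.num_physical_cores()   # physical cores, from the hardware topology
logical  = Sys.CPU_CORES                # logical CPUs (hardware threads)
println("physical cores: $physical, logical CPUs: $logical")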

1 Like

I saw your reply here:
https://github.com/JuliaLang/julia/issues/17395#issuecomment-232343762

I am using the default sin / cos functions as well, so this might be the issue.
However, I cannot use your workaround with
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
since I have to work with BigFloats.
When I change Float64 to BigFloat, Julia just crashes.

Do you know another way to fix this?