Poor performance on cluster multithreading

That’s why I said you should just allocate the RNG on the thread you are going to use it on. This issue will be automatically taken care of.

1 Like

I see, thank you for the explanation!

1 Like

Time for an experiment:

using Base.Threads   # for nthreads() and @threads

function runme(N=10^8)
    nt = nthreads()
    mt0 = MersenneTwister(1234)
    rngs = randjump(mt0, nt)      # nt RNGs with non-overlapping streams
    av = zeros(1024, nt)          # strided so threads write to separate cache lines
    @threads for it = 1:nt
        # rng = rngs[it]          # version 1: RNG allocated on the main thread
        rng = copy(rngs[it])      # version 2: deep copy allocated on this thread
        a = 0.0
        for i = 1:N
            a += (rand(rng) - 0.5)
        end
        av[1, it] = a
    end
    av[1, :]
end

(Note that the copy method is implemented as a deep copy. Striding av is pointless here, since each thread writes its slot only once after the loop, but it would matter if av were written in the hot loop.)
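For illustration, this is what the striding would buy if the accumulator were written inside the hot loop: each thread's slot starts a fresh 64-byte line (a sketch; 8 Float64s per line, and 64-byte alignment of the array data, are assumptions):

using Base.Threads

# Sketch: per-thread hot-loop accumulators, strided so no two
# threads write to the same 64-byte cache line.
function accumulate_strided(N)
    nt = nthreads()
    av = zeros(8, nt)             # one column per cache line (alignment assumed)
    @threads for it = 1:nt
        for i = 1:N
            av[1, it] += 1.0 / i  # hot write stays on the thread's own line
        end
    end
    av[1, :]
end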

On my system, with 8 threads on 8 cores:
Version 1: 1.869167 seconds (61 allocations: 122.984 KiB)

Version 2: 0.422015 seconds (101 allocations: 174.734 KiB)

Version 1 occasionally ran faster than this, but Version 2 was consistently fast.
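(For anyone reproducing this: the thread count is fixed by the JULIA_NUM_THREADS environment variable before Julia starts.)

# Launch with 8 threads; the variable is read once at startup:
#   $ JULIA_NUM_THREADS=8 julia
using Base.Threads
nthreads()   # 8 with the launch line above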

I am surprised how much this matters, given the large size of the Mersenne Twister state vector.

2 Likes

I’m sure others have mentioned this, but I believe rand() is not thread-safe: it causes problems when multiple threads call it unless you give each thread its own RNG.

Take a look at this thread for an example of multi-threaded random number generation.

1 Like

Now you know how much hardware vendors “cheat” and how slow the hardware actually is for a serial program accessing memory. Even 1% of the work hitting a >200x slowdown can easily give you a significant global slowdown.

2 Likes

Cool. I updated the comment in the thread I linked and, besides the obvious advantage of actually being correct / independently random, saw roughly a 50% speedup using TRNG to avoid false sharing.
This was on a Threadripper, which may have some NUMA issues (two separate dies in one physical processor; latency is higher between them[1]). But the end result was about a 20x improvement over single-threaded, which seems really good for 16 physical (32 logical) cores.

Oddly, I see a significant slowdown from hoisting the getindex out of the loop:


# tmap! and TRNG come from the poster's threading utilities
# (presumably KissThreading.jl: a threaded map! and a vector of
# per-thread RNGs indexed by threadid()).
function tf!(x::Vector{Matrix{Float64}}, N::Int64)
    tmap!(x, x) do x
        id = Threads.threadid()
        n = Threads.nthreads()
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(TRNG[Threads.threadid()])   # getindex in the hot loop
        end
        x
    end
    nothing
end

vs


function tfc!(x::Vector{Matrix{Float64}}, N::Int64)
    tmap!(x, x) do x
        id = Threads.threadid()
        n = Threads.nthreads()
        # rng = TRNG[Threads.threadid()]      # hoisted getindex: oddly slower
        rng = copy(TRNG[Threads.threadid()])  # hoisted deep copy: fast
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(rng)
        end
        x
    end
    nothing
end

while the version that relies on copy and the version that indexes TRNG inside the loop are about equally fast.
Both are much faster than relying on a plain randjump.
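For reference, TRNG here is a global table of one RNG per thread, along these lines (a sketch under that assumption; the actual definition in the package may differ):

using Compat        # for the undef constructor on older Julia
using Base.Threads

# Sketch: one MersenneTwister per thread, indexed by threadid().
# Each RNG is copied on the thread that will use it, so its state
# is allocated in memory local to that thread.
const TRNG = Vector{MersenneTwister}(undef, nthreads())
let jumped = randjump(MersenneTwister(0), nthreads())
    @threads for i in 1:nthreads()
        TRNG[i] = copy(jumped[i])
    end
end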

[1] NUMA NUMA: Infinity Fabric Bandwidths - AMD's Future in Servers: New 7000-Series CPUs Launched and EPYC Analysis

1 Like

Coming rather late to the party here, a small plea: when discussing multi-threaded applications, please specify whether you are using all physical cores or hyperthreaded (logical) cores.
Random number discussions aside, I think @DavidBerghaus nails it with the comment that the peak of performance is at 24 cores.
I know I am teaching my grandmother to suck eggs here, and I would actually appreciate this point being corrected. To me, hyperthreading only benefits a workload where one computation stalls, so the other computation assigned to that physical core can be ‘switched in’ rapidly. In the past, many HPC-type codes were known not to benefit from hyperthreading.

And a second aside: you can test the effects of hyperthreading being switched on or off via a few methods.

a) Disable it in the BIOS. OK, that is going to require a reboot, and you may not be able to do that on a remote cluster not under your control.

b) On Linux you can take individual logical CPUs offline: echo 0 > /sys/devices/system/cpu/cpuN/online. Note that the hyperthread siblings are not necessarily the odd-numbered CPUs; check /sys/devices/system/cpu/cpuN/topology/thread_siblings_list to find each core’s siblings (see the sketch after this list).

c) On Linux, use numactl to ‘pin’ processes to particular CPU cores.
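For b), a quick way to inspect the sibling map from Julia before deciding what to take offline (a sketch; the paths are standard Linux sysfs):

# List each logical CPU's hyperthread siblings from sysfs (Linux only).
for cpu in 0:Sys.CPU_CORES-1
    path = "/sys/devices/system/cpu/cpu$cpu/topology/thread_siblings_list"
    isfile(path) && println("cpu$cpu siblings: ", readchomp(path))
end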

2 Likes

Interesting, the situation I described above happened on a server with two 32-core EPYC CPUs. That could be why I saw such a drastic improvement when optimizing for allocations, rather than just the ~5% one would expect from merely getting the GC out of the way.

1 Like

On the hyperthreading: if, say, I have 128 hardware threads on 64 cores but tell Julia to use only 64 threads, will it automatically distribute them over the physical cores?

1 Like

Good question, I was wondering about that too

Regarding the number of cores, there is an issue open at the moment regarding Sys.CPU_CORES:
https://github.com/JuliaLang/julia/issues/13901

I think Sys.CPU_LOGICAL_CORES is a good idea, and also Sys.CPU_PHYSICAL_CORES.
Keeping Sys.CPU_CORES equal to Sys.CPU_LOGICAL_CORES is probably justified, but I can see that leading to D’oh! moments in the future as people fail to appreciate that CPU_CORES is the logical, not the physical, number.

1 Like

I saw your reply here:
https://github.com/JuliaLang/julia/issues/17395#issuecomment-232343762

I am using the default sin / cos functions as well, so this might be the issue.
However, I cannot use your workaround with
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
since I have to work with BigFloats.
When I change Float64 to BigFloat, Julia just crashes.

Do you know another way to fix this?

No, you are not seeing the same issue. The crash with BigFloat is expected: libm’s sin only has a C double entry point, so declaring the argument and return type as BigFloat pushes MPFR structs through the wrong ABI.
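A BigFloat version has to go through MPFR instead, roughly like this (a sketch; it is essentially what Base’s sin(::BigFloat) already does, so it may not avoid whatever is hurting the threaded scaling):

# Sketch: BigFloat sine by calling MPFR directly. mpfr_sin writes
# sin(x) into z; the final 0 is MPFR_RNDN (round to nearest).
function sin_big(x::BigFloat)
    z = BigFloat()
    ccall((:mpfr_sin, :libmpfr), Int32,
          (Ref{BigFloat}, Ref{BigFloat}, Int32),
          z, x, 0)
    return z
end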

I think the sin/cos calls really are the reason my main algorithm does not scale under multithreading.
I excluded them from the algorithm and finally got the expected behavior where more threads lead to more speed.

But you don’t know if there is a way for BigFloats?
I could try to write sin/cos algorithms on my own, but they would probably not reach the speed of the default ones…

I somehow don’t understand all of the issue. We have two potential problems: (1) false sharing: thread A writes to addr1, thread B reads from addr2, and both are on the same cache line, so the line ping-pongs between cores. Yeah, that’s bad and will definitely kill performance. (2) Cache aliasing: the processor looks only at a few bits of the address to decide which cache set a line goes into, so distinct lines can collide and evict each other.

Now, what happens for the RNG? Every thread needs to grab its MersenneTwister. That is a mutable struct occupying half a cache line (32 bytes), and its mt.idx field gets read and written a lot. If two threads’ MersenneTwister structs end up on the same cache line, we get into trouble. Next, we have the buffer (mt.vals) and the state (mt.state); we need to access both of them a lot. These are vectors, so we hit both the vector struct and the actual data buffer. If any of these collide across threads, we are unhappy.

Let’s look at the addresses:

julia> function fct()
           mt0 = MersenneTwister(0)
           randjump(mt0, 10)
       end;

julia> rngs = fct();

julia> pointer_from_objref.(rngs)
10-element Array{Ptr{Void},1}:
 Ptr{Void} @0x00007f1d0b247dc0
 Ptr{Void} @0x00007f1d0b247e50
 Ptr{Void} @0x00007f1d0b247eb0
 Ptr{Void} @0x00007f1d0b247ee0
 Ptr{Void} @0x00007f1d0b247f10
 Ptr{Void} @0x00007f1d0b247f40
 Ptr{Void} @0x00007f1d0b247fa0
 Ptr{Void} @0x00007f1d0b247fd0
 Ptr{Void} @0x00007f1d0b264010
 Ptr{Void} @0x00007f1d0b264070

Urgh. That doesn’t look nice. For example, the counter (mt.idx) and buffer (mt.vals) of the following two will share a cache line:

 Ptr{Void} @0x00007f1d0b247eb0
 Ptr{Void} @0x00007f1d0b247ee0

So, in summary: randjump should be rewritten, or MersenneTwister should be padded to 64 bytes and allocated via posix_memalign (through a ccall of some sort) at creation? (Nothing should ever share a cache line with a hot counter like MersenneTwister’s idx; and apparently even 64-byte mutable structs can get split over two cache lines?)

None of the above. That’s why I repeatedly say that you should allocate on each thread. That already guarantees different threads won’t share a cache line.

So the general solution for thread-local mutable state would be

using Compat        # for the undef constructor on older Julia
using Base.Threads

function rng_init()
    rngs = Vector{MersenneTwister}(undef, nthreads())
    mt0 = MersenneTwister(0)
    jumped = randjump(mt0, nthreads())   # non-overlapping streams, computed up front
    @threads for i = 1:nthreads()
        rngs[i] = copy(jumped[i])        # deep copy allocated on thread i itself
    end
    return rngs
end

so that the rngs can be re-used?

1 Like

You can use a global vector. Yes, that shouldn’t have a false-sharing issue (edit: it could in principle still see conflicts due to cache-associativity limits, but that’s a completely different issue…). Not sure what you mean by “re-used”.

Not sure what you mean by “re-used”.

The previous solution looked like it would create new MersenneTwisters at each split (@threads) of the program flow.

OK, sure. I assumed it was written that way for easy testing, but yes: the copying in the loop isn’t the point, allocating on the thread is.
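Spelled out, the re-usable pattern then looks something like this (a sketch; rng_init as above, drawing uniforms as a stand-in workload):

const RNGS = rng_init()           # one-time, per-thread allocation

function simulate(N)
    out = zeros(nthreads())
    @threads for it = 1:nthreads()
        rng = RNGS[it]            # each thread touches only its own RNG
        s = 0.0
        for i = 1:N
            s += rand(rng) - 0.5
        end
        out[it] = s
    end
    return out
end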

1 Like