Poor performance on cluster multithreading

I am currently running Julia on a computing cluster with 56 threads.
To test the multithreading capacity, I am generating random numbers:

function test()
    tic()
    Threads.@threads for i=1:100000
        rand(1000000)
    end
    toc()
end

and measure the computing time for different numbers of threads (set via “export JULIA_NUM_THREADS=”):

# Threads    Computing time (s)
 1           234.69
 4            82.99
14            38.15
16            35.85
28            35.05
40            39.25
56            87.07

Does anyone have an idea what could cause this behavior?

Thanks in advance

You’re using a global RNG object. Instead, make local RNGs and pass them to rand. Maybe that helps?
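For example, something along these lines (an untested sketch in Julia 0.6 syntax; the array name rngs and the per-thread seeding are just for illustration, and as the replies below point out, randjump is the strictly safe way to get independent streams):

const rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()]

function test_local()
    tic()
    Threads.@threads for i = 1:100000
        rand(rngs[Threads.threadid()], 1000000)   # each thread draws from its own generator
    end
    toc()
end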

2 Likes

Additionally, using rand without a local RNG will produce incorrect results. See the last part of the section https://docs.julialang.org/en/latest/manual/parallel-computing/#The-@threads-Macro-1 in the Julia manual (on Julia 0.6.3 the example should read a bit differently, since randjump produces a vector, but the idea is the same: random number generation is not thread safe, because rand does not lock the global random number generator).
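On 0.6.3 the adapted example would look roughly like this (a sketch, not tested):

# Julia 0.6: randjump(rng, n) returns a vector of n MersenneTwister objects whose
# internal states are jumped far apart, so the per-thread streams do not overlap.
rngs = randjump(MersenneTwister(0), Threads.nthreads())

Threads.@threads for i = 1:100000
    rand(rngs[Threads.threadid()], 1000000)   # each thread uses its own, non-overlapping stream
end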

The slowdown - in general - is probably due to false sharing, as each thread invalidates the cache of the other threads. But this is a secondary issue - the bigger problem is that this code simply will not produce correct results, if I am not missing something. An additional reason can be that your cluster most probably has 56 virtual cores but only 28 physical cores, and running two virtual cores on one physical core can lead to performance degradation.

2 Likes

You mean for example adding srand(i) inside the loop to create local RNGs? I tried that and still experienced the same behavior.

You are right, but this function is not actually part of the project that I am working on, so I wasn't thinking about the results.
I just wanted to create a simple example that should be trivially multithreadable.
So I created a small function generating random numbers and noticed that it shows the same performance behavior as the main algorithm I am using, where lots of threads lead to bad performance.

Do you have an idea why the false sharing could appear? Because from my understanding, the threads should be able to operate completely separately.

Reply to your updated post:
I had that idea as well after I made this post, but the cluster has 28 physical cores and performs best at around 24 threads.
The “real” algorithm that I am using is actually fastest with only 4 threads, which confused me a lot.

srand(i) does not create local RNGs. That just seeds the global RNG.
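Concretely, the difference looks like this (rng is just a local variable name):

srand(1234)                  # reseeds the single global RNG that all threads share
rng = MersenneTwister(1234)  # creates a separate generator you can keep per thread
rand(rng, 10)                # draws from the local generator, not the global one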

1 Like

How many allocations happen in the real algorithm?

1 Like

That's hard to tell, since I am calculating an irrational number numerically, so I never really reach an end and the number of digits increases over time.
I let the algorithm run for 32 s and got the following results with the @time macro:
32.894087 seconds (140.62 M allocations: 3.837 GiB, 5.24% gc time)

There are two issues:

  1. Understanding false sharing: the simplest intuition for when it can happen is when two threads modify regions of memory that are close to each other. They do not have to touch the same memory address - it is enough that the two addresses are close enough to fall into the same “chunk” of memory (a cache line) that is kept in the CPU cache. In such a situation, when thread 1 writes to memory address A, then thread 2, which uses a nearby address B, gets its cache invalidated, and this causes a performance regression (a small illustration follows after this list).
  2. In general, your computer has many different resources that can be a performance bottleneck (CPU, memory, disk IO, etc.) - it seems that some resource other than the CPU is simply the bottleneck in your code (assuming the code is otherwise correct :slight_smile:).
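A toy illustration of point 1 (hypothetical code, not from the original example): both functions below do the same increments, but in the first the per-thread counters are adjacent in memory and share a cache line, while in the second each counter is padded onto its own cache line (64 bytes = 8 Ints).

# counters are neighbouring Ints -> one cache line -> constant invalidation between threads
function bump_adjacent!(counts::Vector{Int})
    Threads.@threads for t = 1:Threads.nthreads()
        for _ = 1:10^8
            @inbounds counts[t] += 1
        end
    end
end

# each counter sits on its own cache line, so the threads no longer interfere
function bump_padded!(counts::Vector{Int})
    Threads.@threads for t = 1:Threads.nthreads()
        i = (t - 1) * 8 + 1
        for _ = 1:10^8
            @inbounds counts[i] += 1
        end
    end
end

Calling bump_adjacent!(zeros(Int, Threads.nthreads())) versus bump_padded!(zeros(Int, 8 * Threads.nthreads())) shows the effect, modulo whatever the compiler manages to optimize away.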
3 Likes

What I’ve found from experience is that any allocations at all (unless trivial, i.e. in the kB range) are absolutely detrimental to performance, because the GC is still not multithreaded. So even though you might see 5%, it might actually impact things more, as far as I understand it. To quickly check whether this is the issue, try checking the scaling with gc_enable(false). If there is an improvement, you should try to work with some caches (one per thread, allocated before the loop and used for the calculations inside the loop).
By eliminating all but a couple of kB of allocations I managed to get almost linear scaling up to 64 cores for some FEM calculations.
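For the random-number test from the top of the thread, the per-thread cache idea could look roughly like this (an untested sketch; thread_rngs, thread_bufs and test_cached are made-up names):

# allocate one RNG and one output buffer per thread, once, outside the hot loop
const thread_rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()]
const thread_bufs = [zeros(1000000) for _ in 1:Threads.nthreads()]

function test_cached()
    Threads.@threads for i = 1:100000
        id = Threads.threadid()
        rand!(thread_rngs[id], thread_bufs[id])   # fill in place: no allocation per iteration
    end
end

gc_enable(false)   # temporarily disable GC to see whether it limits the scaling
@time test_cached()
gc_enable(true)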

3 Likes

In this case multiple threads are accessing the same block of memory. It’s true sharing instead of false sharing.

5% means just that: 5% of the time spent in GC; you get no more than 5% off by disabling GC. That said, allocation can easily kill performance for a number of reasons. It’s just unrelated to whether GC is multithreaded or not.

1 Like

Thank you very much for the advice!
I might have underestimated the impact of allocations on performance. I will read through the performance optimization guide and try to avoid allocations as much as I can.

Agreed. Sorry for being imprecise. I meant that even if this is fixed by using randjump to get separate generators under Julia 0.6, there is, from my experience, still a risk of false sharing.

1 Like

Correct, and it’s actually measurable. The different RNGs have to be allocated on each thread instead of all being allocated on the same thread.

2 Likes

You can solve the problem like this:

using Compat, Compat.Random

const twisters = [MersenneTwister() for i ∈ 1:Threads.nthreads()];

#Multithreaded.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = twisters[id]     # this thread's own generator
        @inbounds x_thd = x[id]    # this thread's own output matrix
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end
3 Likes

This will actually have false sharing. You need to allocate the RNGs from each thread.

3 Likes

Actually there are two problems with this code:

  1. MersenneTwister() does not ensure that the PRNGs are non-overlapping (although the risk of overlap is very small - this is the reason why randjump is provided).
  2. The simplest thing you can do is either have each thread deepcopy the element it uses from an array of PRNGs that was initialized on a single thread, or use a “separator” object (one PRNG that is discarded, to separate the data used by the threads), similarly to what I proposed in KissThreading.jl/KissThreading.jl at master · mohamed82008/KissThreading.jl · GitHub. The discarding approach might not be best if you have NUMA issues in your infrastructure. A sketch of the deepcopy variant follows below.
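The deepcopy variant could look roughly like this (an untested sketch in Julia 0.6 syntax; seed_rngs and tf2! are made-up names):

# generators are created (and spaced with randjump) on a single thread ...
const seed_rngs = randjump(MersenneTwister(0), Threads.nthreads())

function tf2!(x::Vector{Matrix{Float64}})
    Threads.@threads for id = 1:Threads.nthreads()
        # ... but the copy a thread actually works with is allocated by that thread,
        # so its hot state does not share cache lines with the other threads' generators
        rng   = deepcopy(seed_rngs[id])
        x_thd = x[id]
        for nn = 1:100, mm = 1:100
            @inbounds x_thd[mm, nn] += randn(rng)
        end
    end
    return nothing
end

With x = [zeros(100, 100) for _ in 1:Threads.nthreads()], tf2!(x) does the same work as tf! above but keeps each generator in thread-local memory.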
2 Likes

I see, then I was mistaken. For some reason, though, I also remember that in my case allocations supposedly took up around 5-10%, yet reducing them yielded a huge speedup. It felt as if, when the GC hit, it would go through all the small chunks that each of the 64 threads had allocated separately (going through the small chunks allocated in each loop iteration rather than through each thread's total allocated chunk), thus taking a very long time. Not sure that is exactly what was going on, though; I might be completely wrong.

1 Like

This is correct. However, that’s exactly what takes 5% of the time!

Reducing allocation can certainly speed things up by a lot. However, at most 5% of that will come from the GC in this case. It is correct to say that the GC percentage is not an accurate representation of how much ALLOCATION is hurting you - it was never meant to be, and isn't even called that - but the reason for that is unrelated to what the GC does.

2 Likes