Poor performance while multithreading (Julia 1.0)

Hello everyone,

this is my first post (although not my first encounter with Julia), so please bear with me if I am doing everything wrong.

Okay, so here is my problem. I have a working code that solves a set of coupled differential equations. For small system sizes it does so very efficiently, but for large systems the number of derivatives to compute grows very large, which creates the need for parallelization. Fortunately, each element of the set of equations can be computed independently of the others. In essence, my code spends most of its time in the following snippet:

function computeStep!(Derivative :: MyStruct1, Parameters :: MyStruct2) :: Nothing

    for Iterator in Parameters.Iterators1
        Derivative.Kernel1[Iterator] = computeKernel1(Iterator, Parameters)
    end

    for Iterator in Parameters.Iterators2
        Derivative.Kernel2[Iterator] = computeKernel2(Iterator, Parameters)
    end

    return nothing
end

The Kernels are then updated using the derivative struct and standard RK4. I have implemented a parallelization using MPI.jl (which works on my remote machine and should also work across multiple nodes), but since the Kernels are memory hungry, distributed memory seems unfeasible. Instead I was tempted to use Julia's multithreading via the @threads macro and only use MPI between different nodes (making use of the shared memory within each node). However, performance of the multithreaded loops is roughly a factor of ~1.5 lower than that of the single-threaded loops, and memory consumption grows excessively large. Even if I wrap the two Kernel loops into separate functions the issue does not disappear. I also tried a more fine-grained split of the iterator sets, such that each thread operates only on some part of the Kernel, also without any success.
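For reference, the threaded variant I tried looks roughly like this (same structure as the snippet above, just with the loops decorated):

function computeStep!(Derivative :: MyStruct1, Parameters :: MyStruct2) :: Nothing

    # each iteration writes to a distinct element of the Kernel arrays,
    # so the iterations are independent of each other
    Threads.@threads for Iterator in Parameters.Iterators1
        Derivative.Kernel1[Iterator] = computeKernel1(Iterator, Parameters)
    end

    Threads.@threads for Iterator in Parameters.Iterators2
        Derivative.Kernel2[Iterator] = computeKernel2(Iterator, Parameters)
    end

    return nothing
end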

My current Julia version is 1.0. For benchmarking I used BenchmarkTools.jl and its @benchmark macro.

Any ideas what is going wrong here?

EDIT: The workaround seems to be to use FastClosures.jl. Could anyone give an example of how to use it for the snippet above when the loops are decorated with Threads.@threads?

What does Threads.nthreads() output?

Just in case: hardware multithreading is enabled by exporting the JULIA_NUM_THREADS environment variable.

Hello zgornel,

I'm well aware of that. Threads.nthreads() outputs 4, since I did export JULIA_NUM_THREADS=4 (my computer has 4 physical cores).

Are you perhaps running into the dreaded issue 15276? That issue in particular has been known to cause performance issues due to the closures created by the @threads macro (e.g. Slowdown when multi-threading is used or Type inference for parallel computing or `@threads` with `@simd` performance regression on `0.7.0-alpha` - #2 by raminammour )
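For illustration (a made-up sketch, not your code): the typical trigger is a captured variable that is assigned more than once in the enclosing function. The loop body of @threads is lowered into a closure, so such a variable gets wrapped in a Core.Box and every access to it becomes type-unstable:

function addOffset!(A :: Vector{Float64}, flag :: Bool)
    offset = 0.0
    if flag
        offset = 1.0        # `offset` is assigned on two code paths ...
    end
    Threads.@threads for i in eachindex(A)
        A[i] += offset      # ... and captured by the closure @threads builds,
    end                     # so it shows up as Core.Box in @code_warntype
    return A
end

A common workaround is a `let offset = offset` block around the loop, which gives the closure a freshly bound, assigned-once variable to capture; FastClosures.jl automates the same trick for explicitly written closures.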

Thanks for the recommendation. I will look into that topic.

So it seems the workaround here is to use the FastClosures.jl package. Now I am wondering about the correct way to use it in the context of a loop decorated with @threads. Does anybody have an example?
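For concreteness, something along these lines is what I imagine (just my guess at the intended usage, with the names from my snippet above; I am not sure this actually addresses the closure that @threads itself creates):

using FastClosures

function computeStep!(Derivative :: MyStruct1, Parameters :: MyStruct2) :: Nothing

    # build the per-element work functions once, with @closure making the
    # captured variables concretely typed, then call them inside the threaded loops
    kernel1! = @closure Iterator -> (Derivative.Kernel1[Iterator] = computeKernel1(Iterator, Parameters))
    kernel2! = @closure Iterator -> (Derivative.Kernel2[Iterator] = computeKernel2(Iterator, Parameters))

    Threads.@threads for Iterator in Parameters.Iterators1
        kernel1!(Iterator)
    end

    Threads.@threads for Iterator in Parameters.Iterators2
        kernel2!(Iterator)
    end

    return nothing
end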

@rdeits I have done a few more checks on the types of the respective functions involved in that part of the code using @code_warntype. It turned out that even when @threads is added, nothing is boxed or anything like that, so I am not sure that issue applies to my case.

How slow is @btime computeStep!? How large are your structures? Are you allocating intermediate values in computeStep!? How much data do you touch (relevant with respect to the CPU cache)? Are you at risk of writing into the same cache line from different threads?

@foobar_lv2

Let me break my answer into pieces according to your questions.

For my test parameters, @btime gives me 2.89 s with no threads, 4.11 s with 2 threads, 3.92 s with 3 threads and 3.71 s with 4 threads (my machine has 4 physical cores). So although there is some scaling with Threads.nthreads(), I am not even getting into the range of the non-threaded loop.

The structs consist of 3 arrays. Field1 has at most ~10^2 entries; in the tests it has about 10. The others contain up to 10^7 entries; in the tests it is roughly a thousand each.

struct MyStruct
    Field1 :: Vector{Float64}
    Field2 :: Array{Float64, 4}
    Field3 :: Array{Float64, 4}
end

The computeKernel() functions compute one-dimensional integrals, so inside of these I define inner functions which are passed as kernels to some integration function. In that process I temporarily allocate a buffer that stores intermediate results.
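Schematically, each of these looks something like the following (with made-up names for the integration routine and the integrand; the real functions are more involved):

function computeKernel2(Iterator, Parameters) :: Vector{Float64}
    # temporary buffer for intermediate results, allocated on every call
    # (size chosen just for illustration)
    buffer = zeros(Float64, 64)

    # inner function capturing Iterator, Parameters and buffer,
    # passed as the kernel to the 1D integration routine
    integrand = x -> evaluateIntegrand!(buffer, x, Iterator, Parameters)

    return integrate1D(integrand, Parameters)
end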

Within the calculation, elements of one struct are used to compute elements of the other, so there is in fact quite a lot of data access (so cache might be a concern?!).

I am not sure how to test whether two threads write into the same cache line… The only thing I can say is that the loops are definitely threadsafe, in the sense that no chunk of memory is simultaneously accessed for both reading and writing.

Ok. Then my concerns are probably unfounded: your problem is indeed large enough to split profitably (if your answer had been 10 us, I would have doubted that), and if your code is “definitely threadsafe” (instead of “non-obviously threadsafe”), then this is probably not an issue either. A rule of thumb: imagine threadsafety worked differently, and each write tainted the surrounding ~60 bytes in both directions for all other threads, for both reads and writes, until your control flow is reunited. In order to figure out which memory is adjacent, you need to convert cartesian to linear indices. The most deadly pattern would be e.g. a 4xN matrix where thread k reads and writes M[k, :]: obviously threadsafe for humans, very non-obviously threadsafe for your poor CPU, which will run in circles.
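To make the 4xN example concrete (Julia arrays are column-major, so entries within one column sit next to each other in memory):

M = zeros(4, 100_000)
Threads.@threads for k in 1:4
    for j in 1:size(M, 2)
        M[k, j] += 1.0    # thread k writes row k: all four threads keep hitting the
    end                   # same 64-byte cache lines, i.e. false sharing
end
# better: give each thread a contiguous block, e.g. store the data
# as a 100_000 x 4 matrix and let thread k write M[:, k]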

What about allocations? Can you post the verbatim output of @btime for 1-4 threads? Both number and size of allocs can tell us a lot about your problem.

Is your code using BLAS, e.g. matrix products? BLAS is already multithreaded, and adding an outer @threads will mess with its clever ways of using your L3 cache (which is shared between cores, so you want all of your cores to work on the same medium-ish submatrix at the same time). You can check this by, e.g., monitoring whether your “single-threaded” code already uses multiple cores, e.g. via top.
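If BLAS does turn out to be active in the hot loop, you can pin it to one thread while benchmarking:

using LinearAlgebra

# stop BLAS from spawning its own threads, so it cannot
# compete with the @threads loop for cores
LinearAlgebra.BLAS.set_num_threads(1)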

Regarding your point on “non-obvious” threadsafety: currently, when I perform the loop, I do not access the entries that are filled in the result buffer (the Derivative.Kernel) linearly, but by their cartesian coordinates, so I guess that might cause problems with memory access (?). What I mean is:

for Iterator in Parameters.Iterators2
     Derivative.Kernel2[:, Index1(Iterator), Index2(Iterator), Index3(Iterator)] = computeKernel2(Iterator, Parameters)
end

The thing here is that the return value of computeKernel2 is of type Vector{Float64}, so it was not obvious to me how to access Derivative by linear indices only.

There are some BLAS routines involved, but that happens before the performance-critical part… I also don't see more than one core working in the single-threaded runs.

If Derivative.Kernel2::Array{Float64, 4} and computeKernel2(Iterator, Parameters)::Vector{Vector{Float64}}, then your assignment cannot work. Did you mistype?

That being said, this kind of assignment is probably OK: the destination of your assignment is a contiguous region of memory (assuming that Index1 etc. are integers and not ranges).
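In case you ever do want to express that write with linear indices, a sketch of the equivalent (contiguous) assignment:

lin   = LinearIndices(Derivative.Kernel2)
start = lin[1, Index1(Iterator), Index2(Iterator), Index3(Iterator)]   # linear index of the first element of the column
vals  = computeKernel2(Iterator, Parameters)
for n in eachindex(vals)
    Derivative.Kernel2[start + n - 1] = vals[n]   # the whole column is one contiguous block
end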

In my previous post I mistyped this, mea culpa.

Yes these Index1 etc are integers that depend on the outer loop iterator in some way.

I'll do some further testing with @btime for different parameter sets and post the output here when I have it. Just off the top of your head, is there anything else I could try?

Your @threads loop is probably pretty small, and each of the computeKernel2(Iterator, Parameters) calls is expensive? Try making it @noinline. This increases readability of the @code_warntype, @code_llvm and @code_native output. Try running your code on a current master. Otherwise, I'm out of immediate ideas.
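That is, just prefix the kernel definitions:

@noinline function computeKernel2(Iterator, Parameters)
    # ... body unchanged ...
end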

Well, I think all the computeKernel() calls are pretty cheap; it's only their sheer number (i.e. the size of the loop) that makes the calculations expensive. I'll try what you've said, many thanks.

Also note that you can use the profiler with multithreaded code; it will get confused about some call-tree connections, but should point to the proper text. Use the C=true option when printing the profile; this will show whether the GC is getting called - that can stall the threads and may be a sign that you need to give the compiler more help.
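For example:

using Profile

computeStep!(Derivative, Parameters)      # run once first so compilation is not profiled
Profile.clear()
@profile computeStep!(Derivative, Parameters)
Profile.print(C = true)                   # C frames included; GC activity shows up as jl_gc_* frames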

I ran a few more tests with @btime; here are the results for some test parameters.

1 thread:  1.880 s (41082894 allocations: 3.78 GiB)
2 threads: 3.198 s (33789435 allocations: 3.16 GiB)
3 threads: 3.038 s (26541670 allocations: 2.61 GiB)
4 threads: 2.883 s (21525672 allocations: 2.22 GiB)

As I reported earlier, there is some scaling with the number of threads, but we do not get close to single-threaded performance.

Another thing that keeps confusing me is the following. When I run the calculations and track the activity using top, the single-threaded version has a more or less stable RAM consumption of ~1%. However, with more than one thread enabled, RAM consumption fluctuates a lot, reaching spikes of about ~50%. I assume this excessive usage comes from compilations during the calculation, which would somehow hint at #15276; however, I see no strange type behaviour with @code_warntype, neither for computeStep!() nor for the computeKernel() functions.

I guess garbage collection can be excluded, since turning it off manually via GC.enable(false) does not accomplish anything.

Manually introducing a lock around the bodies of the threaded loops gets rid of the RAM spikes, but I still do not observe any speedup (nor a slowdown). I'm not even sure what the fact that I apparently need to use locks tells me about my problem. Maybe someone can explain?

function computeStep!(Derivative :: MyStruct1, Parameters :: MyStruct2) :: Nothing

    # name the lock something other than `lock`, so it does not shadow Base.lock
    mtx = Threads.Mutex()

    Threads.@threads for Iterator in Parameters.Iterators1
        lock(mtx)
        Derivative.Kernel1[Iterator] = computeKernel1(Iterator, Parameters)
        unlock(mtx)
    end

    Threads.@threads for Iterator in Parameters.Iterators2
        lock(mtx)
        Derivative.Kernel2[Iterator] = computeKernel2(Iterator, Parameters)
        unlock(mtx)
    end

    return nothing
end

EDIT: Wrapping each loop body in

if trylock(gate)
    # loop body
    unlock(gate)
end

finally speeds up the code!!! However, the result no longer appears to be correct. But apparently playing around with locks and thread synchronization here is the way to settle the issue.

EDIT: Of course the if clause gives a speedup: iterations that fail to acquire the lock are simply skipped, so we are not speeding the calculations up, we are dropping them. A stupid mistake on my side; I got too excited about the performance, I guess.