Matrix multiplication is slower when multithreading in Julia

pablof300 · March 1, 2021, 3:43am

I am working with big matrices (size of 30k rows and ~100 columns). I am doing some matrix multiplication and the process would take around 20 seconds. This is my code:

    @time begin
        result = -1
        data = -1
    
        for i=1:size
            first_matrix = @view data[i * split,:]
    
            for j=1:size
                second_matrix = @view Qg[j * split,:]
                matrix_multiplication = first_matrix * second_matrix'
                current_sum = sum(matrix_multiplication)
                global result
    
                if current_sum > result
                    result = current_sum
                    data = matrix_multiplication[1,1]
                end
            end
        end
    end

Trying to optimize this a little more, I tried to use multi-threading (julia --thread 4) to get better performance.

    @time begin
        global result = -1
        global data = -1
        lock = ReentrantLock()
    
        for i=1:size
            first_matrix = @view data[i * split,:]
    
            Threads.@threads for j=1:size
                second_matrix = @view Qg[j * split,:]
                matrix_multiplication = first_matrix * second_matrix'
                current_sum = sum(matrix_multiplication)
                global result
    
                if current_sum > result
                    lock(lock)
                    result = current_sum
                    data = matrix_multiplication[1,1]
                    unlock(lock)
                end
            end
        end
    end

By adding multi-threading I thought I would get an increase in performance, but the performance got worse (~40 seconds). I removed the lock to see if that was the issue, but still got the same performance. I am running this on a Dual-Core Intel Core i5 (MacBook pro). Does anyone know why my multi-threading code doesn’t work?

DNF · March 1, 2021, 6:32am

Three things first

Avoid working in global scope like this, and definitely avoid global variables. Wrap all of this in a function, and make sure you don’t use globals.
Arrays in Julia are column major, while you are working along rows. This will hurt performance.
Could you make a self-contained, running code example with input data (random), so others can try running your code?

Don’t think about multithreading until your single threaded performance is good.

Elrod · March 1, 2021, 7:03am

a) Julia’s BLAS is already multithreaded. Try using LinearAlgebra; BLAS.set_num_threads(1) to make it single threaded when using Threads.@threads for your multithreading.
b) BLAS is fastest with 1 thread per core, so use just 2 threads on a dual core CPU.

tkf · March 1, 2021, 7:18am

You are reading result outside the lock. It’s probably an undefined behavior. I think you’d want to check if your program is correct, before asking if it is fast.

Unless you are very aware of the pros and cons of what you are doing, I’d suggest avoiding locks as much as possible, if you want efficient computation. It is very likely there is a better way to phrase your parallel computation. FYI, my introduction of data parallelism in Julia discusses various high-level data-parallel patterns: A quick introduction to data parallelism in Julia

This and the following code data = matrix_multiplication[1,1] indicates that you don’t actually need to materialize matrix_multiplication as a matrix. It’d be faster (probably even with a single thread) to compute the sum and (1, 1)-th element directly (in one sweep).

pablof300 · March 1, 2021, 3:21pm

How can I avoid the lock if I need to redefine a shared variable among threads?

DNF · March 1, 2021, 3:41pm

I really meant it when I said you should fix your single threaded code first. You will get much bigger gains just from fixing that, and the parallelization will work better too.

You are starting at the wrong end.

tkf · March 1, 2021, 8:59pm

I agree with DNF’s suggestion that first thing to try is to improve the serial code by making more idiomatic Julia.

Do you really need to update the global? I’d suggest avoid mixing how to compute things and what to compute. If you just need to find the maximum and a corresponding value (= what), you don’t need to use globals (= how). My tutorial includes a subsection for an approach to do this “findmax” type of computations that is applicable for serial, threaded, distributed, and GPU computations.

e3c6 · January 20, 2022, 10:53am

If I start Julia with julia - t 4 (say), each of those 4 Julia threads gets its own pool of BLAS threads?

If the answer is no, is there a way to make each julia thread spawn its own BLAS threads?

jpsamaroo · January 20, 2022, 2:53pm

No, changing the number of Julia threads does not affect the number of BLAS threads.

BLAS.set_num_threads(1) doesn’t make all BLAS calls run on just a single thread (which would suck); it just prevents usage of the BLAS threadpool (IIUC), instead causing BLAS code to be directly executed on the Julia thread invoking it. This still allows parallelism when executing many BLAS operations concurrently, if you execute enough of them.

e3c6 · January 20, 2022, 4:47pm

I should call BLAS.set_num_threads(1) within each thread? Or just one global call?

jpsamaroo · January 20, 2022, 5:35pm

Just once, the setting is global to the Julia session.

carstenbauer · January 20, 2022, 5:45pm

While this is true (at least I think so as well), I guess the interesting part is what happens for BLAS.set_num_threads(N) where N>1. And based on a simple test I just ran, OpenBLAS (default) and MKL behave very differently here!

Test function:

function f()
    X = rand(1000,1000);
    Y = zeros(Threads.nthreads())
    Threads.@threads for i in 1:Threads.nthreads()
        Y[i] = sum(X * X)
    end
    return Y
end

MKL:

# 4 Julia threads

julia> BLAS.set_num_threads(1)

julia> @btime f(); # ~400% CPU usage
  32.890 ms (39 allocations: 45.78 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime f(); # ~800% CPU usage
  18.541 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so

OpenBLAS:

# 4 Julia threads

julia> BLAS.set_num_threads(1)

julia> @btime f(); # ~400% CPU usage
  36.527 ms (39 allocations: 45.78 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime f(); # ~240% CPU usage
  90.159 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so

For for OpenBLAS, BLAS.set_num_threads(2) seems to control the total number of threads that BLAS uses. So increasing N in BLAS.set_num_threads(N) from N=1 to N=2 might actually lead to the computation running on fewer threads. IMHO, that’s very counterintuitive.

For MKL on the other hand, BLAS.set_num_threads(2) makes the BLAS computation use 2 * (# of Julia threads) many threads.

From the above I’d conclude the following. Let’s say a machine has C cores (no hyperthreading) and we have J Julia threads. What do we have to choose for N in BLAS.set_num_threads(N) to utilize all cores? For OpenBLAS, the answer seems to be N=C whereas for MKL you’d want to set N=C/J.

I could nicely confirm this for MKL where on a C=40 core machine I did set J=5 and N=8 to obtain

julia> @btime f(); # ~4000% CPU usage
  8.076 ms (39 allocations: 45.78 MiB)

Note that BLAS.set_num_threads(C), which strangely enough seems to be the default (why?!?), is understandably suboptimal:

julia> @btime f();
  19.382 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_num_threads()
40

For OpenBLAS, it’s not as clean since N=C=40 only gave me

julia> @btime f(); # ~3000% CPU usage
  17.151 ms (39 allocations: 45.78 MiB)

(Note that going to N=45 didn’t improve the CPU usage.)

I guess the conclusion is: Set BLAS.set_num_threads(1) unless you know what you / your BLAS library is doing!

PS: Instead of running simple experiments it would of course make sense to read the OpenBLAS / MKL documentation on this matter. But experiments are just more fun

carstenbauer · January 21, 2022, 10:10pm

Citing a post by myself from a follow-up discussion on Slack:

Tried to find some documentation of OPENBLAS_NUM_THREADS but couldn’t find too much. There is

Faq · OpenMathLib/OpenBLAS Wiki · GitHub which essentially just says “always use OPENBLAS_NUM_THREADS=1 if your application is multithreaded”, and

GitHub - OpenMathLib/OpenBLAS: OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. which states that OPENBLAS_NUM_THREADS indeed specifies the “ maximum number of threads”

So I guess it at least behaves according to documentation.

It appears that OpenBLAS just starts OPENBLAS_NUM_THREADS many threads and then schedules work on them (unless OPENBLAS_NUM_THREADS == 1 which seems to be special-cased as @jpsamaroo mentioned above).

carstenbauer · January 21, 2022, 10:26pm

And since we’re already reading documentations, the MKL behavior can be understood e.g. from the documentation of mkl_set_num_threads_local (which allows one to set the number of MKL threads per executing thread (i.e. Julia thread). It says

If the thread-local number is not set or if this number is set to zero in a call to this function, Intel® oneAPI Math Kernel Library functions use the global number of threads.

I read this as # of local MKL threads == # of global MKL threads for each application thread (i.e. Julia thread).