Multi-threading in Julia 1.8.5 does not improve speed when combined with BLAS

Hi,
I am very new to Julia and am trying to use multi-threading to make my code faster. If I do not use linear algebra, my code using “Threads.@threads” for the “for” loop speeds up as expected. However, when I use linear algebra in the loop, there is no speed increase at all. (Indeed, it gets worse.) The two test codes (with and without linear algebra) are as follows:

Code without linear algebra:

function sincall(x)
    return sin(x)
end

function test()
    nMax = 1e2
    mMax = 1e6
    ax = 1.0
    ay = 0.0
    Threads.@threads for n = 1:nMax
        tmp = 0
        for m = 1:mMax
            tmp = tmp + sincall(2*pi*(m*ax+m*ay))
        end
    end
end
test()

This gives the following results:

> time julia -t 1 test.jl

real    0m2.634s
user    0m3.085s
sys     0m4.083s

> time julia -t 4 test.jl

real    0m0.969s
user    0m3.550s
sys     0m4.114s

On the other hand, the code with linear algebra:

using LinearAlgebra

BLAS.set_num_threads(1)

function test()
    nMax = 1e2
    mMax = 1e5
    Threads.@threads for n = 1:nMax
        tmp = 0
        for m = 1:mMax
            a = rand(10,10);
            tmp = tmp + sum(a*a)
        end
    end
end
test()

This gives the following results:

> time julia -t 1 test2.jl

real    0m6.948s
user    0m7.345s
sys     0m4.198s

> time julia -t 4 test2.jl

real    0m19.462s
user    0m38.935s
sys     0m37.127s

Maybe I am doing something completely wrong. Does anybody have a clue how to solve this issue?

Julia version is 1.8.5, running on Debian/Bookworm (on Ryzen 3970X).

Many thanks,
tj

The issue might be the memory allocations in the second example, which prevent you from reaping the full benefits of multithreading.
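
A quick way to see them is Base’s @allocated macro (a minimal sketch; run each expression once beforehand so compilation is not counted):

a = rand(10, 10)
@allocated rand(10, 10)  # every call heap-allocates a fresh 10×10 matrix
@allocated a * a         # the product a*a also allocates its result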

1 Like

As @gdalle said, allocations will likely make multithreading performance much worse. Allocations happen every time the 10x10 rand array is created inside the loop. If you instead preallocate the arrays, it makes for a much better comparison:

using Random
using LinearAlgebra
function test()
    nMax = 1e2
    mMax = 1e5
    Threads.@threads for n = 1:nMax
        tmp = 0.0
        a = Matrix{Float64}(undef, 10, 10)  # preallocate buffers outside the inner loop
        b = similar(a)                      # separate output buffer for mul!
        for m = 1:mMax
            rand!(a)       # fill a in place, no allocation
            mul!(b, a, a)  # in-place product, no allocation
            tmp = tmp + sum(b)
        end
    end
end

You should also avoid timing a script with time from the command line, as this can be wildly misleading due to compilation time and the lack of repeats. Instead, it is standard practice to use BenchmarkTools.jl:

julia> using BenchmarkTools
julia> @benchmark test()

Executed during a REPL session.

Edit: Fixed example to be only allocating at the start. This can be improved if needed.
Edit: Made code identical

1 Like

Note that allocations also happen when you compute a * a before taking the sum. But more importantly, the code by @jmair doesn’t seem thread-safe, because several threads can write into the same matrix storage. I would recommend one matrix per thread.

Thanks for spotting the performance issue, which has been fixed. The code is still completely thread-safe though, as the inner loop happens entirely on the same thread and there are no shared variables.

Oh right, I had not noticed that the matrix a was created within the threaded loop. My bad!

But in “fixing the performance issue” you turned the computation of a * a into a .* a, which is not what the initial problem stated. Perhaps using the in-place multiplication mul! would be more appropriate here?

1 Like

Yes, this is definitely needed, although we need a second array, since the in-place mul! does not allow its output to be aliased with its inputs.
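
For example, a minimal sketch:

using LinearAlgebra

a = rand(10, 10)
b = similar(a)  # separate output buffer
mul!(b, a, a)   # writes a*a into b without allocating
# mul!(a, a, a) is not allowed: the output must not be aliased with an input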

1 Like

Note also that in this example the temporary matrices are created for each iteration of the nMax loop, not once per thread. The code can allocate less and be somewhat faster if the workload is divided into chunks:

julia> @btime test()
  380.045 ms (274 allocations: 182.22 KiB)

julia> using ChunkSplitters

julia> function test2(;nchunks=Threads.nthreads())
           nMax = 1e2
           mMax = 1e5
           Threads.@threads for (i_range, _) in chunks(1:nMax,nchunks)
               tmp = 0.0
               a = Matrix{Float64}(undef, 10, 10)
               b = similar(a)
               for i in i_range
                   for m = 1:mMax
                       rand!(a)
                       mul!(b, a, a)
                       tmp = tmp + sum(b)
                   end
               end
           end
       end
test2 (generic function with 1 method)

julia> @btime test2()
  333.823 ms (98 allocations: 28.41 KiB)
3 Likes

Might not answer the exact question, but some might find it useful anyway…

My experience with multithreaded BLAS is that it only pays off on very large problems. On my Linux Ryzen 3700X, it is often better to set BLAS to single-threaded if I run many matrix operations in the region of 100-1000 dimensions. BLAS can fully saturate the cores, but the system-time overhead is then about three times the actual useful computation time.

If the same problem needs to be solved for multiple inputs, it is much better to run multiple Julia instances and set the BLAS thread count to 1 in each of them. This is way faster than running multithreaded calculations in sequence.
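
A minimal sketch of both points (solve.jl and the input files are hypothetical placeholders):

using LinearAlgebra
BLAS.set_num_threads(1)  # keep BLAS serial inside each Julia process

A = rand(500, 500)
@time A * A              # compare against the timing with BLAS.set_num_threads(8)

# then, from the shell, run independent inputs as separate processes:
#   julia solve.jl input1.dat & julia solve.jl input2.dat & wait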

1 Like

Thank you so much, gdalle, jmair, lmiq, and fastwave! I am VERY sorry that my response is so slow…

I tried the suggested changes one by one, but my issue still remains… My code (test4.jl) now reads the same as the final one by lmiq:

using LinearAlgebra
using BenchmarkTools
using Random
using ChunkSplitters

BLAS.set_num_threads(1)

function test(;nchunks=Threads.nthreads())
    nMax = 1e2
    mMax = 1e5
    Threads.@threads for (i_range, _) in chunks(1:nMax,nchunks)
        tmp = 0.0
        a = Matrix{Float64}(undef, 10, 10)
        b = similar(a)
        for i in i_range
            for m = 1:mMax
                rand!(a)
                mul!(b, a, a)
                tmp = tmp + sum(b)
            end
        end
    end
end
out = @benchmark test() evals=1 samples=1 seconds=300
dump(out)

If I run it on the same machine (Ryzen 3970X with julia 1.8.5), the output is:

> julia -t 1 test4.jl
BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 300.0
    samples: Int64 1
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((1,)) [4.868270248e9]
  gctimes: Array{Float64}((1,)) [0.0]
  memory: Int64 2400
  allocs: Int64 8

>  julia -t 2 test4.jl
BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 300.0
    samples: Int64 1
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((1,)) [1.1478137366e10]
  gctimes: Array{Float64}((1,)) [0.0]
  memory: Int64 4800
  allocs: Int64 17

This still runs much slower with 2 threads than with 1. If I run the same code on my Mac mini (Julia 1.8.5, M1, macOS 11.7.2), the result is quite different:

> julia -t 1 test4.jl
BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 300.0
    samples: Int64 1
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((1,)) [2.964288542e9]
  gctimes: Array{Float64}((1,)) [0.0]
  memory: Int64 2400
  allocs: Int64 8

> julia -t 2 test4.jl
BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 300.0
    samples: Int64 1
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((1,)) [2.982752208e9]
  gctimes: Array{Float64}((1,)) [0.0]
  memory: Int64 4832
  allocs: Int64 18

It does not get faster, but at least it does not get slower… I am pretty confused now… Maybe I am doing something completely wrong… Any clue is highly welcome.

Thank you so much!
tj

1 Like

edited: I’m not sure if I can really reproduce the issue. It seems to happen from time to time, but I would need to be very careful in reproducing it, because it does not appear to be deterministic.

1 Like

And the rest of the time, what happens? Is the threaded version faster?

Ok, I can reproduce the problem now, systematically.

This is the code, which is simpler and uses only base functions. It also produces a result that cannot be optimized out:

import Pkg
Pkg.activate(".")

using LinearAlgebra
using BenchmarkTools
using Random

BLAS.set_num_threads(1)

function test()
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp  # keep a per-thread result so the work is not optimized away
    end
    return sum(s)
end

@btime test()

using MKL
@btime test()

Here is what I get:

% julia -t 1 code.jl
   Activating project at `~/Downloads`
  29.816 ms (208 allocations: 175.73 KiB)
  28.485 ms (208 allocations: 175.73 KiB)

which means that base BLAS and MKL mul! routines performed similarly.

Now with 4 threads:

% julia -t 4 code.jl
  Activating project at `~/Downloads`
  43.085 ms (226 allocations: 177.53 KiB)
  10.046 ms (226 allocations: 177.53 KiB)

Thus, LinearAlgebra.mul! is slower than with a single thread, while MKL.mul! is, as expected, faster.
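
As a side note, MKL.jl swaps the BLAS backend at runtime through libblastrampoline, so you can check which backend is actually in use (a sketch):

julia> using LinearAlgebra

julia> BLAS.get_config()  # lists the active backend, OpenBLAS by default

julia> using MKL

julia> BLAS.get_config()  # should now list MKL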

Bug report filed: multi-threaded parallel calls to `LinearAlgebra.mul!` are slower than serial calls · Issue #49455 · JuliaLang/julia · GitHub

1 Like

Are you sure this is unexpected? Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer

1 Like

I don’t know; anyway, I don’t see any way to use multi-threading effectively there while sticking to the base mul! (changing the number of BLAS threads to 4 does not affect the result).

As a curiosity, Octavian is much faster than BLAS or MKL here:

with -t4:

  2.798 ms (225 allocations: 177.50 KiB)  << octavian
  9.178 ms (226 allocations: 177.53 KiB)  << MKL

with -t1:

  8.105 ms (208 allocations: 175.73 KiB) << octavian
  26.863 ms (208 allocations: 175.73 KiB) << MKL

code:

import Pkg
Pkg.activate(".")

using BenchmarkTools
using Random
using Octavian

function test(local_mul!::F) where F<:Function  # type parameter forces specialization on the function argument
    nMax = 1e2
    mMax = 1e3
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:nMax
        tmp = 0.0
        a = rand(10,10)
        b = similar(a)
        for m = 1:mMax
            local_mul!(b, a, a)
            tmp = tmp + sum(b)
        end
        s[Threads.threadid()] = tmp
    end
    return sum(s)
end

@btime test($(Octavian.matmul!))

using MKL
@btime test($(MKL.mul!))

Hi, gdalle, and lmiq,

Thanks again! For my practical purposes, Octavian does the trick! Thank you also for filing the bug report. I will keep watching it.
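
For reference, the change was essentially just swapping the multiplication routine; a sketch of the pattern (not my full code):

using Octavian

a = rand(10, 10)
b = similar(a)
matmul!(b, a, a)  # Octavian’s pure-Julia multiply, drop-in for mul!(b, a, a)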

thanks,
tj