Performance issue with multithreaded computation with matrix operations at its heart (Threads.@threads vs. BLAS threads)

HanD · November 10, 2023, 4:20pm

While trying to speed up a long computation with chunking more threads on it, I ran into an interesting problem. I found several related threads here, but I couldn’t find a convincing answer, nor a solution to my issue.

Here’s an MVP of my issue. Consider the following definitions:

julia> using LinearAlgebra, BenchmarkTools
julia> ts = 1:10; xs = rand(1000, 10);
julia> function fit(x, y, degree)
           return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
       end

And then I run the following measurements, with JULIA_NUM_THREADS=8:

julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 1156 samples with 1 evaluation.
 Range (min … max):  4.181 ms …   6.373 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.228 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.326 ms ± 284.965 μs  ┊ GC (mean ± σ):  0.76% ± 3.36%

  ▄██▃▁                ▁                                       
  █████▆▆▅▆▅▅▅▄▅▅▄▄▄▅▁▆██▆▆▅▆█▅▄▄▁▅▁▄▅▁▄▄▁▅▄▄▅▅▇▇▆▆▅▄▅▁▅▁▅▁▄▅ █
  4.18 ms      Histogram: log(frequency) by time      5.48 ms <

 Memory estimate: 1.50 MiB, allocs estimate: 8000.

julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
 Range (min … max):  12.338 ms … 69.561 ms  ┊ GC (min … max): 0.00% … 77.62%
 Time  (median):     15.782 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.939 ms ±  3.073 ms  ┊ GC (mean ± σ):  1.08% ±  4.38%

                                        ▂▂▃█ ▁▄▃▆▄             
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▃▃▂▁▂▃▁▃▃▅▃▆██████████▅▂▃▁▁▂▁▁▁▁▂ ▃
  12.3 ms         Histogram: frequency by time        17.1 ms <

 Memory estimate: 1.50 MiB, allocs estimate: 8049.

Notice how the execution time went up, even though the computation runs on 8 threads instead of 1. If I observe CPU usage in htop, it is obvious that most of the resources are wasted in system calls, most of the bars are red. I found suggestions in various threads that calling BLAS.set_num_threads(1) could improve the situation, but in my case, it has no visible effect, I get identical results.

I’m guessing that the Julia threads compete for the LAPACK library calls, which thus constitutes a bottleneck, but I don’t know how to get around this issue. Ideally, the (more elaborate) computation would be running on a 64 core CPU, which is currently sitting mostly idle, because I can only run this on a single thread.

Any ideas?

HanD · November 10, 2023, 4:30pm

Okay, I found something interesting. If I remove the qr decomposition from fit, the single-threaded solution slows down by a factor of 2, but the multi-threaded solution suddenly becomes a lot more efficient:

julia> function fit(x, y, degree)
           return [x_n ^ k for x_n in x, k in 0:degree] \ y
       end

julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 568 samples with 1 evaluation.
 Range (min … max):  7.684 ms …  11.128 ms  ┊ GC (min … max): 15.37% … 14.02%
 Time  (median):     8.694 ms               ┊ GC (median):    15.73%
 Time  (mean ± σ):   8.803 ms ± 628.737 μs  ┊ GC (mean ± σ):  20.52% ±  5.92%

         ▂▄█▆▅▄▆█▆▄      ▂   ▁▃▄▆▅▄ ▄ ▂                        
  ▄▃▄▄▄▅████████████▄▄▆▃▄█▆▆▇██████▇███▄▅▅▄▄▂▆▄▄▄▂▂▂▁▁▂▁▂▁▁▃▃ ▄
  7.68 ms         Histogram: frequency by time        10.6 ms <

 Memory estimate: 69.03 MiB, allocs estimate: 45000.

julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 2620 samples with 1 evaluation.
 Range (min … max):  1.028 ms … 24.744 ms  ┊ GC (min … max):  0.00% … 11.35%
 Time  (median):     1.375 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.903 ms ±  1.430 ms  ┊ GC (mean ± σ):  25.81% ± 25.75%

   ▂▄▅▅██▆▂▁▁                                    ▁▁ ▁▄▅▃     ▁
  ▇██████████▇▇▇▆▇▇▆▅▄▃▃▁▁▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅█████████▇▆▅ █
  1.03 ms      Histogram: log(frequency) by time      4.3 ms <

 Memory estimate: 69.03 MiB, allocs estimate: 45049.

So what is going on with qr that makes it particularly inefficient when invoked from multiple threads in parallel?

gdalle · November 10, 2023, 4:48pm

I have several hypotheses here:

maybe it’s a nasty interaction between Julia threads and BLAS threads, as you mentioned => there is now a section in the performance tips about this
maybe it’s the additional allocations triggered by qr which render multithreading inefficient
maybe it’s linked to you benchmarking with globals
maybe it’s the macros @benchmark and @threads which don’t play nice together (unlikely)

carstenbauer · November 10, 2023, 4:55pm

You’re matrices are (too) tiny.

function serial(mat, vec)
    for i in 1:1000
        qr(mat) \ vec
    end
end

function threaded(mat, vec)
    @threads for i in 1:1000
        qr(mat) \ vec
    end
end

mat = rand(10,4)
vec = rand(10)

@btime serial($mat, $vec) # 6.052 ms (7000 allocations: 1.10 MiB)
@btime threaded($mat, $vec) # 17.643 ms (7031 allocations: 1.10 MiB)

mat2 = rand(1000,100)
vec2 = rand(1000)

@btime serial($mat2, $vec2) # 2.804 s (10000 allocations: 827.50 MiB)
@btime threaded($mat2, $vec2) # 1.024 s (10031 allocations: 827.50 MiB)

(with 6 threads)

HanD · November 10, 2023, 5:02pm

That may be so, but what can I do if I need to least square fit 3rd degree polynomials on 10 points. These are the matrices I have to work with.

As a workaround, I can remove the qr transformation, yielding the same result.

gdalle · November 10, 2023, 5:03pm

Use StaticArrays.jl?

HanD · November 13, 2023, 7:45am

I don’t see how that would help here, or even how it would work. For one thing, the size of the Vandermonde matrix above isn’t known at compile time, since the number of points to fit and the degree of the desired polynomial is only known at runtime. Also, the StaticArrays.jl manual suggests to use static arrays only below 100 items, and in this case, I can easily have more than that. The 3 and 10 are just specific examples.

gdalle · November 13, 2023, 5:00pm

My bad, when you said “these are the matrices I have to work with” I thought they would always look like that

Topic		Replies	Views
Julia Threads vs BLAS threads Internals & Design	16	11069	July 26, 2018
Multi-threading of julia-1.8.5 does not improve speed when combined with BLAS New to Julia	17	1529	May 1, 2023
Why doesn't multithreading help here? Performance	12	1459	August 22, 2020
Poor performance multiplying many (large) matrices multithreaded Performance question , linearalgebra	11	2519	July 13, 2020
Data structures for threaded computing Performance	23	2922	October 23, 2019

Performance issue with multithreaded computation with matrix operations at its heart (Threads.@threads vs. BLAS threads)

Related topics