Performance issue with a multithreaded computation that has matrix operations at its heart (Threads.@threads vs. BLAS threads)

While trying to speed up a long computation by throwing more threads at it, I ran into an interesting problem. I found several related threads here, but I couldn’t find a convincing answer, nor a solution to my issue.

Here’s an MWE (minimal working example) of my issue. Consider the following definitions:

julia> using LinearAlgebra, BenchmarkTools, Base.Threads  # Base.Threads provides @threads
julia> ts = 1:10; xs = rand(1000, 10);
julia> function fit(x, y, degree)
           return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
       end

And then I run the following measurements, with JULIA_NUM_THREADS=8:

julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 1156 samples with 1 evaluation.
 Range (min … max):  4.181 ms …   6.373 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.228 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.326 ms ± 284.965 μs  ┊ GC (mean ± σ):  0.76% ± 3.36%

  ▄██▃▁                ▁                                       
  █████▆▆▅▆▅▅▅▄▅▅▄▄▄▅▁▆██▆▆▅▆█▅▄▄▁▅▁▄▅▁▄▄▁▅▄▄▅▅▇▇▆▆▅▄▅▁▅▁▅▁▄▅ █
  4.18 ms      Histogram: log(frequency) by time      5.48 ms <

 Memory estimate: 1.50 MiB, allocs estimate: 8000.

julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
 Range (min … max):  12.338 ms … 69.561 ms  ┊ GC (min … max): 0.00% … 77.62%
 Time  (median):     15.782 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.939 ms ±  3.073 ms  ┊ GC (mean ± σ):  1.08% ±  4.38%

                                        ▂▂▃█ ▁▄▃▆▄             
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▃▃▂▁▂▃▁▃▃▅▃▆██████████▅▂▃▁▁▂▁▁▁▁▂ ▃
  12.3 ms         Histogram: frequency by time        17.1 ms <

 Memory estimate: 1.50 MiB, allocs estimate: 8049.

Notice how the execution time went up, even though the computation now runs on 8 threads instead of 1. If I observe CPU usage in htop, it is obvious that most of the resources are wasted in system calls: most of the bars are red. I found suggestions in various threads that calling BLAS.set_num_threads(1) could improve the situation, but in my case it has no visible effect; I get identical results.
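For reference, trying that amounts to something like the following sketch (BLAS is a submodule of LinearAlgebra, which is already loaded above):

julia> BLAS.set_num_threads(1)   # restrict the BLAS/LAPACK backend to a single thread

julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)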

I’m guessing that the Julia threads compete for the LAPACK library calls, which thus become a bottleneck, but I don’t know how to get around this issue. Ideally, the (more elaborate) computation would be running on a 64-core CPU, which is currently sitting mostly idle, because I can only run this on a single thread.

Any ideas?

Okay, I found something interesting. If I remove the qr decomposition from fit, the single-threaded solution slows down by a factor of 2, but the multi-threaded solution suddenly becomes a lot more efficient:

julia> function fit(x, y, degree)
           return [x_n ^ k for x_n in x, k in 0:degree] \ y
       end

julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 568 samples with 1 evaluation.
 Range (min … max):  7.684 ms …  11.128 ms  ┊ GC (min … max): 15.37% … 14.02%
 Time  (median):     8.694 ms               ┊ GC (median):    15.73%
 Time  (mean ± σ):   8.803 ms ± 628.737 μs  ┊ GC (mean ± σ):  20.52% ±  5.92%

         ▂▄█▆▅▄▆█▆▄      ▂   ▁▃▄▆▅▄ ▄ ▂                        
  ▄▃▄▄▄▅████████████▄▄▆▃▄█▆▆▇██████▇███▄▅▅▄▄▂▆▄▄▄▂▂▂▁▁▂▁▂▁▁▃▃ ▄
  7.68 ms         Histogram: frequency by time        10.6 ms <

 Memory estimate: 69.03 MiB, allocs estimate: 45000.

julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 2620 samples with 1 evaluation.
 Range (min … max):  1.028 ms … 24.744 ms  ┊ GC (min … max):  0.00% … 11.35%
 Time  (median):     1.375 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.903 ms ±  1.430 ms  ┊ GC (mean ± σ):  25.81% ± 25.75%

   ▂▄▅▅██▆▂▁▁                                    ▁▁ ▁▄▅▃     ▁
  ▇██████████▇▇▇▆▇▇▆▅▄▃▃▁▁▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅█████████▇▆▅ █
  1.03 ms      Histogram: log(frequency) by time      4.3 ms <

 Memory estimate: 69.03 MiB, allocs estimate: 45049.

So what is going on with qr that makes it particularly inefficient when invoked from multiple threads in parallel?

I have several hypotheses here:

  • maybe it’s a nasty interaction between Julia threads and BLAS threads, as you mentioned => there is now a section in the performance tips about this
  • maybe it’s the additional allocations triggered by qr which render multithreading inefficient
  • maybe it’s linked to your benchmarking with globals (see the sketch after this list)
  • maybe it’s the macros @benchmark and @threads which don’t play nice together (unlikely)
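A quick way to test the globals hypothesis is to move both loops into functions that take xs and ts as arguments, so nothing inside the hot loop refers to a non-constant global. A minimal sketch, reusing the fit defined above (the helper names bench_serial and bench_threaded are just for illustration):

julia> function bench_serial(xs, ts)
           for row in eachrow(xs)
               fit(ts, row, 3)
           end
       end

julia> function bench_threaded(xs, ts)
           Threads.@threads for row in collect(eachrow(xs))
               fit(ts, row, 3)
           end
       end

julia> @btime bench_serial($xs, $ts);

julia> @btime bench_threaded($xs, $ts);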

Your matrices are (too) tiny.

using LinearAlgebra, BenchmarkTools, Base.Threads

function serial(mat, vec)
    for i in 1:1000
        qr(mat) \ vec   # least-squares solve via explicit QR, as in the MWE above
    end
end

function threaded(mat, vec)
    @threads for i in 1:1000
        qr(mat) \ vec
    end
end

mat = rand(10,4)
vec = rand(10)

@btime serial($mat, $vec) # 6.052 ms (7000 allocations: 1.10 MiB)
@btime threaded($mat, $vec) # 17.643 ms (7031 allocations: 1.10 MiB)

mat2 = rand(1000,100)
vec2 = rand(1000)

@btime serial($mat2, $vec2) # 2.804 s (10000 allocations: 827.50 MiB)
@btime threaded($mat2, $vec2) # 1.024 s (10031 allocations: 827.50 MiB)

(with 6 threads)


That may be so, but what can I do if I need to least-squares fit 3rd-degree polynomials on 10 points? These are the matrices I have to work with.

As a workaround, I can remove the explicit qr decomposition, which yields the same result.
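For the record, the two give the same coefficients because the backslash operator on a tall (non-square) matrix already performs a QR-based least-squares solve internally. A quick check, reusing ts from above:

julia> A = [t ^ k for t in ts, k in 0:3];   # the 10×4 Vandermonde matrix from the example

julia> y = rand(10);

julia> qr(A) \ y ≈ A \ y   # both solve the same least-squares problem (expect true)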

Use StaticArrays.jl?
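For instance, if the sizes really were fixed at the 10-point, degree-3 case above, a static-array version of the fit could look roughly like this. It is only a sketch: it hard-codes the sizes and solves the normal equations on a 4×4 static system instead of using QR:

julia> using StaticArrays

julia> function fit_static(x::SVector{10}, y::SVector{10})
           # fixed-size 10×4 Vandermonde matrix; all sizes are compile-time constants
           A = hcat(ones(SVector{10,Float64}), x, x .^ 2, x .^ 3)
           return (A' * A) \ (A' * y)   # 4×4 normal equations, solved on the stack
       end

julia> fit_static(SVector{10}(float.(ts)), SVector{10}(rand(10)))

With fixed sizes the whole fit runs without heap allocations, which is typically what makes Julia-level threading pay off for problems this small.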


I don’t see how that would help here, or even how it would work. For one thing, the size of the Vandermonde matrix above isn’t known at compile time, since the number of points to fit and the degree of the desired polynomial are only known at runtime. Also, the StaticArrays.jl manual suggests using static arrays only for up to roughly 100 elements, and in this case I can easily have more than that. The 3 and 10 are just specific examples.

My bad. When you said “these are the matrices I have to work with”, I thought they would always look like that.