While trying to speed up a long computation by throwing more threads at it, I ran into an interesting problem. I found several related threads here, but none offered a convincing explanation or a solution to my issue.
Here’s a minimal working example of my issue. Consider the following definitions:
julia> using LinearAlgebra, BenchmarkTools, Base.Threads
julia> ts = 1:10; xs = rand(1000, 10);
julia> function fit(x, y, degree)
           return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
       end
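To make the definition concrete: `fit` builds a Vandermonde matrix from `x` and solves the least-squares problem through its QR factorization, so it returns `degree + 1` polynomial coefficients, lowest order first. A quick sanity check, as a sketch only (the cubic and the names `coeffs_true` and `ys` are made up for illustration and are not part of my actual computation):

```julia
using LinearAlgebra

# Same definition as above: least-squares polynomial fit via QR of a Vandermonde matrix.
function fit(x, y, degree)
    return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
end

ts = 1:10

# Hypothetical test data: samples of the exact cubic 1 + 2t - 0.5t^2 + 0.1t^3.
coeffs_true = [1.0, 2.0, -0.5, 0.1]
ys = [sum(c * t^(k - 1) for (k, c) in enumerate(coeffs_true)) for t in ts]

# The fit should recover the coefficients up to floating-point error.
coeffs = fit(ts, ys, 3)
println(coeffs)
```

Each call works on a tiny 10×4 matrix, so the per-call cost is small compared to what LAPACK is usually tuned for.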
And then I run the following measurements, with JULIA_NUM_THREADS=8:
julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 1156 samples with 1 evaluation.
Range (min … max): 4.181 ms … 6.373 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.228 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.326 ms ± 284.965 μs ┊ GC (mean ± σ): 0.76% ± 3.36%
▄██▃▁ ▁
█████▆▆▅▆▅▅▅▄▅▅▄▄▄▅▁▆██▆▆▅▆█▅▄▄▁▅▁▄▅▁▄▄▁▅▄▄▅▅▇▇▆▆▅▄▅▁▅▁▅▁▄▅ █
4.18 ms Histogram: log(frequency) by time 5.48 ms <
Memory estimate: 1.50 MiB, allocs estimate: 8000.
julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
BenchmarkTools.Trial: 314 samples with 1 evaluation.
Range (min … max): 12.338 ms … 69.561 ms ┊ GC (min … max): 0.00% … 77.62%
Time (median): 15.782 ms ┊ GC (median): 0.00%
Time (mean ± σ): 15.939 ms ± 3.073 ms ┊ GC (mean ± σ): 1.08% ± 4.38%
▂▂▃█ ▁▄▃▆▄
▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▃▃▂▁▂▃▁▃▃▅▃▆██████████▅▂▃▁▁▂▁▁▁▁▂ ▃
12.3 ms Histogram: frequency by time 17.1 ms <
Memory estimate: 1.50 MiB, allocs estimate: 8049.
Notice how the execution time went up (median 4.2 ms to 15.8 ms, roughly 3.7 times slower), even though the computation runs on 8 threads instead of 1. If I observe CPU usage in htop, it is obvious that most of the CPU time is spent in system calls: most of the bars are red. I found suggestions in various threads that calling BLAS.set_num_threads(1) could improve the situation, but in my case it has no visible effect; I get identical results.
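For reference, a minimal sketch of how the two thread counts can be set and checked side by side. It assumes Julia 1.6 or later, where `BLAS.get_num_threads()` is available; the Julia thread count itself comes from `JULIA_NUM_THREADS` and cannot be changed from inside the session.

```julia
using LinearAlgebra
using Base.Threads

# Number of Julia threads, fixed at startup by JULIA_NUM_THREADS (or the -t flag).
println("Julia threads: ", nthreads())

# Number of threads the BLAS backend uses internally, before and after the change.
println("BLAS threads before: ", BLAS.get_num_threads())
BLAS.set_num_threads(1)
println("BLAS threads after:  ", BLAS.get_num_threads())
```

This only confirms that the setting took effect; it says nothing about whether BLAS threading is the cause of the slowdown.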
I’m guessing that the Julia threads contend for the LAPACK library calls, which would make those calls a bottleneck, but I don’t know how to get around this issue. Ideally, the (more elaborate) computation would run on a 64-core CPU, which is currently sitting mostly idle because I can only run this on a single thread.
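One variant that could be compared against the benchmarks above is a coarser-grained split: one task per chunk of rows instead of scheduling row by row, with the results stored so the work cannot be optimized away. This is only a sketch under assumptions: `fit_all_chunked` and its `nchunks` keyword are hypothetical names, and I am not claiming it avoids the slowdown.

```julia
using LinearAlgebra
using Base.Threads

function fit(x, y, degree)
    return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
end

# Hypothetical helper: fit every row of `xs`, spawning one task per chunk of rows.
function fit_all_chunked(ts, xs, degree; nchunks = nthreads())
    rows = collect(eachrow(xs))
    results = Vector{Vector{Float64}}(undef, length(rows))

    # Split the row indices into at most `nchunks` contiguous ranges.
    chunk_len = cld(length(rows), nchunks)
    chunks = collect(Iterators.partition(eachindex(rows), chunk_len))

    # Each task writes only to its own indices, so no locking is needed.
    tasks = map(chunks) do idxs
        Threads.@spawn for i in idxs
            results[i] = fit(ts, rows[i], degree)
        end
    end
    foreach(wait, tasks)
    return results
end

xs_small = rand(20, 10)          # small made-up input, just to exercise the helper
res = fit_all_chunked(1:10, xs_small, 3)
println(length(res), " fits, ", length(res[1]), " coefficients each")
```

Timing this with `@benchmark fit_all_chunked($ts, $xs, 3)` for different values of `JULIA_NUM_THREADS` would show whether the scheduling granularity matters at all here.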
Any ideas?