While trying to speed up a long computation by throwing more threads at it, I ran into an interesting problem. I found several related threads here, but I couldn’t find a convincing answer or a solution to my issue.
Here’s a minimal working example of my issue. Consider the following definitions:
    julia> using LinearAlgebra, BenchmarkTools, Base.Threads

    julia> ts = 1:10; xs = rand(1000, 10);

    julia> function fit(x, y, degree)
               # Least-squares polynomial fit: build the Vandermonde matrix and solve via QR.
               return qr([x_n ^ k for x_n in x, k in 0:degree]) \ y
           end
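As a sanity check on what fit computes, here is a minimal sketch that feeds it an exact cubic and compares the recovered coefficients. The coefficient values and the names coeffs_true and ys are made up for illustration; only fit and ts come from the definitions above.

```julia
using LinearAlgebra

# Same definition as above: least-squares fit via a QR-factorized Vandermonde matrix.
fit(x, y, degree) = qr([x_n ^ k for x_n in x, k in 0:degree]) \ y

ts = 1:10

# Hypothetical test data: the cubic 2 - x + 0.5x^2 + 0.1x^3 sampled at ts.
coeffs_true = [2.0, -1.0, 0.5, 0.1]
ys = [sum(c * t^(k - 1) for (k, c) in enumerate(coeffs_true)) for t in ts]

# Coefficients come back in increasing order of degree (constant term first).
coeffs = fit(ts, ys, 3)
println(coeffs)
```

Each call builds a fresh 10×4 matrix and factorizes it, so every row of xs costs a few small allocations plus one small QR factorization.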
And then I run the following measurements, with
    julia> @benchmark (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
    BenchmarkTools.Trial: 1156 samples with 1 evaluation.
     Range (min … max):  4.181 ms … 6.373 ms    ┊ GC (min … max): 0.00% … 0.00%
     Time  (median):     4.228 ms               ┊ GC (median):    0.00%
     Time  (mean ± σ):   4.326 ms ± 284.965 μs  ┊ GC (mean ± σ):  0.76% ± 3.36%
     4.18 ms    Histogram: log(frequency) by time    5.48 ms <
     Memory estimate: 1.50 MiB, allocs estimate: 8000.

    julia> @benchmark @threads (for row in $(collect(eachrow(xs))); fit(ts, row, 3); end)
    BenchmarkTools.Trial: 314 samples with 1 evaluation.
     Range (min … max):  12.338 ms … 69.561 ms  ┊ GC (min … max): 0.00% … 77.62%
     Time  (median):     15.782 ms              ┊ GC (median):    0.00%
     Time  (mean ± σ):   15.939 ms ±  3.073 ms  ┊ GC (mean ± σ):  1.08% ±  4.38%
     12.3 ms    Histogram: frequency by time    17.1 ms <
     Memory estimate: 1.50 MiB, allocs estimate: 8049.
Notice how the execution time went up, even though the computation runs on 8 threads instead of 1. When I watch CPU usage in htop, it is obvious that most of the resources are wasted in system calls: most of the bars are red. I found suggestions in various threads that calling BLAS.set_num_threads(1) could improve the situation, but in my case it has no visible effect; I get identical results.
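To confirm that both settings are what I think they are, the thread counts can be printed directly. This is only a small check, assuming Julia 1.6 or later (where BLAS.get_num_threads is available); it does not change the timings by itself.

```julia
using LinearAlgebra

# Number of Julia threads; fixed at startup (e.g. `julia -t 8` or JULIA_NUM_THREADS).
println("Julia threads:       ", Threads.nthreads())

# Number of threads used by the BLAS backend, before and after restricting it.
println("BLAS threads before: ", BLAS.get_num_threads())
BLAS.set_num_threads(1)
println("BLAS threads after:  ", BLAS.get_num_threads())
```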
I’m guessing that the Julia threads compete for the LAPACK library calls, which would make those calls a bottleneck, but I don’t know how to get around this issue. Ideally, the (more elaborate) computation would run on a 64-core CPU, which is currently sitting mostly idle because I can only run this on a single thread.
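For anyone who wants to reproduce this outside of @benchmark, here is a sketch of a variant that keeps the results and gives each task one contiguous chunk of rows. The helper fit_rows_chunked and its ntasks keyword are hypothetical names introduced here, and this is not a confirmed fix: whether it helps depends on where the time is actually going, which the measurements above don't settle.

```julia
using LinearAlgebra
using Base.Threads: @spawn, nthreads

# Same definition as in the example above.
fit(x, y, degree) = qr([x_n ^ k for x_n in x, k in 0:degree]) \ y

# Hypothetical helper: fit every row of xs, one task per contiguous chunk of rows.
function fit_rows_chunked(ts, xs, degree; ntasks = nthreads())
    nrows = size(xs, 1)
    out = Matrix{Float64}(undef, nrows, degree + 1)
    # Split the row indices into at most ntasks chunks of roughly equal size.
    chunks = Iterators.partition(1:nrows, cld(nrows, ntasks))
    tasks = map(chunks) do rows
        @spawn for i in rows
            # Each task writes only to its own rows, so no locking is needed.
            out[i, :] = fit(ts, view(xs, i, :), degree)
        end
    end
    foreach(wait, tasks)
    return out
end

ts = 1:10
xs = rand(1000, 10)

# Serial reference result, one row of coefficients per row of xs.
serial = permutedims(reduce(hcat, [fit(ts, row, 3) for row in eachrow(xs)]))
threaded = fit_rows_chunked(ts, xs, 3)
println(threaded ≈ serial)
```

Timing fit_rows_chunked(ts, xs, 3) with different ntasks values should show whether the slowdown depends on the number of tasks or on the threaded loop construct itself.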