I am running the following (test) code on a 24 core machine
julia> Threads.nthreads()
24
julia> using LinearAlgebra
julia> function foo()
           Threads.@threads for i in 1:1000000
               H = rand(100, 100) + im * rand(100, 100)
               eigen(Hermitian(H))
           end
       end
foo (generic function with 1 method)
julia> foo()
However, from the top command, I read
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2196 myname 20 0 4277816 1.652g 121204 R 1858 1.775 5:11.24 julia
I assume from this output that some of the computational power is not being utilized. What's the problem here?
julia> function foo()
           Threads.@threads for i in 1:1000000000000
               exp(sin(cos(exp(sin(rand())))))
           end
       end
foo (generic function with 1 method)
julia> foo()
Then top gives me better usage:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3410 myname 20 0 2648668 283072 84088 R 2365 0.290 29:42.46 julia
julia> Threads.nthreads()
24
julia> using LinearAlgebra
julia> BLAS.set_num_threads(24)
julia> function foo()
           for i in 1:1000000
               H = rand(100, 100) + im * rand(100, 100)
               eigen(H)
           end
       end
foo (generic function with 1 method)
julia> foo()
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3640 myname 20 0 3198364 331372 99372 R 2363 0.339 21:33.06 julia
Actually, I am wondering whether this is reproducible for someone else, or whether it is a problem with my cluster.
Clearly I want to utilize the full computational power of my cluster and make my code as quick as possible. This is an MWE from a much larger code base, and I am experimenting with multi-threading.
From the stats you posted, you're very close to using the full computational power of your cluster. Your system load average is just that: a system-wide average. You cannot expect to have all cores of a multi-process, multi-user system working on your problem exclusively.
Getting 2300+% on a 24-core system is going to be the best you can do. And you’re getting that as long as you don’t allow Julia’s threads and BLAS threads to stomp on each other, and as long as you optimize your threads for the problem at hand.
Also, do you know how many physical cores your CPU has? If it has 24 threads, it is likely that it only has 12 physical cores, in which case there will be sharply diminishing returns beyond 1200% for many workloads – including BLAS and LAPACK.
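If you are not sure, you can check from within Julia. A minimal sketch; Hwloc.jl is a third-party package you would need to install first:

Sys.CPU_THREADS              # logical CPUs; counts hyperthreads twice
using Hwloc                  # assumption: the Hwloc.jl package is installed
Hwloc.num_physical_cores()   # physical cores only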
Each physical core has its own set of execution units.
Keeping these execution units busy requires very well optimized code. Normally, a few of them will be sitting idle. A second thread on the same physical core can share these execution units, to try and get closer to 100% utilization.
But many BLAS/LAPACK routines are so well optimized that a single thread can very nearly use a core to its fullest. The cache contention from extra threads often actually hurts the performance of these routines, meaning some of them perform best with only a single thread per physical core.
I find that when each thread is allocating memory in the threaded loop, it’s often hard to maximize the benefit of threading. If the allocations can be hoisted out of the loop and made thread specific, it’s usually a big win.
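For the example above, that might look something like the following sketch. The :static schedule keeps each iteration pinned to one thread so threadid() is stable, and foo_hoisted is a hypothetical name; note that eigen still allocates internally, so this only removes the input allocation:

using LinearAlgebra, Random

function foo_hoisted()
    # one preallocated scratch matrix per thread, allocated once up front
    buffers = [Matrix{ComplexF64}(undef, 100, 100) for _ in 1:Threads.nthreads()]
    Threads.@threads :static for i in 1:1000000
        H = buffers[Threads.threadid()]
        rand!(H)                 # fill in place instead of allocating a new matrix
        eigen(Hermitian(H))
    end
end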
My point is that if I use BLAS.set_num_threads(1), I cannot get 2300%+. The best I can do is 2100%. I don’t think I should do BLAS.set_num_threads(24) because a large part of my code is not handled by BLAS.
I do have 24 physical cores, that's for sure. There are two CPUs in my computing node, and each of them has 12 physical cores. Hyperthreading is disabled.
I was wondering whether I should migrate my code from the current multi-processing model to a multi-threading model. With the current multi-processing code I am getting ~100% usage on all cores.
My current code computes something different from what I am showing in this post, but the model is the same: it is just a simple for loop, and the iterations do not talk to each other at all. Therefore I would expect CPU usage similar to multi-processing. I am wondering what makes it difficult to consume all the CPU power with multi-threading.
Are you comparing execution times using @time or @btime from BenchmarkTools?
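For example, a sketch with a hypothetical, scaled-down bar() so the benchmark finishes quickly:

using BenchmarkTools, LinearAlgebra

function bar()
    Threads.@threads for i in 1:100
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

@time bar()    # a single run; the first call also measures compilation
@btime bar()   # many samples after warm-up; reports the minimum time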
There are definitely situations in which using threads, or increasing the number of threads, slows down the overall computation. So unless you want to test your CPU cooler, maximizing CPU utilization is probably not the right metric.
That’s not what your stats above show. If you’re not using BLAS operations, then setting BLAS threads to 1 and nthreads to 24 gave you a CPU utilization of 2365%.
When you’re using BLAS operations, setting nthreads to 1 and BLAS threads to 24 gave you 2363% utilization.
My point is that you should pick the threading model that works for your data. If you’re not using BLAS, or it’s not in your hot path, don’t set BLAS threads > 1. If you’re using BLAS extensively, prefer BLAS threads to Julia threads.
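In code, the two configurations would look roughly like this (a sketch, assuming a 24-core machine):

using LinearAlgebra

# Hot loop is Julia-threaded: keep BLAS single-threaded to avoid oversubscription
BLAS.set_num_threads(1)     # then run the Threads.@threads loop

# Hot loop is dominated by BLAS/LAPACK calls: give BLAS all the cores instead
BLAS.set_num_threads(24)    # then run a plain serial loop around eigen, mul!, etc.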
When looking at performance, you have to consider the whole architecture, not just CPU performance: the speed and layout of the caches and the speed of main memory access matter too.
I recently found this course which is very much worth following: http://wgropp.cs.illinois.edu/courses/cs598-s16/index.htm
I guess with modern compilers this is hidden, but when I was a young thing, people would tune the size of their arrays to be a multiple of the word length and to fit within cache lines.
I would make those matrices 128x128!
I’ll try to learn more about high performance computing.
Do I understand correctly that tasks which do not talk to each other are better handled with multi-processing? If I change my original code to multi-processing, it is 40% faster.
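For reference, a multi-processing version of the original loop might look roughly like this sketch using the Distributed standard library; the worker count is an assumption:

using Distributed
addprocs(24)                         # assumption: one worker per physical core

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # one BLAS thread per worker

# embarrassingly parallel: each iteration is independent
@sync @distributed for i in 1:1000000
    H = rand(100, 100) + im * rand(100, 100)
    eigen(Hermitian(H))
end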