Why is Julia not using all my CPU?

I am running the following (test) code on a 24-core machine:

julia> Threads.nthreads()
24

julia> using LinearAlgebra

julia> function foo()
           Threads.@threads for i in 1:1000000
               H = rand(100, 100) + im * rand(100, 100)
               eigen(Hermitian(H))
           end
       end
foo (generic function with 1 method)

julia> foo()

However, the top command shows:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2196 myname     20   0 4277816 1.652g 121204 R  1858 1.775   5:11.24 julia

From this output, I assume that some of the computing power is not being used. What’s the problem here?

Just a guess, and it could be completely irrelevant: is your code stepping on BLAS?

How many threads is BLAS using? ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()) will tell you.

If you set the BLAS threads to 1, do you see improved CPU utilization?
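
On Julia 1.6 and later, the same check can be done through LinearAlgebra without the ccall. This is a minimal sketch that assumes a recent Julia version; the variable name `old` is only for illustration.

```julia
using LinearAlgebra

# Query how many threads BLAS is currently using.
old = BLAS.get_num_threads()

# Restrict BLAS to a single thread so it does not compete with Julia's threads.
BLAS.set_num_threads(1)

println("BLAS threads: ", old, " -> ", BLAS.get_num_threads())
```

Run this before starting the threaded loop, then compare the CPU usage in top with and without the call.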

BLAS was using 8 threads.

After BLAS.set_num_threads(1), the CPU usage rises to around 2100%. This is a bit of an improvement, but I was expecting something close to 2400%.

What happens if you set nthreads to, say, 30?

This code is not the only thing running on the server, is my guess. It’s going to have some context- and process-switching.

I only have 24 cores, and Julia will ignore my request for 30 threads (Environment Variables · The Julia Language).

But if I define

julia> function foo()
           Threads.@threads for i in 1:1000000000000
               exp(sin(cos(exp(sin(rand())))))
           end
       end
foo (generic function with 1 method)

julia> foo()

then top gives me better usage

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3410 myname     20   0 2648668 283072  84088 R  2365 0.290  29:42.46 julia

OK, and just to close this out: what happens if you set your nthreads to 1, but your BLAS threads to 24, with your original code?

julia> Threads.nthreads()
24

julia> using LinearAlgebra

julia> BLAS.set_num_threads(24)

julia> function foo()
           for i in 1:1000000
               H = rand(100, 100) + im * rand(100, 100)
               eigen(H)
           end
       end
foo (generic function with 1 method)

julia> foo()

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3640 myname     20   0 3198364 331372  99372 R  2363 0.339  21:33.06 julia

Actually, I am wondering whether anyone else can reproduce this. Or is this a problem with my cluster?

Why do you think this is a problem?

Clearly, I want to use the full computational power of my cluster and make my code as fast as possible. This was an MWE from a much larger code base, and I am experimenting with multi-threading.

From the stats you posted, you’re very close to using the full computational power of your cluster. Your system load average is simply that, and you cannot expect to have all cores of a multiprocess, multiuser system working on your problem exclusively.

Getting 2300+% on a 24-core system is going to be the best you can do. And you’re getting that as long as you don’t allow Julia’s threads and BLAS threads to stomp on each other, and as long as you optimize your threads for the problem at hand.

Also, do you know how many physical cores your CPU has? If it has 24 threads, it is likely that it only has 12 physical cores, in which case there will be sharply diminishing returns beyond 1200% for many workloads – including BLAS and LAPACK.

Each physical core has its own set of execution units.
Keeping these execution units busy requires very well optimized code. Normally, a few of them will be sitting idle. A second thread on the same physical core can share these execution units, to try and get closer to 100% utilization.
But for many BLAS/LAPACK routines, they’re so well optimized that a single thread is often able to very nearly use a core to its fullest. Often, the cache contention of extra threads will actually hurt the performance of these routines, meaning some of them will actually perform best with only a single thread per physical core.

I find that when each thread is allocating memory in the threaded loop, it’s often hard to maximize the benefit of threading. If the allocations can be hoisted out of the loop and made thread specific, it’s usually a big win.
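
A minimal sketch of that idea, under some assumptions: the function name `foo_prealloc` and the `ntasks` keyword are hypothetical, and it computes only eigenvalues with `eigvals!` rather than the full `eigen` from the original post. Each task allocates its matrix once and refills it in place, instead of allocating a new matrix on every iteration.

```julia
using LinearAlgebra
using Random

# Solve `niter` random Hermitian eigenproblems of size n-by-n, split into
# roughly one chunk per task. Assumes niter >= 1.
function foo_prealloc(niter::Int, n::Int; ntasks::Int = Threads.nthreads())
    chunks = Iterators.partition(1:niter, cld(niter, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            # Task-local buffer, allocated once per task.
            H = Matrix{ComplexF64}(undef, n, n)
            acc = 0.0
            for _ in chunk
                rand!(H)                        # refill the buffer in place
                vals = eigvals!(Hermitian(H))   # overwrites H, returns sorted real eigenvalues
                acc += vals[end]                # keep a result so the work is used
            end
            acc
        end
    end
    return sum(fetch, tasks)
end

println(foo_prealloc(100, 100))
```

The LAPACK call still allocates its own workspace and the eigenvalue vector, so this removes only part of the per-iteration allocation. Giving each task its own buffer avoids indexing shared storage by thread id.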

@anon94023334

Thank you.

My point is that if I use BLAS.set_num_threads(1), I cannot get 2300%+. The best I can do is 2100%. I don’t think I should do BLAS.set_num_threads(24) because a large part of my code is not handled by BLAS.

@Elrod

I do have 24 physical cores, that’s for sure. There are two CPUs in my computing node and each of them has 12 physical cores. Hyper-threading is disabled.

I was wondering whether I should migrate my code from the current multi-processing model to a multi-threading model. With the current multi-processing code, I am getting ~100% CPU usage on all cores.

My current code computes something different from what I am showing in this post, but the model is the same: it is just a simple for loop, and the threads do not talk to each other at all. Therefore I would expect CPU usage similar to multi-processing. I am wondering what makes it difficult to consume all the CPU power with multi-threading.

Are you comparing execution times using @time or @btime from BenchmarkTools?

There are definitely situations in which using threads or increasing the number of threads slows down the overall computation, so unless you are wanting to test your CPU cooler, maximizing CPU utilization is probably not the correct metric.

I think you are expecting a lot if you want 100% usage on each core all of the time…

I would start by running on 2,4,8,16,24 cores and seeing what the scaling is like.
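
One rough way to do that from inside a single Julia session is to time an increasing number of independent tasks. This is only a sketch: `work`, `time_tasks`, and `scaling_table` are hypothetical names, and the kernel is a stand-in modeled on the exp(sin(cos(...))) loop earlier in the thread. The number of tasks that can actually run in parallel is capped by Threads.nthreads(), so start Julia with JULIA_NUM_THREADS set to the core count.

```julia
# Hypothetical CPU-bound kernel that does not allocate.
function work(n::Int)
    s = 0.0
    for i in 1:n
        s += exp(sin(cos(exp(sin(i / n)))))
    end
    return s
end

# Run `ntasks` independent copies of the kernel and return the wall time in seconds.
function time_tasks(ntasks::Int, n::Int)
    return @elapsed begin
        tasks = map(_ -> Threads.@spawn(work(n)), 1:ntasks)
        foreach(wait, tasks)
    end
end

# Time each task count. Every task does the same amount of work, so with
# perfect scaling the wall time stays flat up to Threads.nthreads() tasks.
function scaling_table(counts, n::Int)
    time_tasks(1, n)  # warm-up run so compilation is not measured
    return [(ntasks = c, seconds = time_tasks(c, n)) for c in counts]
end

for row in scaling_table((1, 2, 4, 8, 16, 24), 10^5)
    println(row.ntasks, " tasks: ", round(row.seconds; digits = 4), " s")
end
```

If the time starts climbing well before the task count reaches the core count, that points at contention (memory, allocation, or NUMA) rather than a lack of threads.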

Also think of NUMA hits / misses - run numastat
Also a good tool to use is htop rather than top

I found this article Profiling tool wins and woes

Has anyone done much with ‘perf’ and Julia code?
https://hpc-wiki.info/hpc/Perf

That’s not what your stats above show. If you’re not using BLAS operations, then setting BLAS threads to 1 and nthreads to 24 gave you a CPU utilization of 2365%.

When you’re using BLAS operations, setting nthreads to 1 and BLAS threads to 24 gave you 2363% utilization.

My point is that you should pick the threading model that works for your data. If you’re not using BLAS, or it’s not in your hot path, don’t set BLAS threads > 1. If you’re using BLAS extensively, prefer BLAS threads to Julia threads.

When looking at performance, you have to consider the whole architecture, not just the CPU performance: the speed and layout of the caches and the speed of main memory access also matter.
I recently found this course which is very much worth following:
http://wgropp.cs.illinois.edu/courses/cs598-s16/index.htm

I guess with modern compilers this is hidden, but when I were a young thing people would tune the size of their arrays to be a multiple of the word length and to fit into the size of cache lines.
I would make those matrices 128x128 !!

Thank you.

I’ll try to learn more about high performance computing.

Do I understand correctly that tasks that do not talk to each other are better handled with multiprocessing? If I change my original code to multiprocessing, it is 40% faster.