@threads uses only half the number of nthreads()

Recently I built a PC with a Ryzen Threadripper 3990X, which has 64 cores and 128 threads.
The OS is Windows 10 Pro.

When I tried the code below, only 64 threads reached 100% as shown in the attached image.
Why aren’t all the threads being used?

julia> using Base.Threads
julia> nthreads()
128
julia> @threads for i in 1:10000 rand(100000) end

this takes 0.8 s with 16 threads on my 4900HS, can imagine that your 3990X is simply too fast for

What happens when you generate more random numbers?

@threads for i in 1:10000 rand(100000,50) end

If you’re bound by computational power, using more threads than CPU cores usually does not help, since each core can only be busy with one thread at a time. Hyperthreading can help if a task is waiting for something, but it usually does not help in compute-bound scenarios.


Something I would try is to run a second instance of Julia in a new terminal window and run the @threads command in both. Does that use all 128 cores? If it doesn’t, then check your BIOS to see if hyperthreading is disabled. Or check Windows, maybe your version doesn’t use more than 64 cores?

@pixel27

Or check Windows, maybe your version doesn’t use more than 64 cores?

Thank you! I have now found the following article.

AMD has come forward to clarify issues around its Ryzen Threadripper 3990X flagship 64-core HEDT processor, which as you may have seen, has recently been reported as not running with its full capabilities on Windows 10 Pro – due to the OS not being able to handle 128-threads.

AMD sent out a statement as follows: “Higher editions/versions of Windows 10 confer no additional performance or compatibility benefits to the processor. We understand that this suggestion has been made in the media, but we believe this to be an error in testing that our team is presently diagnosing.”

I will check my own version, update, and try again!

PS: Hmmm, my Windows 10 Pro is already the newest version…

@ranocha
I see! I didn’t know much about hyperthreading.
So the CPU may be running at full capacity even though it appears to use only 64 threads (provided those 64 threads are running on 64 separate cores).

@MatFi

this takes 0.8 s with 16 threads on my 4900HS, can imagine that your 3990X is simply too fast for

Oh… With my 3990X, it takes 2.5 seconds…

julia> @benchmark @threads for i in 1:10000 rand(100000) end

BenchmarkTools.Trial:
  memory estimate:  7.45 GiB
  allocs estimate:  20643
  --------------
  minimum time:     1.698 s (26.38% GC)
  median time:      2.487 s (37.38% GC)
  mean time:        4.204 s (62.77% GC)
  maximum time:     8.427 s (77.59% GC)
  --------------
  samples:          3
  evals/sample:     1

Now I tried turning off SMT, which gives 64 cores and 64 threads (logical CPU cores).
The result was better than before.

julia> @benchmark @threads for i in 1:10000 rand(100000) end

BenchmarkTools.Trial:
  memory estimate:  7.45 GiB
  allocs estimate:  20324
  --------------
  minimum time:     944.958 ms (32.53% GC)
  median time:      1.006 s (32.18% GC)
  mean time:        1.231 s (36.41% GC)
  maximum time:     1.885 s (42.55% GC)
  --------------
  samples:          5
  evals/sample:     1

I will keep using this configuration for a while.
Thank you, everyone! @MatFi @ranocha @pixel27

This issue is almost certainly caused by Windows processor groups.

On dual-socket systems, or on systems with more than 64 hardware threads, all threads that a process creates on Windows are by default placed in the same processor group, that is, on a subset of at most 64 logical processors.

This problem is easy to work around with a little code at thread creation in C++, and I have written this code for my own C++ programs. If I can figure out how to put it in Julia for Windows, I’ll take a stab at it.

I have a 128 core/256 thread AMD system. Julia threads on it will only use 1/4 of the cores :-(.

If anyone could help me with where I could code this, it’s a simple loop that iterates over the thread handles and makes a couple of Windows calls.
Perhaps it can be done without modifying Julia or its libraries at all, if I can iterate over the Windows thread handles and make some calls into the Windows DLLs.
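
A minimal sketch of what this might look like from inside Julia, assuming 64-bit Windows and Julia 1.5 or later (for `@threads :static`). `GetActiveProcessorGroupCount`, `GetActiveProcessorCount`, `GetCurrentThread` and `SetThreadGroupAffinity` are real kernel32 functions; the names `GroupAffinity`, `full_mask`, `group_for_thread` and `spread_threads_over_groups` are hypothetical, and the round-robin policy is only one possible choice. Instead of iterating thread handles from outside, each Julia thread moves itself. This has not been verified on a multi-group machine.

```julia
using Base.Threads

# Mirrors the Win32 GROUP_AFFINITY struct (assumes the 64-bit layout).
struct GroupAffinity
    mask::UInt64                 # KAFFINITY: one bit per logical processor in the group
    group::UInt16                # processor group number (0-based)
    reserved::NTuple{3,UInt16}   # must be zero
end

# Mask with the lowest `n` bits set (a group holds at most 64 processors).
full_mask(n::Integer) = n >= 64 ? typemax(UInt64) : (UInt64(1) << n) - UInt64(1)

# Hypothetical policy: map a 1-based Julia thread id round-robin to a 0-based group.
group_for_thread(tid::Integer, ngroups::Integer) = UInt16((tid - 1) % ngroups)

# Returns how many Julia threads were successfully moved (0 if not applicable).
function spread_threads_over_groups()
    Sys.iswindows() || return 0
    ngroups = Int(ccall((:GetActiveProcessorGroupCount, "kernel32"), UInt16, ()))
    ngroups > 1 || return 0
    moved = Atomic{Int}(0)
    # :static runs exactly one iteration on each Julia thread,
    # so every thread sets its own affinity.
    @threads :static for i in 1:nthreads()
        g = group_for_thread(threadid(), ngroups)
        ncpu = ccall((:GetActiveProcessorCount, "kernel32"), UInt32, (UInt16,), g)
        if ncpu > 0
            aff = Ref(GroupAffinity(full_mask(ncpu), g, (0x0000, 0x0000, 0x0000)))
            h = ccall((:GetCurrentThread, "kernel32"), Ptr{Cvoid}, ())
            ok = ccall((:SetThreadGroupAffinity, "kernel32"), Cint,
                       (Ptr{Cvoid}, Ref{GroupAffinity}, Ptr{Cvoid}), h, aff, C_NULL)
            ok != 0 && atomic_add!(moved, 1)
        end
    end
    return moved[]
end
```

Calling `spread_threads_over_groups()` once at the start of a session, before any `@threads` work, would be the intended usage. On non-Windows systems or machines with a single processor group it does nothing and returns 0.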

This benchmark is RAM bottlenecked.
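
One way to check this is to take allocation and GC out of the measurement. The sketch below reuses one preallocated buffer per task and fills it in place with `rand!`, instead of allocating a fresh 100000-element array on every iteration as `rand(100000)` does. The function name `fill_in_place!` and the seeding scheme are made up for illustration.

```julia
using Base.Threads
using Random

# Fill each preallocated buffer `iters` times, reusing its memory
# instead of allocating a new array on every iteration.
function fill_in_place!(buffers::Vector{Vector{Float64}}, iters::Integer)
    @threads for t in 1:length(buffers)
        rng = MersenneTwister(t)   # separate generator per buffer, so no shared RNG state
        buf = buffers[t]
        for _ in 1:iters
            rand!(rng, buf)        # overwrite the buffer in place
        end
    end
    return buffers
end

# One buffer per Julia thread, same element count as in the original loop.
buffers = [zeros(100_000) for _ in 1:nthreads()]
fill_in_place!(buffers, 10)
```

Benchmarking `fill_in_place!(buffers, 100)` with BenchmarkTools should report a far smaller memory estimate than the 7.45 GiB above, which makes it easier to tell whether the remaining time scales with the number of threads.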

What does Threads.nthreads() say? Also, do you have an MWE snippet? (Probably best in a new thread.)

This is a well-known misfeature of Windows that needs a small bit of code at thread creation time to work around. Threading libraries like Intel TBB do this for you, but otherwise you need a few lines of code.
You can see an article about it here, including how to fix it in C code:

In my test case, there are 250 Windows threads created by Julia, as I requested, perhaps via pthreads, which I assume lacks this code as well.

When they run in a tight loop that is not memory bound, you will see 256 non-idle threads, but only 1/4 CPU usage in the Task Manager. Using any good Windows diagnostic tool, you can see that Julia has created 250 threads, but they are all assigned to the same processor group instead of being spread over the 4 groups that are present in this machine. Thus, the 256 OS threads only use 64 hardware threads out of the 256 available.
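
A tight loop of this kind can be sketched as follows, assuming Julia 1.5 or later for `@threads :static`. The function names `spin` and `busy_all_threads` are made up for illustration. The work is pure arithmetic with no allocation, so memory bandwidth and GC should not limit it.

```julia
using Base.Threads

# Pure arithmetic: no allocation and almost no memory traffic.
function spin(n::Integer)
    acc = 0.0
    for k in 1:n
        acc += sin(k) * cos(k)
    end
    return acc
end

# Run the same compute-bound loop once on every Julia thread.
function busy_all_threads(n::Integer)
    results = zeros(nthreads())
    # :static gives each Julia thread exactly one iteration.
    @threads :static for t in 1:nthreads()
        results[t] = spin(n)
    end
    return results
end
```

Running something like `busy_all_threads(10^9)` while watching per-core usage shows how many hardware threads are actually busy. If only a quarter or half of them reach 100%, the threads are probably confined to one processor group.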

If you run a third-party program called “Process Lasso”, it can be told to “spread the threads out over all the processor groups” for a running process. Applying this to Julia lets it take advantage of all the cores. You will then see that the process manager shows 100% CPU usage on all cores, and that you get corresponding speedups.

I managed to get a 150x speedup over single-threaded code on this 128-core/256-thread system by doing this.

Memory bandwidth doesn’t figure into this. Even if the threads were horribly memory bound, they would consume 100% of all of the CPUs, though they might be spending more time servicing cache misses than doing math, and so not see a scalable speedup.

It’s an easy problem to repro, and an easy one to fix if you have a system to test on.