VectorizationBase seems to misdetect the number of physical cores

Hi guys~ I’m here again.

Not long ago, I made my first post here and got a lot of awesome answers about how to use multithreading and SIMD in Julia, and about best practices for benchmarking.

After I figured out how to install packages on a secure system (I’m playing with my company’s computing server), I tested Julia’s performance.

The test is to compute

@. output = (a - b) / (a + b) * log((a + b) * K + 1)

where a, b, and output are large Float64 vectors.

Taking the advice from @Oscar_Smith, this is what I wrote (the inner dot on log is redundant under @., so I dropped it):

using LoopVectorization

function f(a, b, K, c)
    @tturbo @. c = (a - b) / (a + b) * log((a + b) * K + 1)
end

using BenchmarkTools

n = 100_000_000
a = rand(n)
b = rand(n)
c = rand(n)

@btime f($a, $b, 1.1, $c)

The result is surprising: Julia is somehow faster than highly optimized C++ code. But there is still a confusing problem: Julia doesn’t use all of the cores of my CPU (a dual-socket AMD EPYC 7763, 64 cores per socket).

I started Julia via julia --threads=128, and I can see that Threads.nthreads() returns 128.

And when I execute Hwloc.topology_info(), it says:

Machine: 1 (2003.98 GB)
 Package: 2 (996.03 GB)
   NUMANode: 2 (996.03 GB)
      L3Cache: 16 (32.0 MB)
          L2Cache: 128 (512.0 kB)
               L1Cache: 128 (32.0 kB)
                     Core: 128
                            PU: 128

This is reasonable: I have 2 sockets with 64 physical cores each, and I have turned hyperthreading off.

But when I use htop to monitor CPU usage, I see that only 64 threads are running.

Then I checked the LoopVectorization documentation, which indicates that @tturbo will only use min(Threads.nthreads(), VectorizationBase.num_cores()) threads. Sadly, VectorizationBase.num_cores() returns 64, which means it will only use 64 threads.
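Under that rule, the cap can be inspected directly. This is just an illustrative sketch using the function names from the docs cited above; VectorizationBase.num_cores() returns a compile-time StaticInt, hence the Int() conversion:

```julia
using VectorizationBase

# The thread cap @tturbo applies, per the LoopVectorization docs:
# min(Julia thread count, detected physical core count).
cap = min(Threads.nthreads(), Int(VectorizationBase.num_cores()))
println("@tturbo will use at most $cap threads")
```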

I’m not sure whether more threads would bring better performance, but I’m curious how to get around this constraint. Is there any workaround?

Thanks in advance!


CPUSummary.num_cores is wrong.

Once upon a time, it was critically important that I supported running under WINE.
using Hwloc.jl would segfault Julia under WINE, so after CPUSummary 1.8, I dropped Hwloc.

I haven’t heard anything about WINE for a while, and I don’t think supporting it is still necessary,
so the solution might be to basically revert CPUSummary to its 1.8 version.

A PR to that effect would be welcome.
There are probably a few updates in behavior needed. Three that come to mind:

  1. Delete CPUSummary.num_threads, which exists in 1.8 but not in the latest version.
  2. The meaning of the cache_size functions may also have changed since then.
  3. Maybe hwloc supports the M1 now? Does it detect big vs little cores? Either way, at the time, Julia reported the 4 big/4 small M1 as 8 threads, but now it is 4. The 8-big-core version was also released since then, so my hack at the time is no longer appropriate.

Thanks so much!

I need to dig deeper into this, and once I fully understand what’s happening I will try to open a PR.

VectorizationBase is wrong because it is using CPUSummary.
CPUSummary is wrong because it hardcodes the number of nodes as 1.
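A quick way to see the disagreement (a sketch: Hwloc.num_physical_cores is Hwloc.jl’s topology-based count, while CPUSummary.num_cores is the static value in question):

```julia
using Hwloc, CPUSummary

# Compare the topology-based core count with CPUSummary's static one.
# On the dual-socket machine above they should disagree (128 vs 64)
# if the node count is hardcoded to 1.
hwloc_cores = Hwloc.num_physical_cores()
summary_cores = Int(CPUSummary.num_cores())
println("Hwloc: $hwloc_cores, CPUSummary: $summary_cores")
```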


Ah, now I think I know what’s happening here.

When I remove the current CPUSummary and add CPUSummary@0.1.8, I can see that LoopVectorization.num_cores() gives 128, which is correct.

However, only 64 threads are still running when I run the code shown above, and the CPU usage rate is even lower, so it basically didn’t improve anything.
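As a cross-check, one can bypass @tturbo’s thread choice entirely by splitting the arrays into one chunk per Julia thread and running single-threaded @turbo on each chunk. This is my own sketch, not something from the packages (f_chunked! is a made-up name):

```julia
using LoopVectorization

# Manual chunking: Threads.@threads distributes the chunks across all
# Julia threads, and @turbo vectorizes each chunk without spawning more.
function f_chunked!(c, a, b, K)
    n = length(c)
    nt = Threads.nthreads()
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        av = view(a, lo:hi); bv = view(b, lo:hi); cv = view(c, lo:hi)
        @turbo @. cv = (av - bv) / (av + bv) * log((av + bv) * K + 1)
    end
    return c
end
```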

My current guess is that the choose_num_threads function in LoopVectorization has a problem that causes this bizarre phenomenon.

Yes, pretty much. You can definitely get the different cache sizes for the two kinds of cores.