VectorizationBase seems to misdetect the number of physical cores

Hi guys~ I’m here again.

Not long ago, I made my first post here and got a lot of awesome answers about how to use multithreading and SIMD in Julia, and about best practices for benchmarking.

After I figured out how to install packages on a secure system (I’m playing with my company’s computing server), I tested Julia’s performance.

The test is to compute

@. output = (a - b) / (a + b) * log((a + b) * K + 1)

where a, b, and output are large Float64 vectors.

Taking the advice from @Oscar_Smith, this is what I wrote (the inner dot on log is redundant under @., so I dropped it):

using LoopVectorization

function f(a, b, K, c)
    @tturbo @. c = (a - b) / (a + b) * log((a + b) * K + 1)
end

using BenchmarkTools

n = 100_000_000
a = rand(n)
b = rand(n)
c = rand(n)

@btime f($a, $b, 1.1, $c)

The result is surprising: Julia is somehow faster than highly optimized C++ code. But there is still a confusing problem: Julia doesn’t use all of the cores of my CPU (a dual-socket AMD EPYC 7763, 64 cores per socket).

I started Julia via julia --threads=128, and I can see that Threads.nthreads() returns 128.

And when I execute Hwloc.topology_info(), it says:

Machine: 1 (2003.98 GB)
 Package: 2 (996.03 GB)
   NUMANode: 2 (996.03 GB)
      L3Cache: 16 (32.0 MB)
          L2Cache: 128 (512.0 kB)
               L1Cache: 128 (32.0 kB)
                     Core: 128
                            PU: 128

This is reasonable: I have 2 sockets with 64 physical cores each, and I have turned hyperthreading off.

But when I use htop to monitor CPU usage, I see that only 64 threads are running.

Then I checked the LoopVectorization documentation, which indicates that @tturbo will only use min(Threads.nthreads(), VectorizationBase.num_cores()) threads. Sadly, VectorizationBase.num_cores() returns 64, which means it will only use 64 threads.
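Under that rule, the cap can be inspected directly. This is just an illustrative sketch using the function names from the docs cited above; VectorizationBase.num_cores() returns a compile-time StaticInt, hence the Int() conversion:

```julia
using VectorizationBase

# The thread cap @tturbo applies, per the LoopVectorization docs:
# min(Julia thread count, detected physical core count).
cap = min(Threads.nthreads(), Int(VectorizationBase.num_cores()))
println("@tturbo will use at most $cap threads")
```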

I’m not sure whether more threads would bring better performance, but I’m curious how to get around this constraint. Is there any workaround?

Thanks in advance!


CPUSummary.num_cores is wrong.

Once upon a time, it was critically important that I supported running under WINE.
using Hwloc.jl would segfault Julia under WINE, so after CPUSummary 1.8, I dropped Hwloc.

I haven’t heard anything about WINE for a while, and I don’t think supporting it is still necessary,
so the solution might be to basically revert CPUSummary to its 1.8 version.

A PR to that effect would be welcome.
There are probably a few updates in behavior needed. Three that come to mind:

  1. Delete CPUSummary.num_threads, which exists in 1.8 but not in the latest version.
  2. The meaning of the cache_size functions may also have changed since then.
  3. Maybe hwloc supports the M1 now? Does it detect big vs little cores? Either way, at the time, Julia reported the 4 big/4 small M1 as 8 threads, but now it is 4. The 8-big-core version was also released since then, so my hack at the time is no longer appropriate.

Thanks so much!

I need to dig deeper into this, and once I fully understand what’s happening I will try to open a PR.

VectorizationBase is wrong because it is using CPUSummary.
CPUSummary is wrong because it hardcodes the number of nodes as 1.
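A quick way to see the disagreement (a sketch: Hwloc.num_physical_cores is Hwloc.jl’s topology-based count, while CPUSummary.num_cores is the static value in question):

```julia
using Hwloc, CPUSummary

# Compare the topology-based core count with CPUSummary's static one.
# On the dual-socket machine above they should disagree (128 vs 64)
# if the node count is hardcoded to 1.
hwloc_cores = Hwloc.num_physical_cores()
summary_cores = Int(CPUSummary.num_cores())
println("Hwloc: $hwloc_cores, CPUSummary: $summary_cores")
```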


Ah, now I think I know what’s happening here.

When I remove the current CPUSummary and add CPUSummary@0.1.8, I can see that LoopVectorization.num_cores() gives 128, which is correct.

However, only 64 threads are still running when I run the code shown above, and the CPU usage rate is even lower, so it basically didn’t improve anything.
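As a cross-check, one can bypass @tturbo’s thread choice entirely by splitting the arrays into one chunk per Julia thread and running single-threaded @turbo on each chunk. This is my own sketch, not something from the packages (f_chunked! is a made-up name):

```julia
using LoopVectorization

# Manual chunking: Threads.@threads distributes the chunks across all
# Julia threads, and @turbo vectorizes each chunk without spawning more.
function f_chunked!(c, a, b, K)
    n = length(c)
    nt = Threads.nthreads()
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        av = view(a, lo:hi); bv = view(b, lo:hi); cv = view(c, lo:hi)
        @turbo @. cv = (av - bv) / (av + bv) * log((av + bv) * K + 1)
    end
    return c
end
```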

My current guess is that the choose_num_threads function in LoopVectorization has a problem that causes this bizarre phenomenon.

Yes, pretty much. You can definitely get the different cache sizes for the two kinds of cores.