Hi guys~ I’m here again.
Not long ago, I posted my first post here and got a lot of awesome answers about how to use multithreading and SIMD in Julia, and about best practices for benchmarking.
After I figured out how to install packages on a secured system (I'm working on my company's computing server), I tested the performance of Julia.
The test computes
@. output = (a - b) / (a + b) * log((a + b) * K + 1)
where a, b, and output are three large Float64 vectors and K is a scalar.
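To make the expression concrete, here is a minimal plain-Julia sketch of the same computation on tiny vectors, with no packages involved. The helper name `reference!` and the sample values are made up for illustration only; this is not the code I benchmarked.

```julia
# Plain broadcast version of the test expression (Base Julia only).
# `reference!` is a hypothetical helper name used just for this illustration.
function reference!(output, a, b, K)
    # @. adds the dots, so every call and operator is applied elementwise
    @. output = (a - b) / (a + b) * log((a + b) * K + 1)
    return output
end

# Small made-up inputs so the result can be checked by hand
a = [1.0, 2.0, 3.0]
b = [0.5, 1.0, 1.5]
output = similar(a)
reference!(output, a, b, 1.1)

# The same formula written out for the first element
first_expected = (a[1] - b[1]) / (a[1] + b[1]) * log((a[1] + b[1]) * 1.1 + 1)
println(output[1] ≈ first_expected)
```

A small version like this is also handy for checking that an optimized variant still returns the same values.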
Taking the advice from @Oscar_Smith, this is what I wrote:
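using LoopVectorization, BenchmarkTools

function f(a, b, K, c)
    # @tturbo runs the broadcast with SIMD and multiple threads
    @tturbo @. c = (a - b) / (a + b) * log((a + b) * K + 1)
end
n = 100_000_000
a = rand(n)
b = rand(n)
c = rand(n)
# Interpolate the globals with $ so @btime measures only the call itself
@btime f($a, $b, 1.1, $c)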
The result is surprising, since Julia is somehow faster than highly optimized C++ code. But one thing still confuses me: Julia doesn't use all of the cores on my machine (a dual-socket AMD EPYC 7763 system with 64 cores per socket).
I started Julia with julia --threads=128, and Threads.nthreads() returns 128.
And when I execute Hwloc.topology_info(), it says:
Machine: 1 (2003.98 GB)
Package: 2 (996.03 GB)
NUMANode: 2 (996.03 GB)
L3Cache: 16 (32.0 MB)
L2Cache: 128 (512.0 kB)
L1Cache: 128 (32.0 kB)
Core: 128
PU: 128
This is as expected: I have 2 sockets with 64 physical cores each, and I have hyperthreading turned off.
But when I monitor CPU usage with htop, I see that only 64 threads are running.
Then I checked the LoopVectorization documentation, which says that @tturbo will use at most min(Threads.nthreads(), VectorizationBase.num_cores()) threads. Sadly, VectorizationBase.num_cores() returns 64 on my machine, which means that @tturbo will only use 64 threads.
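To spell out the arithmetic, here is a tiny sketch of the cap as I understand it from the docs. The helper `tturbo_thread_cap` is a name I made up (it is not part of LoopVectorization or VectorizationBase), and the numbers are the ones reported on my machine.

```julia
# Thread cap as described in the LoopVectorization docs:
# min(Threads.nthreads(), VectorizationBase.num_cores()).
# `tturbo_thread_cap` is a hypothetical helper, not an API of any package.
tturbo_thread_cap(nthreads::Integer, ncores::Integer) = min(nthreads, ncores)

nthreads_started = 128   # what Threads.nthreads() reports in my session
cores_detected = 64      # what VectorizationBase.num_cores() reports

cap = tturbo_thread_cap(nthreads_started, cores_detected)
println("threads started: ", nthreads_started, ", threads @tturbo can use: ", cap)
```

In a live session the two hard-coded numbers would be replaced by the actual calls Threads.nthreads() and VectorizationBase.num_cores(). As long as the second one reports 64, starting Julia with more threads doesn't change the cap.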
I'm not sure whether more threads would give better performance, but I'm curious whether there is a way to lift this limit, or any workaround.
Thanks in advance!