I’ve been playing around with ThreadPinning.jl in connection with this post: Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly? - #30 by fgerick.
I have some trouble understanding this behaviour:
#julia -t 48
julia> using BenchmarkTools, Hwloc, ThreadPinning, LinearAlgebra, Octavian; A = rand(5_000,5_000); B = similar(A);
julia> BLAS.get_num_threads()
96
julia> BLAS.set_num_threads(48)
julia> @btime mul!($B, $A, $A);
626.977 ms (0 allocations: 0 bytes)
julia> BLAS.set_num_threads(24)
julia> @btime mul!($B, $A, $A);
377.458 ms (0 allocations: 0 bytes)
julia> @btime matmul!($B, $A, $A);
521.579 ms (0 allocations: 0 bytes)
julia> pinthreads(:compact)
julia> @btime matmul!($B, $A, $A);
223.912 ms (0 allocations: 0 bytes)
julia> @btime matmul!($B, $A, $A);
233.511 ms (0 allocations: 0 bytes)
julia> pinthreads(:compact)
julia> @btime matmul!($B, $A, $A);
188.991 ms (0 allocations: 0 bytes)
julia> @btime matmul!($B, $A, $A);
186.621 ms (0 allocations: 0 bytes)
julia> @btime mul!($B, $A, $A);
380.789 ms (0 allocations: 0 bytes)
The OpenBLAS threads are not affected by ThreadPinning, as I understand it. However, why do I see different timings for the pure Julia code in Octavian’s matmul! after the first and the second call to pinthreads(:compact)? Is there a way to start Julia on just one socket, without having to pin afterwards?
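
In case it helps with reproducing this: below is a minimal sketch for checking which CPU each Julia thread is sitting on, which could be run before and after each pinthreads(:compact) call to see whether the placement actually differs. It assumes Linux with glibc (it calls sched_getcpu directly), and the helper names current_cpu and cpus_per_thread are made up for this example; they are not part of ThreadPinning.jl. It only looks at Julia threads, not at the OpenBLAS threads.

```julia
using Base.Threads

# Assumption: Linux/glibc. Ask the OS which CPU the calling thread is on right now.
current_cpu() = Int(ccall(:sched_getcpu, Cint, ()))

# Hypothetical helper: record the current CPU of every Julia thread.
# With the :static schedule, iteration i runs on the i-th thread of the default pool.
function cpus_per_thread()
    cpus = zeros(Int, nthreads())
    @threads :static for i in 1:nthreads()
        cpus[i] = current_cpu()
    end
    return cpus
end

println(cpus_per_thread())
```

Without pinning, the OS is free to migrate threads, so repeated calls may report different CPUs; after pinning, the result should stay the same from call to call.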