Thread affinitization: pinning Julia threads to cores

I’ve been experimenting with ThreadPinning.jl, following up on this post: "Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly?" (#30, by fgerick).

I have some trouble understanding this behaviour:

# Julia started with: julia -t 48
julia> using BenchmarkTools, Hwloc, ThreadPinning, LinearAlgebra, Octavian; A = rand(5_000,5_000); B = similar(A);

julia> BLAS.get_num_threads()
96

julia> BLAS.set_num_threads(48)

julia> @btime mul!($B, $A, $A);
  626.977 ms (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(24)

julia> @btime mul!($B, $A, $A);
  377.458 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  521.579 ms (0 allocations: 0 bytes)

julia> pinthreads(:compact)

julia> @btime matmul!($B, $A, $A);
  223.912 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  233.511 ms (0 allocations: 0 bytes)

julia> pinthreads(:compact)

julia> @btime matmul!($B, $A, $A);
  188.991 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  186.621 ms (0 allocations: 0 bytes)

julia> @btime mul!($B, $A, $A);
  380.789 ms (0 allocations: 0 bytes)

As far as I understand, the OpenBLAS threads used by mul! are not affected by ThreadPinning. However, why do I get different timings for the pure-Julia code in Octavian’s matmul! after the first call to pinthreads(:compact) (about 224 to 234 ms) and after the second call (about 187 to 189 ms)? Also, is there a way to start Julia on just one socket, without having to pin afterwards?
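
To narrow this down, it might help to record which CPU each Julia thread is running on before and after each pinthreads(:compact) call. Below is a minimal sketch of such a check, assuming Linux with glibc (it calls sched_getcpu directly). The helper name current_cpus is hypothetical and not part of ThreadPinning.jl.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical helper (not from ThreadPinning.jl).
# Assumes Linux/glibc: sched_getcpu returns the ID of the CPU the
# calling thread is currently running on, or -1 on error.
function current_cpus()
    cpus = zeros(Int, nthreads())
    # :static scheduling assigns exactly one iteration to each Julia thread
    # when the range has nthreads() elements.
    @threads :static for i in 1:nthreads()
        cpus[i] = ccall(:sched_getcpu, Cint, ())
    end
    return cpus
end

println("CPU IDs per Julia thread: ", current_cpus())
```

Running this right after startup, after the first pinthreads(:compact), and after the second one would show whether the thread-to-core mapping actually differs between the two pinned runs. Unpinned threads can migrate, so their reported IDs are only a snapshot.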