Is there anything that needs to be taken into account or configured for AMD Milan processors? I just ran the same code on a 2x24-core AMD EPYC 7443 machine and got this result:
julia> Threads.nthreads()
96
julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);
julia> A = rand(10_000,10_000); B = similar(A);
julia> @time mul!(B, A, A);
3.391364 seconds (2.49 M allocations: 124.344 MiB, 12.61% compilation time)
julia> @time mul!(B, A, A);
3.178868 seconds
julia> using MKL
julia> @time mul!(B, A, A);
2.854096 seconds
julia> @time mul!(B, A, A);
2.724407 seconds
julia> using Octavian
julia> @time matmul!(B, A, A);
14.762420 seconds (28.56 M allocations: 1.495 GiB, 1.96% gc time, 80.99% compilation time)
julia> @time matmul!(B, A, A);
3.115852 seconds
julia> versioninfo()
Julia Version 1.8.0-DEV.1405
Commit 2010d95d8a (2022-01-26 17:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7443 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver3)
Shouldn’t the 7513 and the 7443 perform quite comparably, even though the 7513 has 8 more cores per socket?
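
To make the numbers easier to compare across machines, here is a rough sketch that converts the second-run timings above into GFLOPS and prints the active BLAS configuration. It assumes the usual 2n^3 operation count for a dense n×n matrix product; the helper `gflops` and the label "default BLAS" are just illustrative names (the default backend is presumably OpenBLAS, but the session above does not confirm that).

```julia
using LinearAlgebra

# Show what BLAS is actually loaded and how many threads it uses.
# After `using MKL` the config should list MKL instead of the default backend.
@show BLAS.get_config()
@show BLAS.get_num_threads()
@show Sys.CPU_THREADS

# Convert a wall-clock time for an n×n by n×n product into GFLOPS,
# assuming the conventional 2n^3 floating-point operation count.
gflops(n, seconds) = 2 * float(n)^3 / seconds / 1e9

n = 10_000
# Second-run timings copied from the session above (compilation excluded).
timings = ["default BLAS" => 3.178868, "MKL" => 2.724407, "Octavian" => 3.115852]

for (name, t) in timings
    println(rpad(name, 14), round(gflops(n, t); digits = 1), " GFLOPS")
end
```

Dividing the result by the number of BLAS threads gives a per-core figure, which may be a fairer basis for comparing a 24-core part against a 32-core one.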