I would like to run a test on an AMD EPYC 7543 32-Core Processor. I guess unless I want to build Julia myself, I am out of luck?
The generic x86_64 Julia download will work fine. The system image is multi-versioned, so you won’t get fully tuned code out, but you’ll get a good enough version. Pkgimages compiled locally will be tuned to the microarchitecture.
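To see which microarchitecture locally compiled code will target, a quick check along these lines may help. This is only a sketch: the printed values depend on the machine, and `Sys.CPU_NAME` reports LLVM's name for the host CPU (the `versioninfo()` output further down this thread shows `znver3` for an EPYC 7513).

```julia
# LLVM's name for the host CPU; native code generation targets this by default.
println("CPU name:     ", Sys.CPU_NAME)
println("Architecture: ", Sys.ARCH)
println("Logical CPUs: ", Sys.CPU_THREADS)
```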
One thing to note is that Epyc is the one system where MKL will do horrible things to your performance.
Bad, but not horrible (I haven’t tried other examples):
julia> using LinearAlgebra, BenchmarkTools
julia> N = 500; A = rand(N,N); As = similar(A);
julia> BLAS.set_num_threads(1);
julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1999 samples with 1 evaluation.
Range (min … max): 2.467 ms … 4.590 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.484 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.494 ms ± 75.584 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.47 ms Histogram: frequency by time 2.57 ms <
Memory estimate: 4.06 KiB, allocs estimate: 1.
julia> using MKL; BLAS.set_num_threads(1);
julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1655 samples with 1 evaluation.
Range (min … max): 2.987 ms … 5.348 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.009 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.018 ms ± 65.104 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.99 ms Histogram: frequency by time 3.09 ms <
Memory estimate: 4.06 KiB, allocs estimate: 1.
Multithreading was worse:
julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1243 samples with 1 evaluation.
Range (min … max): 3.771 ms … 12.269 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.936 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.009 ms ± 500.821 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
3.77 ms Histogram: frequency by time 4.84 ms <
Memory estimate: 4.06 KiB, allocs estimate: 1.
julia> using MKL
julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 535 samples with 1 evaluation.
Range (min … max): 2.191 ms … 127.404 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.341 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.345 ms ± 8.530 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
2.19 ms Histogram: log(frequency) by time 43.4 ms <
Memory estimate: 4.06 KiB, allocs estimate: 1.
9 ms mean time for MKL vs. 4 ms for OpenBLAS!
Note that these were using the same size; OpenBLAS also slowed down when using multiple threads. RecursiveFactorization.jl’s LU (RFLU) was the clear winner for 500x500 LU on Epyc.
julia> @benchmark RecursiveFactorization.lu!(copyto!($As,$A))
BenchmarkTools.Trial: 2403 samples with 1 evaluation.
Range (min … max): 2.050 ms … 2.707 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.071 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.077 ms ± 23.353 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.05 ms Histogram: frequency by time 2.13 ms <
Memory estimate: 4.06 KiB, allocs estimate: 1.
The really big problem is that if you don’t set threads to 1 manually, MKL will try to use 64 threads to do a 10x10 matmul and is about 100x slower.
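One way to guard small problems against thread oversubscription is to pin the BLAS thread count around the call and restore it afterwards. The helper below is a minimal sketch, assuming Julia 1.6 or later (for `BLAS.get_num_threads`); `with_blas_threads` is a hypothetical name, not part of LinearAlgebra.

```julia
using LinearAlgebra

# Run `f()` with the BLAS thread count temporarily set to `n`,
# restoring the previous setting even if `f` throws.
# `with_blas_threads` is a hypothetical helper, not a LinearAlgebra API.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

# Usage: a small matmul on a single BLAS thread.
A = rand(10, 10)
C = similar(A)
with_blas_threads(1) do
    mul!(C, A, A)
end
```

The `try`/`finally` keeps the global thread setting from leaking out of the call, which matters because `BLAS.set_num_threads` changes state for the whole process.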
I don’t see that.
I do see that it is worse than OpenBLAS, but not 100x slower:
julia> N = 10; A = rand(N,N); As = similar(A);
julia> using LinearAlgebra, BenchmarkTools
julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 331 evaluations.
Range (min … max): 263.834 ns … 1.244 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 282.414 ns ┊ GC (median): 0.00%
Time (mean ± σ): 282.021 ns ± 14.744 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
264 ns Histogram: log(frequency) by time 299 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> using MKL
julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 227 evaluations.
Range (min … max): 324.621 ns … 2.547 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 329.559 ns ┊ GC (median): 0.00%
Time (mean ± σ): 330.965 ns ± 30.569 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
325 ns Histogram: frequency by time 361 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> BLAS.set_num_threads(32);
julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 232 evaluations.
Range (min … max): 322.366 ns … 21.595 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 327.625 ns ┊ GC (median): 0.00%
Time (mean ± σ): 330.724 ns ± 212.910 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
322 ns Histogram: frequency by time 355 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD EPYC 7513 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
LD_UN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
JULIA_PATH = @.
JULIA_NUM_THREADS = 64
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
julia> BLAS.set_num_threads(1);
julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 232 evaluations.
Range (min … max): 322.284 ns … 9.219 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 324.651 ns ┊ GC (median): 0.00%
Time (mean ± σ): 326.579 ns ± 89.074 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
322 ns Histogram: log(frequency) by time 353 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Maybe you’re thinking of BLIS, or perhaps MKL fixed it?
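Since results like these depend on which backend is actually active, it can help to confirm that before benchmarking. A minimal sketch, assuming Julia 1.7 or later (where BLAS calls go through libblastrampoline); `loaded_libs`, `libname`, and `interface` are fields of the `LBTConfig` object that `BLAS.get_config()` returns:

```julia
using LinearAlgebra

# List the libraries that BLAS/LAPACK calls are currently forwarded to,
# along with each library's integer interface (:ilp64 or :lp64).
for lib in BLAS.get_config().loaded_libs
    println(basename(lib.libname), "  (", lib.interface, ")")
end

# Report the current BLAS thread count as well.
println("BLAS threads: ", BLAS.get_num_threads())
```

With the default build this should list an OpenBLAS library; after `using MKL` it should list `libmkl_rt` instead, as in the `BLAS.get_config()` output above.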