There is no Julia for AMD EPYC, is there?

I would like to run a test on an AMD EPYC 7543 32-Core Processor. I guess unless I want to build Julia myself, I am out of luck?

The generic x86_64 Julia download will work fine. The system image is multi-versioned, so you won’t get fully tuned code, but you’ll get a good-enough version. Pkgimages compiled locally will be tuned to the microarchitecture.
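For what it’s worth, you can confirm what the generic build detects once it’s running; Sys.CPU_NAME and versioninfo() are standard Base/InteractiveUtils calls, and the "znver3" in the comment is what I’d expect for a Zen 3 EPYC like the 7543, not something verified on that exact chip:

julia> Sys.CPU_NAME   # LLVM's name for the detected microarchitecture; should be "znver3" on Zen 3 EPYC

julia> versioninfo()  # also prints the CPU model and the LLVM target in one place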


One thing to note is that EPYC is the one system where MKL will do horrible things to your performance.


Bad, but not horrible (I haven’t tried other examples):

julia> using LinearAlgebra, BenchmarkTools

julia> N = 500; A = rand(N,N); As = similar(A);

julia> BLAS.set_num_threads(1);

julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1999 samples with 1 evaluation.
 Range (min … max):  2.467 ms …  4.590 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.484 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.494 ms ± 75.584 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▄▁▂█▆▆▄▂▂
  ▃▆██████████▇█▆▅▆▅▅▅▄▄▄▄▄▄▃▄▃▃▄▃▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂ ▄
  2.47 ms        Histogram: frequency by time        2.57 ms <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> using MKL; BLAS.set_num_threads(1);

julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1655 samples with 1 evaluation.
 Range (min … max):  2.987 ms …  5.348 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.009 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.018 ms ± 65.104 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅▆██▇▄▄▂
  ▁▁▁▂▅▇███████████▇▇▅▄▄▃▃▄▃▂▂▂▃▂▃▁▂▃▂▂▃▂▂▁▂▂▂▁▁▂▂▂▁▁▁▁▁▁▁▁▁ ▃
  2.99 ms        Histogram: frequency by time        3.09 ms <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

Multithreading was worse:

julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 1243 samples with 1 evaluation.
 Range (min … max):  3.771 ms …  12.269 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.936 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.009 ms ± 500.821 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂█▂
  ▂▃███▇▅▅▄▅▄▅▄▇█▆▄▄▄▃▄▃▃▃▂▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂ ▃
  3.77 ms         Histogram: frequency by time        4.84 ms <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> using MKL

julia> @benchmark lu!(copyto!($As,$A))
BenchmarkTools.Trial: 535 samples with 1 evaluation.
 Range (min … max):  2.191 ms … 127.404 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.341 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.345 ms ±   8.530 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▅▇▅▇▆█▄▃▂▂▁▂ ▁ ▁
  ▄▅█████████████▇█▁█▅▁▅▆▄▁▄▄▅▆▄▄▄▅▁▄▄▁▅▄▄▁▄▅▁▄▁▄▁▁▁▁▁▁▅▄▁▁▁▅ ▇
  2.19 ms      Histogram: log(frequency) by time      43.4 ms <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

9 ms mean time for MKL, vs 4 ms for OpenBLAS!

Note that these were for the same matrix size; OpenBLAS also slowed down when using multiple threads. RecursiveFactorization.lu! (RFLU) was the clear winner for 500×500 LU on EPYC.
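(If you want to reproduce the snippet below, note that RecursiveFactorization is a separate registered package; the session presumably loaded it beforehand with something like this, which isn’t shown:)

julia> using Pkg; Pkg.add("RecursiveFactorization")  # one-time install from the General registry

julia> using RecursiveFactorization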

julia> @benchmark RecursiveFactorization.lu!(copyto!($As,$A))
BenchmarkTools.Trial: 2403 samples with 1 evaluation.
 Range (min … max):  2.050 ms …  2.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.071 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.077 ms ± 23.353 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▅▁▂▄█▅▇▅▄▅▂▂▂
  ▂▂▃▃██████████████▇▇▇▅▅▅▅▅▄▄▄▄▃▃▄▃▃▄▅▃▂▃▃▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁ ▃
  2.05 ms        Histogram: frequency by time        2.13 ms <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

The really big problem is that if you don’t set the thread count to 1 manually, MKL will try to use 64 threads for a 10×10 matmul and ends up about 100× slower.
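A simple guard, if you don’t want to think about it per call, is to cap the BLAS thread count right after loading LinearAlgebra, e.g. in ~/.julia/config/startup.jl (whether 1 is the right cap for larger workloads is a separate question):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # keep tiny matmuls from fanning out across all 64 threads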

I don’t see that.
I do see that it is worse than OpenBLAS, but not 100x slower:

julia> N = 10; A = rand(N,N); As = similar(A);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 331 evaluations.
 Range (min … max):  263.834 ns …  1.244 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     282.414 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   282.021 ns ± 14.744 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▁     ▂▅▄▃▃▃▂▁▁▂▃▂ ▁  ▁▇█▅▃▆▄▁▄▃ ▂▂  ▂▁ ▂▂            ▂
  ▅▄▁▄▅▇██████▆████████████████████████████▆███▇██▇▆▇▇▆▅▆▅▆▄▅ █
  264 ns        Histogram: log(frequency) by time       299 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using MKL

julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 227 evaluations.
 Range (min … max):  324.621 ns …  2.547 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     329.559 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   330.965 ns ± 30.569 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▅▄▄█▇▆▅▇▅▁
  ▃▆███████████▆▄▃▃▃▃▃▃▃▄▃▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  325 ns          Histogram: frequency by time          361 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> BLAS.set_num_threads(32);

julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 232 evaluations.
 Range (min … max):  322.366 ns …  21.595 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     327.625 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   330.724 ns ± 212.910 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇▄▁▂   ▁▁      ▂▅▂
  ▇██████▆▇███▇▇▆▄▅████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▄
  322 ns           Histogram: frequency by time          355 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_RUN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
β”œ [ILP64] libmkl_rt.so
β”” [ LP64] libmkl_rt.so

julia> BLAS.set_num_threads(1);

julia> @benchmark mul!($As, $A, $A)
BenchmarkTools.Trial: 10000 samples with 232 evaluations.
 Range (min … max):  322.284 ns …  9.219 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     324.651 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   326.579 ns ± 89.074 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅███▇▇▇▆▆▅▄▄▃▂▁▁   ▁▁  ▁                                     ▃
  ████████████████████▇████▇▇▇▇▇▇▆▆▃▆▃▁▅▄▄▅▆▆▆▆▆▆▆▄▅▃▅▅▄▃▃▄▁▅ █
  322 ns        Histogram: log(frequency) by time       353 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Maybe you’re thinking of BLIS, or perhaps MKL fixed it?
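If you want to rule out a BLIS mix-up, the registered BLISBLAS.jl package swaps the libblastrampoline backend the same way MKL.jl does, so the check is a one-liner (I haven’t rerun the benchmarks with it here):

julia> using BLISBLAS

julia> BLAS.get_config()  # should now list BLIS as the forwarded backend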