BLAS performance testing for Julia 1.8

peakflops(10000) gives me 1.94e11 on Julia 1.6.2 and 4.44e11 on

Version 1.8.0-DEV.702 (2021-10-11)
Commit bc89d8365d (0 days old master)

This is on a Ryzen 5950X machine (16 cores, 32 threads). BLAS picks 32 threads in the latest build.

I recommend calling peakflops multiple times instead of using a larger size, which takes a lot more memory and an awful lot more time just to give a worse peak flops:

julia> @time maximum(peakflops() for _ in 1:10)
  1.283524 seconds (43.27 k allocations: 614.688 MiB, 4.74% gc time, 1.99% compilation time)
1.5731878913261862e11

julia> @time maximum(peakflops(10_000) for _ in 1:10)
206.243965 seconds (415.71 k allocations: 14.926 GiB, 0.62% gc time, 0.03% compilation time)
1.0870695287018651e11

I get higher results from larger values.

Are you on a laptop that's overheating and thus throttling?

I use 16_000, and this is where I see the best results.

julia> using LinearAlgebra

julia> @time maximum(peakflops() for _ in 1:10)
  0.752822 seconds (3.66 M allocations: 801.841 MiB, 7.95% gc time, 76.95% compilation time)
1.6886556011175947e12

julia> @time maximum(peakflops(10_000) for _ in 1:10)
 11.147155 seconds (615.83 k allocations: 14.936 GiB, 1.96% gc time, 0.54% compilation time)
1.9820846150376135e12

julia> @time peakflops(16_000)
  4.226806 seconds (11 allocations: 3.815 GiB, 0.38% gc time)
2.0390490771767915e12

julia> versioninfo()
Julia Version 1.8.0-DEV.660
Commit 153db7a7a8* (2021-10-05 16:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

It only takes a single run for 16_000 to win; it consistently gets over 2e12.

This is with an 18 core CPU that has AVX512.

@tbeason what do you get if you set BLAS.set_num_threads(16)? The 10980XE should be close to 2x faster, not over 4x faster, than the 5950X.
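For example (a minimal sketch; the size and repeat count are arbitrary):

using LinearAlgebra

BLAS.set_num_threads(16)   # one thread per physical core on the 5950X
maximum(peakflops(10_000) for _ in 1:5)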

julia> estimate_peak(; GHz = 4, ncores = (Sys.CPU_THREADS)::Int ÷ 2, fma_per_cycle=2, vector_width=8) = GHz*ncores*fma_per_cycle*2*vector_width # 2 = ops per fma
estimate_peak (generic function with 1 method)

julia> estimate_peak() # *1e9 = 2.3e12 estimated peak for a 10980XE clocked at 4GHz
2304

julia> estimate_peak(ncores=16, vector_width=4) # *1e9, so 1e12 estimated peak
1024

(The *1e9 comes from 1 GHz = 1e9 Hz.)
Meaning you should probably be able to get close to 1e12 instead of < 5e11.

Forcing BLAS to use 16 threads puts the results more in the 4e11 range instead of 4.4e11. Using 16000 instead of 10000 for peakflops does result in a slight increase, but still far short of 1e12.

I'm curious, how about 8 threads?
I'm wondering if a single core complex can get comparable performance.

Also, if you're willing to try Octavian (with a 7980XE):

julia> using Octavian

julia> M = K = N = 10_000; T = Float64; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);

julia> C1 = @time(A*B);
  1.659304 seconds (45 allocations: 762.942 MiB, 2.00% gc time, 0.51% compilation time)

julia> @time(matmul!(C0,A,B)) ≈ C1
 24.181371 seconds (38.73 M allocations: 2.312 GiB, 1.41% gc time, 94.53% compilation time)
true

julia> @time(matmul!(C0,A,B)) ≈ C1
  1.240348 seconds
true

julia> 2M*K*N / 1.240348
1.6124506993198684e12

julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.5091441211696343e12

julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.3809532134370098e12

julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.6087288779508767e12

julia> BLAS.set_num_threads(36)

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.3121943004311746e12

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.3067611981202083e12

julia> BLAS.set_num_threads(18)

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.5145908595120166e12

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.6394751354135144e12

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.6367653198510771e12

julia> BLAS.set_num_threads(8)

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
7.772799503515851e11

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
7.769725165120771e11

julia> versioninfo()
Julia Version 1.8.0-DEV.709
Commit 1389c2fc4a* (2021-10-12 16:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz

Octavian got competitive performance with OpenBLAS here; I'm curious whether it does any better on the 5950X or also stumbles.

A side comment / feature request :grinning:: We have JULIA_EXCLUSIVE=1 for compact pinning of Julia threads (i.e. pin 1:N Julia threads to the first 1:N cores). If we had more information about the system (sockets / NUMA domains), we could also offer a "scattered pinning", where Julia threads are pinned to cores from both sockets in an alternating fashion. This can have a big influence on performance (MFlops/s); see e.g. JuliaPerf/BandwidthBenchmark.jl (measuring memory bandwidth using TheBandwidthBenchmark). Also check it out if you just like unicode plots :smiley:.

But let me stop derailing this thread :slight_smile:

It would be important for the BLAS library to be aware of the type of pinning, if possible.
You'd want to divide N among the separate L3 caches, and have all the cores that share an L3 iterate over their block of N together.

Octavian is currently very conservative because it assumes no thread pinning, meaning it'd underperform on systems with a split L3, like the 5950X.
I'm wondering just how badly it underperforms. OpenBLAS seems really bad on the 5950X as well.

Another thing to try on the 5950X is MKL. I'm wondering if that does better.
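Something along these lines should work (a sketch assuming the MKL.jl package is installed; on Julia 1.7+ it swaps the BLAS backend when loaded):

using MKL            # load before benchmarking; swaps the default BLAS to MKL
using LinearAlgebra

BLAS.get_config()    # check that MKL is now the active BLAS backend
maximum(peakflops(10_000) for _ in 1:5)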

Here you go.

julia> Threads.nthreads()
32

julia> using Octavian

julia> M = K = N = 10_000; T = Float64; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);

julia> C1 = @time(A*B);
  4.784390 seconds (2.42 M allocations: 883.761 MiB, 0.19% gc time, 7.76% compilation time)

julia> @time(matmul!(C0,A,B)) ≈ C1
 14.185721 seconds (28.96 M allocations: 1.500 GiB, 0.80% gc time, 68.90% compilation time)
true

julia> @time(matmul!(C0,A,B)) ≈ C1
  4.304452 seconds
true

julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
4.67596191784391e11

julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
4.658076784902242e11

julia> using LinearAlgebra

julia> BLAS.get_num_threads()
32

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.586317908376697e11

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.5314859311069586e11

julia> BLAS.set_num_threads(16)

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.0128038136403345e11

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
3.914658181149202e11

julia> BLAS.set_num_threads(8)

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.0726904378339746e11

julia> 2M*K*N / @elapsed(mul!(C0,A,B))
3.91119468204246e11

julia> versioninfo()
Julia Version 1.8.0-DEV.702
Commit bc89d8365d (2021-10-11 05:42 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver3)

Thanks! I wonder if this is a Windows problem, e.g. maybe it's restricting the process to a single 8-core complex, regardless of the number of threads. Can you use a system monitor to watch core use?

The fact that you get the same performance out of 8 cores as you do out of 16 seems to imply that.

Anyone on Linux with a 5950X?

FWIW, HPC clusters and cloud environments likely already set the process affinity to be compatible with the underlying hardware and your resource request. IMHO, this is a more important limit for us to respect. Luckily, it seems OpenBLAS already respects the process affinity (https://github.com/JuliaLang/julia/pull/42340#issuecomment-932835362).

I thought we disabled CPU affinity in general. Maybe it gets respected if set at the OS level?

Maybe you are referring to https://github.com/JuliaLang/julia/pull/9639? It's a different issue. I was commenting that OpenBLAS respects the user-specified affinity to restrict the number of threads it uses. So it's different from #9639, which fixed the problem that Julia didn't respect the user-specified affinity and even overwrote it.

Hi, thanks so much for making this happen. Can I ask whether the maximum number of threads has been decreased from 4096 to 1024? It seems so:

julia -p 128

returns:

OpenBLAS blas_thread_init: pthread_create failed for thread 127 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 1024 max
ERROR: TaskFailedException

I am on Julia Version 1.8.0-DEV.829 (2021-10-27), Commit d71b77d7b6 (6 days old master). I will try the latest version shortly; I cannot do it right now, hence the question.

Edit 1:
A similar situation occurs with julia -p 64.

Edit 2:
I managed to upgrade. The situation is similar on Version 1.8.0-DEV.875 (2021-11-02), Commit 7eba9c1d76 (0 days old master).

Edit 3:
I investigated the topic a little further. When trying julia -p 128 or julia -p 64 I was getting the readings presented above. As I now understand it, OpenBLAS by default spawns a thread count equal to the number of available logical processors, which is higher than the RLIMIT_NPROC set by default on my machine. I understand that one can:

  • explicitly set the number of OpenBLAS threads, e.g. with export OMP_NUM_THREADS=1 or by starting Julia with OMP_NUM_THREADS=1 julia -p 64 (see the sketch after this list),
    or
  • raise the process limit (RLIMIT_NPROC) per user (on Ubuntu) in /etc/security/limits.conf and/or system-wide, potentially in /etc/sysctl.conf.
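A minimal sketch of the first option from within Julia, assuming locally launched workers inherit the parent process environment (which they do for addprocs on the same machine):

using Distributed

ENV["OMP_NUM_THREADS"] = "1"   # set before spawning; the workers inherit the environment
addprocs(64)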

I'm pretty sure we are going to go with 512 threads on openblas, because of this:

https://github.com/xianyi/OpenBLAS/issues/3403

If you are doing multi-threaded Julia, you probably want openblas to use only 1 thread. We do this in Distributed, but not in multi-threading.
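For example, in a Julia-threads workload (a minimal sketch):

using LinearAlgebra
BLAS.set_num_threads(1)        # keep OpenBLAS single-threaded

results = zeros(Threads.nthreads())
Threads.@threads for i in eachindex(results)
    A = rand(200, 200)
    results[i] = sum(A * A)    # each BLAS call stays on one thread; Julia threads provide the outer parallelism
end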

-viral

I'm pretty sure we are going to go with 512 threads on openblas […]

Hi! Sounds cool. Thanks!

If you are doing multi-threaded Julia, you probably want openblas to use only 1 thread. We do this in Distributed, but not in multi-threading.

Would you have any suggestions regarding the Julia and BLAS setup for running AlphaZero.jl on a CPU-only machine? I am asking because I have tried to consult several very knowledgeable people about it, and it seems there are still some unexplored areas in this field, so I decided to mention the topic to you in the hope that you might be able to offer a suggestion.

The case is that, with a CPU-only setup using multiprocessing and one BLAS thread, I think AlphaZero.jl currently performs the first of the four parts of one training iteration on about 60 physical Ice Lake cores in a similar time to the calculations done on a high-end GPU (V100), which IMO is remarkable (and should potentially be even more remarkable with the next generation of x86 CPUs). However, with such a setup the next three parts run on only one core, so the overall timing for the iteration is rather long. Do you think it could be possible to improve this with the proper Julia / BLAS setup? I am not a professional coder, but I tried to provide as detailed information about my experience with AlphaZero.jl as possible in this thread: Questions on parallelization · Issue #71 · jonathan-laurent/AlphaZero.jl.

Also, taking the opportunity, and since this is partially a thread about the next Julia release: do you plan to fully support AVX512 in the future (AFAIK it is not currently fully supported, though I am not sure about that)? And, if I may ask, what is your position with regard to Julia emitting competitive SVE code on Neoverse N1 / A64FX (AFAIK there are currently some constraints), so that libraries like Octavian.jl could shine on those platforms?

Is this documented somewhere?

What does it mean? That if I call Distributed.addprocs(n), then the number of BLAS threads (for each of the distributed procs) automatically gets set to 1?
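One quick way to check (a sketch):

using Distributed, LinearAlgebra

addprocs(2)
@everywhere using LinearAlgebra
remotecall_fetch(() -> BLAS.get_num_threads(), 2)   # reports the BLAS thread count on worker 2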

As far as I know, this is a standard (default) setting on Julia 1.7.1 and recent earlier versions. As for 1.8, the last time I checked (Version 1.8.0-DEV.1177 (2021-12-25)) I was constantly hitting the default OS limits (1024 processes) when starting Julia with 1 thread on a 160-core (2 × 80) machine.

I'm not sure I quite understand. We now allow OpenBLAS to pick the number of threads, but that shouldn't run into the process limit. Can you file an issue describing exactly what is happening?

-viral

I don't think it is. It should probably be documented in the section on distributed computing in the manual.

The default OS process limit is 1024?
To explain: I work in HPC, and it is common to increase the limits for processes and pinned memory to large values, or even unlimited.