peakflops(10000)
gives me 1.94e11 on Julia 1.6.2 and 4.44e11 on
Version 1.8.0-DEV.702 (2021-10-11)
Commit bc89d8365d (0 days old master)
This is on a Ryzen 5950X machine. 16 cores and 32 threads. BLAS picks 32 threads in the latest build.
I recommend calling peakflops
multiple times instead of using a larger size, which takes a lot more memory and an awful lot more time just to give a worse peak-flops figure:
julia> @time maximum(peakflops() for _ in 1:10)
1.283524 seconds (43.27 k allocations: 614.688 MiB, 4.74% gc time, 1.99% compilation time)
1.5731878913261862e11
julia> @time maximum(peakflops(10_000) for _ in 1:10)
206.243965 seconds (415.71 k allocations: 14.926 GiB, 0.62% gc time, 0.03% compilation time)
1.0870695287018651e11
I get higher results from larger values.
Are you on a laptop that's overheating and thus throttling?
I use 16_000, and this is where I see the best results.
julia> using LinearAlgebra
julia> @time maximum(peakflops() for _ in 1:10)
0.752822 seconds (3.66 M allocations: 801.841 MiB, 7.95% gc time, 76.95% compilation time)
1.6886556011175947e12
julia> @time maximum(peakflops(10_000) for _ in 1:10)
11.147155 seconds (615.83 k allocations: 14.936 GiB, 1.96% gc time, 0.54% compilation time)
1.9820846150376135e12
julia> @time peakflops(16_000)
4.226806 seconds (11 allocations: 3.815 GiB, 0.38% gc time)
2.0390490771767915e12
julia> versioninfo()
Julia Version 1.8.0-DEV.660
Commit 153db7a7a8* (2021-10-05 16:17 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Only takes a single run for 16_000 to win. It consistently gets over 2e12.
This is with an 18 core CPU that has AVX512.
@tbeason what do you get if you set BLAS.set_num_threads(16)? The 10980XE should be close to 2x faster than the 5950X, not over 4x faster.
julia> estimate_peak(; GHz = 4, ncores = (Sys.CPU_THREADS)::Int ÷ 2, fma_per_cycle=2, vector_width=8) = GHz*ncores*fma_per_cycle*2*vector_width # 2 = ops per fma
estimate_peak (generic function with 1 method)
julia> estimate_peak() # *1e9 = 2.3e12 estimated peak for a 10980XE clocked at 4GHz
2304
julia> estimate_peak(ncores=16, vector_width=4) # *1e9, so 1e12 estimated peak
1024
(*1e9 comes from GHz = 1e9 Hz)
Meaning you should probably be able to get close to 1e12 instead of < 5e11.
Forcing BLAS to use 16 threads puts the results more in the 4e11 range instead of 4.4e11. Using 16000 instead of 10000 in peakflops does result in a slight increase, but still far short of 1e12.
I'm curious: how about 8 threads?
Wondering if only a single core complex can get comparable performance.
Also, if you're willing to try Octavian (with a 7980XE):
julia> using Octavian
julia> M = K = N = 10_000; T = Float64; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);
julia> C1 = @time(A*B);
1.659304 seconds (45 allocations: 762.942 MiB, 2.00% gc time, 0.51% compilation time)
julia> @time(matmul!(C0,A,B)) ≈ C1
24.181371 seconds (38.73 M allocations: 2.312 GiB, 1.41% gc time, 94.53% compilation time)
true
julia> @time(matmul!(C0,A,B)) ≈ C1
1.240348 seconds
true
julia> 2M*K*N / 1.240348
1.6124506993198684e12
julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.5091441211696343e12
julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.3809532134370098e12
julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
1.6087288779508767e12
julia> BLAS.set_num_threads(36)
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.3121943004311746e12
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.3067611981202083e12
julia> BLAS.set_num_threads(18)
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.5145908595120166e12
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.6394751354135144e12
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
1.6367653198510771e12
julia> BLAS.set_num_threads(8)
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
7.772799503515851e11
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
7.769725165120771e11
julia> versioninfo()
Julia Version 1.8.0-DEV.709
Commit 1389c2fc4a* (2021-10-12 16:34 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
Octavian got competitive performance with OpenBLAS here; I'm curious whether it's able to do any better or also stumbles.
A side comment / feature request: We have JULIA_EXCLUSIVE=1
for compact pinning of Julia threads (i.e. pin 1:N Julia threads to the first 1:N cores). If we had more information about the system (sockets / NUMA domains), we could also offer a "scattered pinning", where Julia threads are pinned to cores from both sockets in an alternating fashion. This can have a big influence on performance (MFlops/s); see e.g. GitHub - JuliaPerf/BandwidthBenchmark.jl: Measuring memory bandwidth using TheBandwidthBenchmark (also check it out if you just like unicode plots).
But let me stop derailing this thread
It would be important for the BLAS library to be aware of the type of pinning, if possible.
You'd want to divide N among the separate L3 caches, and have all cores that share an L3 iterate over their N-block together.
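As a rough sketch of that blocking idea (the domain count here is an assumption for a 5950X-like topology, not Octavian's actual scheduling):

```julia
# Hypothetical sketch: give each L3 domain ("CCD") a contiguous block of the
# N columns of C; the cores sharing that L3 would then iterate over their
# block together instead of pulling B/C columns across domains.
n_l3 = 2                      # assumed number of L3 domains (e.g. 2 CCDs)
N = 10_000                    # columns of C
col_blocks = [div((d - 1) * N, n_l3) + 1 : div(d * N, n_l3) for d in 1:n_l3]
# col_blocks == [1:5000, 5001:10000]
```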
Octavian is currently very conservative because it assumes no thread pinning, meaning it'd underperform on systems with a split L3, like the 5950X.
I'm wondering just how badly it underperforms. OpenBLAS seems really bad on the 5950X as well.
Another thing to try on the 5950X is MKL. I wonder if that does better.
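For reference, a minimal way to try that (this assumes the MKL.jl package is installed; it swaps the BLAS backend through libblastrampoline on Julia ≥ 1.7, so it should be loaded before benchmarking):

```julia
using MKL               # replaces OpenBLAS as the active BLAS backend
using LinearAlgebra
BLAS.get_config()       # should now list an MKL library rather than libopenblas
peakflops(10_000)
```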
Here you go.
julia> Threads.nthreads()
32
julia> using Octavian
julia> M = K = N = 10_000; T = Float64; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);
julia> C1 = @time(A*B);
4.784390 seconds (2.42 M allocations: 883.761 MiB, 0.19% gc time, 7.76% compilation time)
julia> @time(matmul!(C0,A,B)) ≈ C1
14.185721 seconds (28.96 M allocations: 1.500 GiB, 0.80% gc time, 68.90% compilation time)
true
julia> @time(matmul!(C0,A,B)) ≈ C1
4.304452 seconds
true
julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
4.67596191784391e11
julia> 2M*K*N / @elapsed(matmul!(C0,A,B))
4.658076784902242e11
julia> using LinearAlgebra
julia> BLAS.get_num_threads()
32
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.586317908376697e11
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.5314859311069586e11
julia> BLAS.set_num_threads(16)
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.0128038136403345e11
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
3.914658181149202e11
julia> BLAS.set_num_threads(8)
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
4.0726904378339746e11
julia> 2M*K*N / @elapsed(mul!(C0,A,B))
3.91119468204246e11
julia> versioninfo()
Julia Version 1.8.0-DEV.702
Commit bc89d8365d (2021-10-11 05:42 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Thanks! I wonder if this is a Windows problem, e.g. maybe it's restricting the process to a single 8-core complex regardless of the number of threads. Can you use a system monitor to watch core use?
The fact that you get the same performance out of 8 cores as you do out of 16 seems to imply that.
Anyone on Linux with a 5950X?
FWIW, HPC clusters and cloud environments likely already set the process affinity compatible with the underlying hardware and your resource request. IMHO, this is a more important limit we need to respect. Luckily, it seems OpenBLAS already respects the process affinity (https://github.com/JuliaLang/julia/pull/42340#issuecomment-932835362).
I thought we disable CPU affinity in general. Maybe it gets respected if set at the OS level?
Maybe you are referring to https://github.com/JuliaLang/julia/pull/9639 ? It's a different issue. I was commenting that OpenBLAS respects the user-specified affinity to restrict the number of threads it uses. So it's different from #9639, which fixed the problem that julia didn't respect the user-specified affinity and even was overwriting it.
Hi, thanks so much for making it happen. Can I ask if the maximum number of threads has been decreased from 4096 to 1024? It seems to look so:
julia -p 128
returns:
OpenBLAS blas_thread_init: pthread_create failed for thread 127 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 1024 max
ERROR: TaskFailedException
I am on Julia Version 1.8.0-DEV.829 (2021-10-27), Commit d71b77d7b6 (6 days old master). I will try the latest version shortly; I cannot do it now, thus the question.
Edit 1:
It seems that the situation is similar with julia -p 64.
Edit 2:
I managed to upgrade. The situation is similar on Version 1.8.0-DEV.875 (2021-11-02), Commit 7eba9c1d76 (0 days old master).
Edit 3:
I investigated the topic slightly further. When trying julia -p 128 or julia -p 64 I was getting the readings presented above. As I now understand it, OpenBLAS spawns a thread count equal to the number of available logical processors by default, which is higher than the RLIMIT_NPROC set by default on my machine. I understand that one can:
- limit the number of OpenBLAS threads by using export OMP_NUM_THREADS=1, or start julia with e.g. OMP_NUM_THREADS=1 julia -p 64;
- increase the limits for the user (e.g. ubuntu) in /etc/security/limits.conf and/or system wide, potentially in /etc/sysctl.conf.
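A quick way to confirm the first workaround from inside a session started as, e.g., OMP_NUM_THREADS=1 julia (the launch command is from the list above):

```julia
using LinearAlgebra
@show BLAS.get_num_threads()   # reflects the environment cap set at launch
BLAS.set_num_threads(1)        # the same cap can also be applied at runtime
```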
I'm pretty sure we are going to go with 512 threads on openblas, because of this:
https://github.com/xianyi/OpenBLAS/issues/3403
If you are doing multi-threaded Julia, you probably want openblas to use only 1 thread. We do this in Distributed, but not in multi-threading.
-viral
I'm pretty sure we are going to go with 512 threads on openblas […]
Hi! Sounds cool. Thanks!
If you are doing multi-threaded Julia, you probably want openblas to use only 1 thread. We do this in Distributed, but not in multi-threading.
Would you have any suggestion regarding a Julia and BLAS setup for AlphaZero.jl run on a CPU-only machine? The reason I am asking is that I have tried to consult several very knowledgeable people, and it seems there are still some areas to be explored in this field, so I decided to mention this topic to you in the hope that you could provide a suggestion.
The case is that on a CPU-only machine with multiprocessing and one BLAS thread, I think AlphaZero.jl currently performs the first of the four parts of one training iteration on about 60 physical Ice Lake cores in a timing similar to calculations done on a high-end GPU (V100), which IMO is remarkable (and should potentially be even more remarkable with the next generation of x86 CPUs). However, with such a setup the next three parts run on only one core, so the overall timing for the iteration is rather long. Do you think it could be possible to somehow improve this with the proper Julia / BLAS setup? I am not a professional coder, but I tried to provide as detailed info on my experience with AlphaZero.jl as possible in this thread: Questions on parallelization · Issue #71 · jonathan-laurent/AlphaZero.jl · GitHub.
Also, having the opportunity, and as this is partially a thread about the next Julia release: do you plan to fully support AVX512 in the future (AFAIK it is not currently fully supported, though I am not sure about it)? And if I may ask, what is your position with regard to Julia emitting "competitive SVE code" on Neoverse N1 / A64FX (AFAIK there are currently some constraints), so that libraries like Octavian.jl could shine on those platforms?
Is this documented somewhere?
What does it mean? That if I set Distributed.addprocs(n), then the number of BLAS threads (for each of the distributed procs) automatically gets set to 1?
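A small sketch to check what the workers actually use (this only queries the setting; worker id 2 is assumed to exist after addprocs(2)):

```julia
using Distributed
addprocs(2)
@everywhere using LinearAlgebra
# Ask one worker how many threads its BLAS is configured to use.
remote_threads = fetch(@spawnat 2 BLAS.get_num_threads())
@show remote_threads
```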
As far as I know this is a standard (default) setting on Julia 1.7.1 and recent earlier versions. As for 1.8, the last time I checked (Version 1.8.0-DEV.1177 (2021-12-25)) I was constantly hitting the default OS limits (1024 processes) when starting Julia with 1 thread on a 160-core (2 x 80) machine.
I'm not sure I quite understand. We now allow OpenBLAS to pick the number of threads, but that shouldn't run into the process limit. Can you file an issue describing exactly what is happening?
-viral
I don't think it is. It should probably be documented in the section on distributed computing in the manual.
The default OS process limit is 1024?
To explain, I work in HPC, and it is common to increase the limits for processes and pinned memory to large values or even unlimited.