FWIW, I use Clear Linux on my machine with a 10980XE.
I'm not sure how much difference this will make with Julia, given that you aren't running software from their repositories (and they don't provide Julia, meaning you'll either be running the official binaries, or building from source*), nor are you running software compiled using their aggressive default CFLAGS environment variables.
But there may still be some random settings that make a difference, like setting transparent huge pages to madvise by default. Foobarlv2 mentioned at least one distro (I don't recall which) with a different setting.
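For what it's worth, you can check what the kernel is currently set to straight from Julia by reading the standard Linux sysfs knob (just a quick sketch, Linux-only, and nothing Clear Linux specific about the path; the bracketed entry in the output is the active mode):

julia> print(read("/sys/kernel/mm/transparent_hugepage/enabled", String))  # on Clear Linux, madvise should be the bracketed (active) choice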
If you do try it, and do try building from source, you need to add F_COMPILER=GFORTRAN to OpenBLAS's flags in deps/blas.mk, because OpenBLAS's build system mistakes Clear Linux's gfortran for ifort and thus passes the wrong compiler flags.
I should file an issue with OpenBLAS.
The only other problem I've had getting set up with the distro is that its fontconfig configuration is in a different place than some software expects, so you need to set a path to it for VegaLite.jl to find it and let you save plots, for example.
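Something along these lines should do it, pointing fontconfig at the config file before loading the plotting stack. The environment variable is standard fontconfig; the exact path below is an assumption based on Clear Linux's stateless /usr/share/defaults layout, so check where fonts.conf actually lives on your install:

julia> ENV["FONTCONFIG_FILE"] = "/usr/share/defaults/fonts/fonts.conf"  # assumed location -- verify on your system
"/usr/share/defaults/fonts/fonts.conf"

julia> using VegaLite  # with fontconfig found, save("plot.png", plt) and friends should work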
Otherwise, I like it. Simple, up to date, reliable.
Not an aesthetically pleasing setup:
julia> using LinearAlgebra
julia> BLAS.vendor()
:openblas64
julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)
julia> @time peakflops(16_000) # I'd already precompiled this function, but forgot to set num threads
5.294615 seconds (12 allocations: 3.815 GiB, 0.29% gc time)
1.6221124310445762e12
After an add https://github.com/JuliaComputing/MKL.jl from the Pkg REPL:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)
julia> BLAS.vendor()
:mkl
julia> @time peakflops(16_000)
4.685193 seconds (3.08 M allocations: 3.955 GiB, 2.17% gc time)
2.1108278728990073e12
julia> @time peakflops(16_000)
4.116164 seconds (12 allocations: 3.815 GiB, 1.55% gc time)
2.1415206310497896e12
That's over 2.1 teraflops. I've overclocked it to 4.1 GHz all-core AVX-512 (all-core SSE and AVX(2) speeds are 4.6 and 4.3 GHz). That means the theoretical peak is
julia> 4.1 * 18 * 16 * 2
2361.6
julia> 4.1 * 18 * 16 * 2 / 1000
2.3615999999999997
2.36 teraflops. The numbers are 4.1 GHz (4.1 billion clock cycles / second) * 18 physical cores * 16 flops per double-precision AVX-512 FMA * 2 FMAs / clock cycle = 2.36 trillion double-precision floating point operations / second.
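As a generic version of that arithmetic (the function and argument names here are mine, purely for illustration):

julia> peak_gflops(ghz, cores, flops_per_fma, fma_units) = ghz * cores * flops_per_fma * fma_units
peak_gflops (generic function with 1 method)

julia> peak_gflops(4.1, 18, 16, 2)  # 16 = 8 doubles per AVX-512 register * 2 flops per FMA
2361.6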
Interestingly, MKL may use Strassen on Haswell at large sizes, because on my employer's HPC cluster I got more flops from MKL than the above calculation suggested was possible. Or maybe I looked up the wrong CPU model when determining the specs.
How well these CPUs do in various benchmarks, however, will depend on what the bottlenecks are and how they perform with respect to those bottlenecks. The 3950X has a larger L3 cache than the 10980XE, so the 3950X will perform better in a benchmark dominated by memory bandwidth whose working set fits in its cache but not in the 10980XE's, while the 10980XE will do better if the working set fits in neither CPU's cache, because it has more memory channels.
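A rough sketch of how to see that kind of cache sensitivity (the gbps function and the specific sizes are mine, just for illustration; a reduction like sum is bandwidth-bound, and the cutoffs assume roughly 64 MB of L3 on the 3950X and about 25 MB on the 10980XE):

julia> function gbps(n; reps = 100)  # effective read bandwidth for summing n Float64s, in GB/s
           x = rand(n)
           s = 0.0
           t = @elapsed for _ in 1:reps
               s += sum(x)
           end
           (8n * reps / t / 1e9, s)  # return s too, so the work can't be skipped
       end
gbps (generic function with 1 method)

julia> gbps(500_000)       # ~4 MB: fits comfortably in either chip's L3
julia> gbps(4_000_000)     # ~32 MB: fits in the 3950X's L3 but not the 10980XE's
julia> gbps(100_000_000)   # ~800 MB: fits in neither, so DRAM bandwidth (and channel count) dominates

The first call also pays compilation time, so run each size twice before reading anything into the numbers.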