Zen3, 32-core Epyc:
julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);
julia> A = rand(10_000,10_000); B = similar(A);
julia> @time mul!(B, A, A);
2.622767 seconds (2.42 M allocations: 123.016 MiB, 18.41% compilation time)
julia> @time mul!(B, A, A);
1.939154 seconds
julia> using MKL
julia> @time mul!(B, A, A);
1.430667 seconds
julia> @time mul!(B, A, A);
1.258703 seconds
julia> using Octavian
julia> @time matmul!(B, A, A);
16.063869 seconds (28.60 M allocations: 1.518 GiB, 1.96% gc time, 90.36% compilation time)
julia> @time matmul!(B, A, A);
1.301244 seconds
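For scale, an n×n matmul performs roughly 2n³ floating-point operations, so you can turn these timings into effective throughput with a one-line helper (a quick sketch, not part of the session above; the time plugged in is MKL's best run):

julia> gflops(n, t) = 2n^3 / (t * 1e9);  # ~2n³ flops per n×n matmul
julia> gflops(10_000, 1.258703)  # MKL's best time on the Epyc, ≈ 1589 GFLOPS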
Cascade Lake-X, 18-core AVX512:
julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);
julia> A = rand(10_000,10_000); B = similar(A);
julia> @time mul!(B, A, A);
1.855408 seconds (2.51 M allocations: 124.746 MiB, 33.53% compilation time)
julia> @time mul!(B, A, A);
1.130044 seconds
julia> using MKL
julia> @time mul!(B, A, A);
1.129982 seconds
julia> @time mul!(B, A, A);
1.129533 seconds
julia> using Octavian
julia> @time matmul!(B, A, A);
26.057493 seconds (38.22 M allocations: 1.960 GiB, 2.76% gc time, 94.77% compilation time)
julia> @time matmul!(B, A, A);
1.273121 seconds
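Note that the first @time in each session is dominated by compilation, which is why it's so much slower. For steadier numbers, something like BenchmarkTools.jl runs many samples and excludes compilation; a minimal sketch, assuming the package is installed:

julia> using BenchmarkTools
julia> @btime mul!($B, $A, $A);     # interpolate globals with $ for accurate timing
julia> @btime matmul!($B, $A, $A);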
When I benchmarked earlier, performance was erratic across matrix sizes on the Epyc. That could just be because other people were using the machine at the same time, so I wouldn't read much into it, beyond noting that it'd be hard to get clean plots of performance versus size while the server is shared.
It's 2 × 256-bit FMA/cycle/core × 32 cores versus 2 × 512-bit FMA/cycle/core × 18 cores, so it's to be expected that the 10980XE does comparably well.
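Spelling that arithmetic out (total FMA bits issued per clock across the whole chip, ignoring clock-speed differences):

julia> 2 * 256 * 32  # Zen3 Epyc: 2 FMA ports × 256 bits × 32 cores
16384
julia> 2 * 512 * 18  # 10980XE: 2 FMA ports × 512 bits × 18 cores
18432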
But note, of course, that this is a very specific workload!
The 11900K is supposed to have 10%+ higher IPC than the 10980XE, but for matrix multiplication it has half the per-core throughput: while it has more execution units overall, it can only issue one 512-bit FMA per cycle instead of two. The same goes for the Alder Lake CPUs.
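Per core, that halving is just FMA width per clock:

julia> 2 * 512  # 10980XE: two 512-bit FMAs per cycle
1024
julia> 1 * 512  # 11900K: a single 512-bit FMA per cycle (Alder Lake's 2 × 256-bit, with AVX512 disabled, comes to the same)
512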
Before spending a bunch of money on a CPU and motherboard, I'd consider all the kinds of workloads you care about, and how well each of your options handles them.
The 10980XE came out in 2019, and was already based on an old architecture even then: it's basically the same as the 7980XE, which was released in 2017.
Intel will hopefully come out with a Sapphire Rapids version this year, which would be the server counterpart of Alder Lake; it should also have 2 × 512-bit FMA units, along with a 2 MiB L2 cache and (at the top of the stack) more cores than the 10980XE.
Some rumors also suggest AMD might support AVX512 in Zen4 (in which case: will the FMAs be full-rate, or half-rate like in Ice Lake Client, Rocket Lake, Tiger Lake, and Alder Lake before it got disabled?).
So while I do still think the 10980XE is a great chip for SIMD numerical workloads (and especially matmul), and you should be able to find one for under $800, that plus a motherboard and DDR4 that won't carry over to future chips is still a lot of money. I'd suggest waiting a little longer, since much newer chip architectures are on the horizon.