How to choose a workstation for optimal performance

Aside from AVX512, the biggest improvement in Rocket Lake over the previous generation is that its integrated graphics are much better.

Still worth pointing out that my Tiger Lake laptop has an 1165G7 which comes with 96 execution units, vs 32 in Rocket Lake. Other than meaning the laptop is 3x faster for graphics, I’m not sure what the ramifications are.
Does it matter if you don’t play games or use oneAPI.jl? Will you have problems streaming videos?

Looking online, if FP32 throughput is an indicator:
96 EU of Xe graphics: 1_690 GFLOPS
GTX 650 (which seems to sell for $60-$100): 812.5 GFLOPS
Presumably the 32 EU in Rocket Lake will get you around 560 GFLOPS.
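
(The ~560 figure is just a back-of-the-envelope estimate, assuming FP32 throughput scales linearly with EU count:)

1_690 * 32 / 96   # ≈ 563 GFLOPS for 32 EU, if throughput is linear in EU count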

Also on AVX512, it’s worth pointing out that there is segmentation in throughput between Intel’s HEDT/high-end server chips and their low-end server and consumer chips.
The chips can see a huge benefit from AVX512, e.g. in AnandTech’s test the 11700K, at 290 W power draw, was more than 5x faster than the competition.

For example:

Matmul benchmarks on the 10980XE-HEDT (compilation, 32x32, 48x48, 72x72)
julia> using LoopVectorization, BenchmarkTools

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.492570 seconds (2.48 M allocations: 132.454 MiB, 9.66% gc time, 99.92% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
 10.613965 seconds (19.79 M allocations: 1.114 GiB, 4.63% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     488.267 ns (0.00% GC)
  median time:      513.287 ns (0.00% GC)
  mean time:        513.106 ns (0.00% GC)
  maximum time:     955.810 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     195

julia> 2e-9M*K*N/488.267e-9
134.22164512449132

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000032 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.107 μs (0.00% GC)
  median time:      1.158 μs (0.00% GC)
  mean time:        1.161 μs (0.00% GC)
  maximum time:     6.008 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.107e-6
199.80487804878052

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000070 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.894 μs (0.00% GC)
  median time:      1.982 μs (0.00% GC)
  mean time:        1.984 μs (0.00% GC)
  maximum time:     9.586 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.894e-6
394.13727560718064
Matmul benchmarks on the 1165G7-laptop (compilation, 32x32, 48x48, 72x72)
julia> @time using LoopVectorization
  1.555713 seconds (2.88 M allocations: 169.122 MiB, 4.68% gc time)

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.411346 seconds (2.47 M allocations: 131.872 MiB, 10.45% gc time, 99.94% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
  8.947281 seconds (19.31 M allocations: 1.087 GiB, 4.63% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     657.172 ns (0.00% GC)
  median time:      708.396 ns (0.00% GC)
  mean time:        713.941 ns (0.00% GC)
  maximum time:     5.318 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     169

julia> 2e-9M*K*N/657.172e-9
99.72427309745395

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000047 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.742 μs (0.00% GC)
  median time:      1.931 μs (0.00% GC)
  mean time:        1.932 μs (0.00% GC)
  maximum time:     5.711 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.742e-6
126.97129735935708

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.003418 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.222 μs (0.00% GC)
  median time:      3.493 μs (0.00% GC)
  mean time:        3.491 μs (0.00% GC)
  maximum time:     9.560 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> 2e-9M*K*N/3.222e-6
231.6871508379889
Apple M1 Native
julia> using LoopVectorization, BenchmarkTools
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.292382 seconds (2.47 M allocations: 131.780 MiB, 8.28% gc time, 99.92% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
  5.170382 seconds (16.66 M allocations: 948.861 MiB, 5.60% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.421 μs (0.00% GC)
  median time:      1.429 μs (0.00% GC)
  mean time:        1.434 μs (0.00% GC)
  maximum time:     2.754 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.421e-6
46.11963406052076

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000045 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.833 μs (0.00% GC)
  median time:      3.990 μs (0.00% GC)
  mean time:        3.998 μs (0.00% GC)
  maximum time:     7.911 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> 2e-9M*K*N/3.833e-6
57.705191755804854

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000278 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.100 μs (0.00% GC)
  median time:      6.650 μs (0.00% GC)
  mean time:        6.662 μs (0.00% GC)
  maximum time:     14.750 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> 2e-9M*K*N/6.1e-6
122.37639344262297

julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 8

The laptop compiles faster (8.94 vs 10.6 seconds “time to first matmul”), but runs markedly slower, e.g. 231.7 GFLOPS vs 394 GFLOPS for 72x72 matrices.
Using the master branch of LinuxPerf (I omitted a warmup run), 10980XE (HEDT, Cascade Lake):

julia> using LinuxPerf

julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.34e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.32e+09   75.0%  #  2.5 insns per cycle
│ branch-instructions      2.13e+08   75.0%  #  2.6% of instructions
└ branch-misses            1.90e+06   75.0%  #  0.9% of branch instructions
┌ task-clock               8.18e+08  100.0%  # 817.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.97e+08   25.0%  # 26.3% of dcache loads
│ L1-dcache-loads          2.65e+09   25.0%
└ L1-icache-load-misses    3.87e+04   25.0%
┌ dTLB-load-misses         0.00e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.65e+09   25.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1165G7 (laptop, Tiger Lake):

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               5.36e+09   50.1%  #  4.0 cycles per ns
┌ instructions             8.25e+09   75.1%  #  1.5 insns per cycle
│ branch-instructions      1.98e+08   75.1%  #  2.4% of instructions
└ branch-misses            2.92e+06   75.1%  #  1.5% of branch instructions
┌ task-clock               1.35e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.85e+08   75.1%  # 22.2% of dcache loads
│ L1-dcache-loads          2.63e+09   75.1%
└ L1-icache-load-misses    2.77e+04   75.1%
┌ dTLB-load-misses         1.01e+03   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.67e+09   24.9%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

While the clock speed is similar, the 10980XE hits 2.5 instructions per clock, vs just 1.5 on the laptop.
Tiger Lake (and Rocket Lake) actually have larger reorder buffers, and probably better branch predictors etc., than Cascade Lake. However, they have just a single port capable of performing many common 512-bit instructions, like the fused multiply add, while Cascade Lake has 2.
The link shows Cascade Lake can use ports 0 or 5, while Ice Lake can only use port 0. As a result, the reciprocal throughput is 0.5 for Cascade Lake, and 1 for Ice Lake.
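
As a rough sanity check (a back-of-the-envelope sketch, assuming the ~4.1 and ~4.0 cycles/ns reported by LinuxPerf above and the 4 threads used), the FMA port counts alone predict roughly the gap observed at 72x72:

# peak GFLOPS ≈ threads * FMA ports * (8 Float64 lanes per 512-bit vector) * (2 flops per FMA) * GHz
4 * 2 * 8 * 2 * 4.1   # Cascade Lake 10980XE: ≈ 525 GFLOPS peak, ~394 observed
4 * 1 * 8 * 2 * 4.0   # Tiger Lake 1165G7:    = 256 GFLOPS peak, ~232 observed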

But linear algebra/matmul tends to be an extreme case. The laptop fares much better in these special-function benchmarks, for example:

Cascade Lake (HEDT)
julia> using VectorizationBase, SLEEFPirates

julia>  vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<8.798383134546386, 9.022828046665666, 7.595386605047971, 8.903364923350454, 1.439724621424312, 8.799483255120942, 7.529824692778755, 9.398678780573114>
Vec{8, Float64}<1.0919972116624876, 8.5262997763817, 1.3898563445399836, 3.1224598343675214, 5.264211844189135, 4.618134635075415, 0.09844041961554195, 8.096211429945946>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.034 ns (0.00% GC)
  median time:      8.296 ns (0.00% GC)
  mean time:        8.263 ns (0.00% GC)
  maximum time:     24.097 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.918 ns (0.00% GC)
  median time:      5.999 ns (0.00% GC)
  mean time:        6.023 ns (0.00% GC)
  maximum time:     22.572 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.931 ns (0.00% GC)
  median time:      4.958 ns (0.00% GC)
  mean time:        4.974 ns (0.00% GC)
  maximum time:     20.548 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.426 ns (0.00% GC)
  median time:      17.515 ns (0.00% GC)
  mean time:        17.544 ns (0.00% GC)
  maximum time:     34.147 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
Tiger Lake (laptop)
julia> using VectorizationBase, SLEEFPirates

julia>  vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<3.3090933998106964, 3.2408725486003176, 9.336803901394287, 0.920913844633362, 8.254820718984012, 5.840193768054263, 4.970561992905824, 3.856892363553288>
Vec{8, Float64}<2.4728363240217877, 4.159144687195548, 4.057230309919828, 2.8915047878074995, 3.2782916658210492, 3.5649015726035516, 5.126827865591639, 4.177712129409485>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.281 ns (0.00% GC)
  median time:      8.680 ns (0.00% GC)
  mean time:        8.731 ns (0.00% GC)
  maximum time:     87.352 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.542 ns (0.00% GC)
  median time:      6.999 ns (0.00% GC)
  mean time:        6.925 ns (0.00% GC)
  maximum time:     22.851 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.542 ns (0.00% GC)
  median time:      6.860 ns (0.00% GC)
  mean time:        6.903 ns (0.00% GC)
  maximum time:     22.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.802 ns (0.00% GC)
  median time:      16.948 ns (0.00% GC)
  mean time:        16.948 ns (0.00% GC)
  maximum time:     32.268 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
Threadripper 1950X (Zen1 Ryzen)
julia> using VectorizationBase, SLEEFPirates, BenchmarkTools

julia> vu = Vec(10 .* rand(16)...)
4 x Vec{4, Float64}
Vec{4, Float64}<2.9854020761951205, 0.5850049491494591, 8.048794496707682, 2.4447908274683128>
Vec{4, Float64}<8.989862723010972, 3.432180061119565, 5.218481943492845, 1.2908407939196165>
Vec{4, Float64}<7.147163281833258, 3.1638484401188416, 1.7880777724675267, 8.886571714674718>
Vec{4, Float64}<8.27759419030089, 3.093939597569817, 7.528383369444482, 1.3747962690374105>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.613 ns (0.00% GC)
  median time:      62.459 ns (0.00% GC)
  mean time:        62.445 ns (0.00% GC)
  maximum time:     123.043 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     982

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.950 ns (0.00% GC)
  median time:      31.336 ns (0.00% GC)
  mean time:        31.293 ns (0.00% GC)
  maximum time:     60.194 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.472 ns (0.00% GC)
  median time:      32.920 ns (0.00% GC)
  mean time:        32.892 ns (0.00% GC)
  maximum time:     85.069 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.570 ns (0.00% GC)
  median time:      58.309 ns (0.00% GC)
  mean time:        58.315 ns (0.00% GC)
  maximum time:     98.537 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983
Apple M1 Native
julia> using VectorizationBase, SLEEFPirates

julia> vu = Vec(10 .* rand(16)...)
8 x Vec{2, Float64}
Vec{2, Float64}<8.349313127641388, 7.039690217372789>
Vec{2, Float64}<6.433077972087, 2.863664951775906>
Vec{2, Float64}<1.4755302144925642, 8.827766753134048>
Vec{2, Float64}<6.467495205043918, 7.489247759558242>
Vec{2, Float64}<1.755162644186694, 2.895612011356581>
Vec{2, Float64}<8.843342380245545, 9.290792214755223>
Vec{2, Float64}<6.224625954236844, 1.4039314108511536>
Vec{2, Float64}<5.011570176713553, 1.2622980565493647>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.096 ns (0.00% GC)
  median time:      38.223 ns (0.00% GC)
  mean time:        38.389 ns (0.00% GC)
  maximum time:     56.409 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.040 ns (0.00% GC)
  median time:      20.248 ns (0.00% GC)
  mean time:        20.322 ns (0.00% GC)
  maximum time:     32.899 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.790 ns (0.00% GC)
  median time:      19.956 ns (0.00% GC)
  mean time:        19.975 ns (0.00% GC)
  maximum time:     42.877 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.631 ns (0.00% GC)
  median time:      34.799 ns (0.00% GC)
  mean time:        34.865 ns (0.00% GC)
  maximum time:     47.530 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 8

You could run these benchmarks on your own system to get an idea of performance changes. Rocket Lake should of course be faster than Tiger Lake (similar arch, faster clock speed).
E.g., sincos on Tiger Lake:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+09   50.0%  #  4.4 cycles per ns
┌ instructions             8.26e+09   75.2%  #  2.8 insns per cycle
│ branch-instructions      1.24e+09   75.2%  # 15.1% of instructions
└ branch-misses            1.60e+06   75.2%  #  0.1% of branch instructions
┌ task-clock               6.79e+08  100.0%  # 678.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.62e+07   75.3%  #  3.2% of dcache loads
│ L1-dcache-loads          2.39e+09   75.3%
└ L1-icache-load-misses    4.88e+04   75.3%
┌ dTLB-load-misses         1.62e+06   24.7%  #  0.1% of dTLB loads
└ dTLB-loads               2.35e+09   24.7%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cascade Lake:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.18e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.27e+09   75.0%  #  2.6 insns per cycle
│ branch-instructions      1.24e+09   75.0%  # 15.0% of instructions
└ branch-misses            1.49e+06   75.0%  #  0.1% of branch instructions
┌ task-clock               7.83e+08  100.0%  # 783.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.56e+07   25.0%  #  3.1% of dcache loads
│ L1-dcache-loads          2.45e+09   25.0%
└ L1-icache-load-misses    3.03e+05   25.0%
┌ dTLB-load-misses         2.41e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.45e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Tiger Lake had both higher IPC and clock speeds. This is despite the fact that all of these special functions are also full of fma instructions (polynomials) – just not quite as full as matmul is.
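
(For reference, the derived figures in these @pstats summaries follow directly from the raw counters; using the Tiger Lake sincos run above, with task-clock reported in nanoseconds:)

8.26e9 / 2.98e9   # instructions / cpu-cycles ≈ 2.8 insns per cycle
2.98e9 / 6.79e8   # cpu-cycles / task-clock   ≈ 4.4 cycles per ns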

Looking at the generated code of sincos, I think I could do a better job optimizing it for AVX512.
log and exp2 are both probably much faster than what you can get without AVX512.
But the best I can do benchmark-wise for comparison is Haswell, which came out in 2013. I’d be interested in comparisons with Zen 2 and Zen 3.

Note that the Ryzen (3/5)900X and (3/5)950X have 64 MiB of L3 cache. Even though they have only 2 memory channels, hopefully you could work around this by chunking your data and making use of the L3.
You can fit a lot of data in 64 MiB. Many compute heavy workloads would fit in it entirely.

When talking BLAS,

julia> sqrt(64 * (1<<20) / 8 / 3)
1672.184997739983

You can fit three 1600x1600 matrices inside 64 MiB.
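
As a very rough illustration of what that chunking could look like, here is a minimal sketch (blocked_mul! and blk are hypothetical names, and a real BLAS does far more sophisticated packing and blocking than this; in practice you would likely pick a block size somewhat smaller than 1600 so the blocks being touched fit comfortably in L3 alongside everything else):

using LinearAlgebra

# Minimal L3-oriented tiling sketch: update C one block at a time so the C block
# and the A/B panels it needs can stay cache-resident while they are reused.
function blocked_mul!(C, A, B; blk = 1600)
    M, N = size(C); K = size(A, 2)
    fill!(C, zero(eltype(C)))
    for j in 1:blk:N, k in 1:blk:K, i in 1:blk:M
        ir = i:min(i + blk - 1, M)
        jr = j:min(j + blk - 1, N)
        kr = k:min(k + blk - 1, K)
        # C[ir, jr] += A[ir, kr] * B[kr, jr]; 5-arg mul! avoids temporaries
        @views mul!(C[ir, jr], A[ir, kr], B[kr, jr], true, true)
    end
    return C
end

Whether this pays off depends on how the surrounding code reuses the blocks; the point is only that 64 MiB is enough to keep sizeable working sets on-die despite the 2 memory channels.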

I hope one day we’ll be able to buy something like the A64FX, which has 1 TB/second memory bandwidth, many times more than any x64 CPU.

Many models fit with MCMC, or models involving ODEs, can be extremely compute intensive without requiring that much memory.

8 Likes

And make sure to get 4 sticks (I just heard of someone getting a quad channel system with two RAM sticks and they were surprised they didn’t get the expected bandwidth).

2 Likes

You give a lot of information, but I think it is better to recap.

  1. The AnandTech AVX test was discussed there in detail. It is faulty. There is no case where AVX512 with a single port can be that much better than Zen 3. It seems the code path there is only really optimized for AVX in the AVX512 case. It is an outlier, an extreme one.
  2. The problem with your logic about the L3 is that you assume the data is already there. But it is not there if it is a BLAS operation on data which lives in memory and is large. L3, or any cache for that matter, helps in one of 2 cases:
    • The data is already in cache from previous calculations.
    • Some pre-caching mechanism is employed and the current pipeline of the CPU is loaded enough to allow data to reach the cache before it is needed (hence causing the previous clause to hold).
      Always remember: caches are mainly for latency. For real throughput, look at GPU memory hierarchies. There are almost no caches, just wide and fast memory. This is what’s needed in most cases for a throughput-bound workload.

Regarding SIMD capability: modern Ryzen has 2.5 units capable of AVX2 (256 bit). For FMA there are 2 execution ports, so each core can handle 512 bits/cycle (just for this simple model; in reality there is latency and throughput per instruction). Cascade Lake has 2 AVX512 ports, so each core can handle 1024 bits/cycle.
So for very, very efficient code Cascade Lake can have a factor-of-2 advantage. But in practice, in the previous generation the cost per core of Intel’s CPUs was twice as much or more, so for the price of one Intel core you got 2 AMD cores that were roughly comparable for SIMD and much more capable in most other tasks (AVX2 or serial).
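
(In flops rather than bits, the same simple model, counting a Float64 FMA as 2 flops:)

# flops per cycle per core ≈ FMA ports * Float64 lanes per vector * 2
2 * 4 * 2   # Zen, 2 x 256-bit FMA ports      -> 16 flops/cycle
2 * 8 * 2   # Cascade Lake, 2 x 512-bit ports -> 32 flops/cycle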

So for a workstation I’d always go for the most cores I can get, with the widest and fastest memory channels I can get.

2 Likes
  1. I don’t know the workload. For all I know it was doing a bunch of trailing-zero counts and then converting 64-bit integers to Float64. These are both examples of SIMD operations that exist for AVX512, but do not exist for AVX2 (see the sketch after this list). AVX512 isn’t just about being 512 bits.
  2. I don’t load memory from disk, do 1 BLAS operation, and then store to disk.
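
A minimal illustration of the second example (the function name is hypothetical; the claim is that AVX512DQ provides a packed Int64 -> Float64 conversion, vcvtqq2pd, while AVX2 has no packed equivalent, so without AVX512 the compiler must emulate the conversion with a longer instruction sequence or scalarize it):

# Hypothetical hot loop converting 64-bit integers to Float64.
# With AVX512DQ this maps onto vcvtqq2pd; with only AVX2 it is far less SIMD-friendly.
function to_float!(y::Vector{Float64}, x::Vector{Int64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] = Float64(x[i])
    end
    return y
end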

The Cascade Lake 10980XE is $930 on Amazon at the moment. That is $52/core, with 4 memory channels.
The Zen 3 5950X is $1200. That is $75/core, almost 50% more per core, with 2 memory channels.
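
(The per-core arithmetic, for reference: the 10980XE has 18 cores, the 5950X has 16.)

930 / 18    # ≈ $52 per core, 4 memory channels (10980XE)
1200 / 16   # = $75 per core, 2 memory channels (5950X)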

I would be very interested in seeing how 3000- and 5000-series Ryzens compare in the SIMD benchmarks from my previous post, alongside both Tiger Lake and Cascade Lake.

Not everything is SIMD, of course. I spend a substantial amount of time running non-SIMD code myself.
For this, Zen or the M1 Macs will probably be faster – they tend to do (sometimes much) better on most benchmarks.

4 Likes

I don’t compare Ryzen to Intel’s HEDT CPUs.
I compare Threadripper to Intel’s HEDT.

Both have quad-channel memory (some Threadrippers have 8 channels, but that’s for another day).
I was talking about the prices of the AMD Threadripper 3970X vs the Intel i9-9980XE.
Back then, when the Threadripper was launched, an Intel core cost almost 2x as much.
Since the Threadripper made such a huge wave, Intel reduced the price by half for the “new” (nothing new, it is almost the same core) i9-10980XE.

I believe that if the shortage in CPU manufacturing is over by Q2/Q3, AMD will release Threadrippers with Zen 3 based cores, and they will be more than competitive (maybe 24 cores for $1000?). Better in performance, even for SIMD, and better overall. But we’ll have to see…

P.S.
The price you mentioned for the Ryzen 5950X is due to constrained manufacturing and huge demand. Its official price, if I remember correctly, is ~$800.

3 Likes

If I were you I wouldn’t buy any hardware right now; the prices are nuts, especially for GPUs. Or buy used stuff and upgrade in a year or so.

1 Like

Can’t believe I’ve missed this thread for so many days. I’m also in this boat. I agree with 4GB of RAM per core. That’s actually more than necessary but it will allow you to be loose with memory management while still getting the job done. I would go with Ryzen or Threadripper. How many cores is really up to you, but the 5950X looks good to me. If you were to need more than that, it is probably good to just put it on a cluster where you can really scale it up if needed, as others have said.

If you need to use GPU, I would just use a cloud service rather than pay the current prices.

3 Likes

Something that I just remembered: it is also worthwhile in this context to get a UPS with enough capacity under full load to reach the next checkpoint that saves results, and then allow an orderly shutdown.

And a power plug lock or a lockable wire mesh bolted to the wall, so that the power cannot be unplugged accidentally (e.g. by cleaning staff plugging in a vacuum cleaner; this actually happened).

4 Likes

I’m not sure how often, but it seems I lose power for a few seconds at least once every couple months. I have to reset a few digital clocks, but thanks to my UPS my desktop and any unsaved work or ongoing simulation is fine.

5 Likes

Good thoughts! I’ve never thought of this. Will have to pick one up

1 Like