How to choose a workstation for optimal performance

Aside from AVX512, the biggest improvement in Rocket Lake over the previous generation is that its integrated graphics are much better.

Still worth pointing out that my Tiger Lake laptop has an 1165G7 which comes with 96 execution units, vs 32 in Rocket Lake. Other than meaning the laptop is 3x faster for graphics, I’m not sure what the ramifications are.
Does it matter if you don’t play games or use oneAPI.jl? Will you have problems streaming videos?

Looking online, if FP32 throughput is an indicator:
96 EU of Xe graphics: 1_690 GFLOPS
GTX 650 (which seems to sell for $60-$100): 812.5 GFLOPS
Presumably the 32 EU in Rocket Lake will get you around 560 GFLOPS.
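
(The ~560 figure is just a back-of-the-envelope estimate, assuming FP32 throughput scales linearly with EU count:)

1_690 * 32 / 96   # ≈ 563 GFLOPS for 32 EU, if throughput is linear in EU count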

Also on AVX512, it’s worth pointing out that there is segmentation in throughput between Intel’s HEDT/high-end server chips and their low-end server and consumer chips.
The chips can see a huge benefit from AVX512, e.g. in AnandTech’s test the 11700K, at 290 W power draw, was more than 5x faster than the competition.

For example:

Matmul benchmarks on the 10980XE-HEDT (compilation, 32x32, 48x48, 72x72)
julia> using LoopVectorization, BenchmarkTools

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.492570 seconds (2.48 M allocations: 132.454 MiB, 9.66% gc time, 99.92% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
 10.613965 seconds (19.79 M allocations: 1.114 GiB, 4.63% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     488.267 ns (0.00% GC)
  median time:      513.287 ns (0.00% GC)
  mean time:        513.106 ns (0.00% GC)
  maximum time:     955.810 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     195

julia> 2e-9M*K*N/488.267e-9
134.22164512449132

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000032 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.107 μs (0.00% GC)
  median time:      1.158 μs (0.00% GC)
  mean time:        1.161 μs (0.00% GC)
  maximum time:     6.008 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.107e-6
199.80487804878052

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000070 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.894 μs (0.00% GC)
  median time:      1.982 μs (0.00% GC)
  mean time:        1.984 μs (0.00% GC)
  maximum time:     9.586 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.894e-6
394.13727560718064
Matmul benchmarks on the 1165G7-laptop (compilation, 32x32, 48x48, 72x72)
julia> @time using LoopVectorization
  1.555713 seconds (2.88 M allocations: 169.122 MiB, 4.68% gc time)

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.411346 seconds (2.47 M allocations: 131.872 MiB, 10.45% gc time, 99.94% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
  8.947281 seconds (19.31 M allocations: 1.087 GiB, 4.63% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     657.172 ns (0.00% GC)
  median time:      708.396 ns (0.00% GC)
  mean time:        713.941 ns (0.00% GC)
  maximum time:     5.318 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     169

julia> 2e-9M*K*N/657.172e-9
99.72427309745395

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000047 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.742 μs (0.00% GC)
  median time:      1.931 μs (0.00% GC)
  mean time:        1.932 μs (0.00% GC)
  maximum time:     5.711 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.742e-6
126.97129735935708

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.003418 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.222 μs (0.00% GC)
  median time:      3.493 μs (0.00% GC)
  mean time:        3.491 μs (0.00% GC)
  maximum time:     9.560 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> 2e-9M*K*N/3.222e-6
231.6871508379889
Apple M1 Native
julia> using LoopVectorization, BenchmarkTools
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.292382 seconds (2.47 M allocations: 131.780 MiB, 8.28% gc time, 99.92% compilation time)

julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                  Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
  5.170382 seconds (16.66 M allocations: 948.861 MiB, 5.60% gc time)
true

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.421 μs (0.00% GC)
  median time:      1.429 μs (0.00% GC)
  mean time:        1.434 μs (0.00% GC)
  maximum time:     2.754 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> 2e-9M*K*N/1.421e-6
46.11963406052076

julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000045 seconds (2 allocations: 18.078 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.833 μs (0.00% GC)
  median time:      3.990 μs (0.00% GC)
  mean time:        3.998 μs (0.00% GC)
  maximum time:     7.911 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> 2e-9M*K*N/3.833e-6
57.705191755804854

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000278 seconds (2 allocations: 40.578 KiB)

julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.100 μs (0.00% GC)
  median time:      6.650 μs (0.00% GC)
  mean time:        6.662 μs (0.00% GC)
  maximum time:     14.750 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> 2e-9M*K*N/6.1e-6
122.37639344262297

julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 8

The laptop compiles faster (8.94 vs 10.6 seconds “time to first matmul”), but runs markedly slower, e.g. 231.7 GFLOPS vs 394 GFLOPS for 72x72 matrices.
Using the master branch of LinuxPerf (I omitted a warmup run), 10980XE (HEDT, Cascade Lake):

julia> using LinuxPerf

julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.34e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.32e+09   75.0%  #  2.5 insns per cycle
│ branch-instructions      2.13e+08   75.0%  #  2.6% of instructions
└ branch-misses            1.90e+06   75.0%  #  0.9% of branch instructions
┌ task-clock               8.18e+08  100.0%  # 817.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.97e+08   25.0%  # 26.3% of dcache loads
│ L1-dcache-loads          2.65e+09   25.0%
└ L1-icache-load-misses    3.87e+04   25.0%
┌ dTLB-load-misses         0.00e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.65e+09   25.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1165G7 (laptop, Tiger Lake):

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               5.36e+09   50.1%  #  4.0 cycles per ns
┌ instructions             8.25e+09   75.1%  #  1.5 insns per cycle
│ branch-instructions      1.98e+08   75.1%  #  2.4% of instructions
└ branch-misses            2.92e+06   75.1%  #  1.5% of branch instructions
┌ task-clock               1.35e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.85e+08   75.1%  # 22.2% of dcache loads
│ L1-dcache-loads          2.63e+09   75.1%
└ L1-icache-load-misses    2.77e+04   75.1%
┌ dTLB-load-misses         1.01e+03   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.67e+09   24.9%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

While the clock speed is similar, the 10980XE hits 2.5 instructions per clock, vs just 1.5 on the laptop.
Tiger Lake (and Rocket Lake) actually have larger reorder buffers, and probably better branch predictors etc., than Cascade Lake. However, they have just a single port capable of performing many common 512-bit instructions, like the fused multiply add, while Cascade Lake has 2.
The link shows Cascade Lake can use ports 0 or 5, while Ice Lake can only use port 0. As a result, the reciprocal throughput is 0.5 for Cascade Lake, and 1 for Ice Lake.
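
As a rough sanity check (a back-of-the-envelope sketch, assuming the ~4.1 and ~4.0 cycles/ns reported by LinuxPerf above and the 4 threads used), the FMA port counts alone predict roughly the gap observed at 72x72:

# peak GFLOPS ≈ threads * FMA ports * (8 Float64 lanes per 512-bit vector) * (2 flops per FMA) * GHz
4 * 2 * 8 * 2 * 4.1   # Cascade Lake 10980XE: ≈ 525 GFLOPS peak, ~394 observed
4 * 1 * 8 * 2 * 4.0   # Tiger Lake 1165G7:    = 256 GFLOPS peak, ~232 observed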

But linear algebra/matmul tends to be an extreme case. The laptop fares much better in these special-function benchmarks, for example:

Cascade Lake (HEDT)
julia> using VectorizationBase, SLEEFPirates

julia>  vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<8.798383134546386, 9.022828046665666, 7.595386605047971, 8.903364923350454, 1.439724621424312, 8.799483255120942, 7.529824692778755, 9.398678780573114>
Vec{8, Float64}<1.0919972116624876, 8.5262997763817, 1.3898563445399836, 3.1224598343675214, 5.264211844189135, 4.618134635075415, 0.09844041961554195, 8.096211429945946>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.034 ns (0.00% GC)
  median time:      8.296 ns (0.00% GC)
  mean time:        8.263 ns (0.00% GC)
  maximum time:     24.097 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.918 ns (0.00% GC)
  median time:      5.999 ns (0.00% GC)
  mean time:        6.023 ns (0.00% GC)
  maximum time:     22.572 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.931 ns (0.00% GC)
  median time:      4.958 ns (0.00% GC)
  mean time:        4.974 ns (0.00% GC)
  maximum time:     20.548 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.426 ns (0.00% GC)
  median time:      17.515 ns (0.00% GC)
  mean time:        17.544 ns (0.00% GC)
  maximum time:     34.147 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
Tiger Lake (laptop)
julia> using VectorizationBase, SLEEFPirates

julia>  vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<3.3090933998106964, 3.2408725486003176, 9.336803901394287, 0.920913844633362, 8.254820718984012, 5.840193768054263, 4.970561992905824, 3.856892363553288>
Vec{8, Float64}<2.4728363240217877, 4.159144687195548, 4.057230309919828, 2.8915047878074995, 3.2782916658210492, 3.5649015726035516, 5.126827865591639, 4.177712129409485>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.281 ns (0.00% GC)
  median time:      8.680 ns (0.00% GC)
  mean time:        8.731 ns (0.00% GC)
  maximum time:     87.352 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.542 ns (0.00% GC)
  median time:      6.999 ns (0.00% GC)
  mean time:        6.925 ns (0.00% GC)
  maximum time:     22.851 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.542 ns (0.00% GC)
  median time:      6.860 ns (0.00% GC)
  mean time:        6.903 ns (0.00% GC)
  maximum time:     22.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.802 ns (0.00% GC)
  median time:      16.948 ns (0.00% GC)
  mean time:        16.948 ns (0.00% GC)
  maximum time:     32.268 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
Threadripper 1950X (Zen1 Ryzen)
julia> using VectorizationBase, SLEEFPirates, BenchmarkTools

julia> vu = Vec(10 .* rand(16)...)
4 x Vec{4, Float64}
Vec{4, Float64}<2.9854020761951205, 0.5850049491494591, 8.048794496707682, 2.4447908274683128>
Vec{4, Float64}<8.989862723010972, 3.432180061119565, 5.218481943492845, 1.2908407939196165>
Vec{4, Float64}<7.147163281833258, 3.1638484401188416, 1.7880777724675267, 8.886571714674718>
Vec{4, Float64}<8.27759419030089, 3.093939597569817, 7.528383369444482, 1.3747962690374105>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.613 ns (0.00% GC)
  median time:      62.459 ns (0.00% GC)
  mean time:        62.445 ns (0.00% GC)
  maximum time:     123.043 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     982

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.950 ns (0.00% GC)
  median time:      31.336 ns (0.00% GC)
  mean time:        31.293 ns (0.00% GC)
  maximum time:     60.194 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.472 ns (0.00% GC)
  median time:      32.920 ns (0.00% GC)
  mean time:        32.892 ns (0.00% GC)
  maximum time:     85.069 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.570 ns (0.00% GC)
  median time:      58.309 ns (0.00% GC)
  mean time:        58.315 ns (0.00% GC)
  maximum time:     98.537 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983
Apple M1 Native
julia> using VectorizationBase, SLEEFPirates

julia> vu = Vec(10 .* rand(16)...)
8 x Vec{2, Float64}
Vec{2, Float64}<8.349313127641388, 7.039690217372789>
Vec{2, Float64}<6.433077972087, 2.863664951775906>
Vec{2, Float64}<1.4755302144925642, 8.827766753134048>
Vec{2, Float64}<6.467495205043918, 7.489247759558242>
Vec{2, Float64}<1.755162644186694, 2.895612011356581>
Vec{2, Float64}<8.843342380245545, 9.290792214755223>
Vec{2, Float64}<6.224625954236844, 1.4039314108511536>
Vec{2, Float64}<5.011570176713553, 1.2622980565493647>

julia> @benchmark log($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.096 ns (0.00% GC)
  median time:      38.223 ns (0.00% GC)
  mean time:        38.389 ns (0.00% GC)
  maximum time:     56.409 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark exp($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.040 ns (0.00% GC)
  median time:      20.248 ns (0.00% GC)
  mean time:        20.322 ns (0.00% GC)
  maximum time:     32.899 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.790 ns (0.00% GC)
  median time:      19.956 ns (0.00% GC)
  mean time:        19.975 ns (0.00% GC)
  maximum time:     42.877 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.631 ns (0.00% GC)
  median time:      34.799 ns (0.00% GC)
  mean time:        34.865 ns (0.00% GC)
  maximum time:     47.530 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 8

You could run these benchmarks on your own system to get an idea of performance changes. Rocket Lake should of course be faster than Tiger Lake (similar arch, faster clock speed).
E.g., sincos on Tiger Lake:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+09   50.0%  #  4.4 cycles per ns
┌ instructions             8.26e+09   75.2%  #  2.8 insns per cycle
│ branch-instructions      1.24e+09   75.2%  # 15.1% of instructions
└ branch-misses            1.60e+06   75.2%  #  0.1% of branch instructions
┌ task-clock               6.79e+08  100.0%  # 678.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.62e+07   75.3%  #  3.2% of dcache loads
│ L1-dcache-loads          2.39e+09   75.3%
└ L1-icache-load-misses    4.88e+04   75.3%
┌ dTLB-load-misses         1.62e+06   24.7%  #  0.1% of dTLB loads
└ dTLB-loads               2.35e+09   24.7%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cascade Lake:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.18e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.27e+09   75.0%  #  2.6 insns per cycle
│ branch-instructions      1.24e+09   75.0%  # 15.0% of instructions
└ branch-misses            1.49e+06   75.0%  #  0.1% of branch instructions
┌ task-clock               7.83e+08  100.0%  # 783.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.56e+07   25.0%  #  3.1% of dcache loads
│ L1-dcache-loads          2.45e+09   25.0%
└ L1-icache-load-misses    3.03e+05   25.0%
┌ dTLB-load-misses         2.41e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.45e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Tiger Lake had both higher IPC and clock speeds. This is despite the fact that all of these special functions are also full of fma instructions (polynomials) – just not quite as full as matmul is.
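
(For reference, the derived figures in these @pstats summaries follow directly from the raw counters; using the Tiger Lake sincos run above, with task-clock reported in nanoseconds:)

8.26e9 / 2.98e9   # instructions / cpu-cycles ≈ 2.8 insns per cycle
2.98e9 / 6.79e8   # cpu-cycles / task-clock   ≈ 4.4 cycles per ns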

Looking at the generated code of sincos, I think I could do a better job optimizing it for AVX512.
log and exp2 are both probably much faster than what you can get without AVX512.
But the best I can do benchmark-wise for comparison is Haswell, which came out in 2013. I’d be interested in comparisons with Zen 2 and Zen 3.

Note that the Ryzen (3/5)900X and (3/5)950X have 64 MiB of L3 cache. Even though they have only 2 memory channels, hopefully you could work around this by chunking your data and making use of the L3.
You can fit a lot of data in 64 MiB. Many compute heavy workloads would fit in it entirely.

When talking BLAS,

julia> sqrt(64 * (1<<20) / 8 / 3)
1672.184997739983

You can fit three 1600x1600 matrices inside 64 MiB.
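
As a very rough illustration of what that chunking could look like, here is a minimal sketch (blocked_mul! and blk are hypothetical names, and a real BLAS does far more sophisticated packing and blocking than this; in practice you would likely pick a block size somewhat smaller than 1600 so the blocks being touched fit comfortably in L3 alongside everything else):

using LinearAlgebra

# Minimal L3-oriented tiling sketch: update C one block at a time so the C block
# and the A/B panels it needs can stay cache-resident while they are reused.
function blocked_mul!(C, A, B; blk = 1600)
    M, N = size(C); K = size(A, 2)
    fill!(C, zero(eltype(C)))
    for j in 1:blk:N, k in 1:blk:K, i in 1:blk:M
        ir = i:min(i + blk - 1, M)
        jr = j:min(j + blk - 1, N)
        kr = k:min(k + blk - 1, K)
        # C[ir, jr] += A[ir, kr] * B[kr, jr]; 5-arg mul! avoids temporaries
        @views mul!(C[ir, jr], A[ir, kr], B[kr, jr], true, true)
    end
    return C
end

Whether this pays off depends on how the surrounding code reuses the blocks; the point is only that 64 MiB is enough to keep sizeable working sets on-die despite the 2 memory channels.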

I hope one day we’ll be able to buy something like the A64FX, which has 1 TB/second memory bandwidth, many times more than any x64 CPU.

Many models fit with MCMC, or models involving ODEs, can be extremely compute intensive without requiring that much memory.

8 Likes

And make sure to get 4 sticks (I just heard of someone getting a quad channel system with two RAM sticks and they were surprised they didn’t get the expected bandwidth).

2 Likes

You give a lot of information, but I think it is better to recap.

  1. The AnandTech AVX test was discussed there in detail. It is faulty. There is no case where AVX512 with a single port can be that much better than Zen 3. It seems the code path there is only really optimized for AVX in the AVX512 case. It is an outlier, an extreme one.
  2. The problem with your logic about the L3 is that you assume the data is already there. But it is not there if it is a BLAS operation on data which lives in memory and is large. L3, or any cache for that matter, helps in one of 2 cases:
    • The data is already in cache from previous calculations.
    • Some pre-caching mechanism is employed and the current pipeline of the CPU is loaded enough to allow data to reach the cache before it is needed (hence causing the previous clause to hold).
      Always remember: caches are mainly for latency. For real throughput, look at GPU memory hierarchies. There are almost no caches, just wide and fast memory. This is what’s needed in most cases for a throughput-bound workload.

Regarding SIMD capability: modern Ryzen has 2.5 units capable of AVX2 (256 bit). For FMA there are 2 execution ports, so each core can handle 512 bits/cycle (just for this simple model; in reality there is latency and throughput per instruction). Cascade Lake has 2 AVX512 ports, so each core can handle 1024 bits/cycle.
So for very, very efficient code Cascade Lake can have a factor-of-2 advantage. But in practice, in the previous generation the cost per core of Intel’s CPUs was twice as much or more, so for the price of one Intel core you got 2 AMD cores that were roughly comparable for SIMD and much more capable in most other tasks (AVX2 or serial).
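
(In flops rather than bits, the same simple model, counting a Float64 FMA as 2 flops:)

# flops per cycle per core ≈ FMA ports * Float64 lanes per vector * 2
2 * 4 * 2   # Zen, 2 x 256-bit FMA ports      -> 16 flops/cycle
2 * 8 * 2   # Cascade Lake, 2 x 512-bit ports -> 32 flops/cycle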

So for a workstation I’d always go for the most cores I can get, with the widest and fastest memory channels I can get.

2 Likes
  1. I don’t know the workload. For all I know it was doing a bunch of trailing-zero counts and then converting 64-bit integers to Float64. These are both examples of SIMD operations that exist for AVX512, but do not exist for AVX2 (see the sketch after this list). AVX512 isn’t just about being 512 bits.
  2. I don’t load memory from disk, do 1 BLAS operation, and then store to disk.
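
A minimal illustration of the second example (the function name is hypothetical; the claim is that AVX512DQ provides a packed Int64 -> Float64 conversion, vcvtqq2pd, while AVX2 has no packed equivalent, so without AVX512 the compiler must emulate the conversion with a longer instruction sequence or scalarize it):

# Hypothetical hot loop converting 64-bit integers to Float64.
# With AVX512DQ this maps onto vcvtqq2pd; with only AVX2 it is far less SIMD-friendly.
function to_float!(y::Vector{Float64}, x::Vector{Int64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] = Float64(x[i])
    end
    return y
end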

The Cascade Lake 10980XE is $930 on Amazon at the moment. That is $52/core, with 4 memory channels.
The Zen 3 5950X is $1200. That is $75/core, almost 50% more per core, with 2 memory channels.
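
(The per-core arithmetic, for reference: the 10980XE has 18 cores, the 5950X has 16.)

930 / 18    # ≈ $52 per core, 4 memory channels (10980XE)
1200 / 16   # = $75 per core, 2 memory channels (5950X)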

I would be very interested in seeing how 3000- and 5000-series Ryzens compare in the SIMD benchmarks from my previous post, alongside both Tiger Lake and Cascade Lake.

Not everything is SIMD, of course. I spend a substantial amount of time running non-SIMD code myself.
For this, Zen or the M1 Macs will probably be faster – they tend to do (sometimes much) better on most benchmarks.

4 Likes

I don’t compare Ryzen to Intel’s HEDT CPUs.
I compare Threadripper to Intel’s HEDT.

Both have quad-channel memory (some Threadrippers have 8 channels, but that’s for another day).
I was talking about the prices of the AMD Threadripper 3970X vs the Intel i9-9980XE.
Back then, when the Threadripper was launched, an Intel core cost almost 2x as much.
Since the Threadripper made such a huge wave, Intel reduced the price by half for the “new” (nothing new, it is almost the same core) i9-10980XE.

I believe that if the shortage in CPU manufacturing is over by Q2/Q3, AMD will release Threadrippers with Zen 3 based cores, and they will be more than competitive (maybe 24 cores for $1000?). Better in performance, even for SIMD, and better overall. But we’ll have to see…

P.S.
The price you mentioned for the Ryzen 5950X is due to constrained manufacturing and huge demand. Its official price, if I remember correctly, is ~$800.

3 Likes

If I were you I wouldn’t buy any hardware right now; the prices are nuts, especially for GPUs. Or buy used stuff and upgrade in a year or so.

1 Like

Can’t believe I’ve missed this thread for so many days. I’m also in this boat. I agree with 4GB of RAM per core. That’s actually more than necessary but it will allow you to be loose with memory management while still getting the job done. I would go with Ryzen or Threadripper. How many cores is really up to you, but the 5950X looks good to me. If you were to need more than that, it is probably good to just put it on a cluster where you can really scale it up if needed, as others have said.

If you need to use GPU, I would just use a cloud service rather than pay the current prices.

3 Likes

Something that I just remembered: it is also worthwhile in this context to get a UPS with enough capacity under full load to reach the next checkpoint that saves results, and then allow an orderly shutdown.

And a power plug lock or a lockable wire mesh bolted to the wall, so that the power cannot be unplugged accidentally (e.g. by cleaning staff plugging in a vacuum cleaner; this actually happened).

4 Likes

I’m not sure how often, but it seems I lose power for a few seconds at least once every couple months. I have to reset a few digital clocks, but thanks to my UPS my desktop and any unsaved work or ongoing simulation is fine.

5 Likes

Good thoughts! I’ve never thought of this. Will have to pick one up

1 Like