Disappointing benchmark results with AMD Threadripper PRO 3975WX 32 Cores

Hello,

I have a Lenovo ThinkStation P620 with an AMD Threadripper PRO 3975WX (32 cores, 64 threads). I installed Ubuntu 22.04.2, and I noticed that performance is not much better than, for example, that of my Dell XPS 15 laptop with an Intel i7-10750H.

So I decided to use the BaseBenchmarks.jl package to test performance on both machines. This is the code:

# -p 6 for the case of my DELL laptop
julia --project -p 32

julia> using BaseBenchmarks

julia> BaseBenchmarks.load!("parallel")
1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "parallel" => 1-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "remotecall" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["io", "remotecall_fetch"]
                  ("identity", 1024) => Benchmark(evals=1, seconds=5.0, samples=10000)
                  ("identity", 4096) => Benchmark(evals=1, seconds=5.0, samples=10000)
                  ("identity", 2) => Benchmark(evals=1, seconds=5.0, samples=10000)
                  ("identity", 64) => Benchmark(evals=1, seconds=5.0, samples=10000)
                  ("identity", 512) => Benchmark(evals=1, seconds=5.0, samples=10000)

julia> results = run(BaseBenchmarks.SUITE["parallel"], verbose=true);
(1/1) benchmarking "remotecall"...
  (1/5) benchmarking ("identity", 1024)...
  done (took 2.796757715 seconds)
  (2/5) benchmarking ("identity", 4096)...
  done (took 2.844155886 seconds)
  (3/5) benchmarking ("identity", 2)...
  done (took 2.847565344 seconds)
  (4/5) benchmarking ("identity", 64)...
  done (took 2.822584622 seconds)
  (5/5) benchmarking ("identity", 512)...
  done (took 2.835860633 seconds)
done (took 14.923916606 seconds)

compared to 17.42338672 seconds on my XPS laptop.

I also tried some linear algebra calculations:

####### DELL XPS 15 LAPTOP WITH i7-10750H
julia> BaseBenchmarks.SUITE["linalg"];

julia> run(BaseBenchmarks.SUITE["linalg"]["factorization"], verbose=true);
(1/32) benchmarking ("eigen", "typename(LinearAlgebra.LowerTriangular)", 256)...
done (took 5.577282505 seconds)
(2/32) benchmarking ("qr", "Matrix", 256)...
done (took 5.601520295 seconds)
(3/32) benchmarking ("svd", "typename(LinearAlgebra.UpperTriangular)", 1024)...
done (took 5.707468825 seconds)
(4/32) benchmarking ("eigen", "typename(LinearAlgebra.Diagonal)", 1024)...
done (took 4.506185221 seconds)
(5/32) benchmarking ("svd", "typename(LinearAlgebra.LowerTriangular)", 1024)...
done (took 5.744405861 seconds)
(6/32) benchmarking ("svd", "typename(LinearAlgebra.Diagonal)", 1024)...
done (took 5.601461302 seconds)
(7/32) benchmarking ("eigen", "Matrix", 256)...
done (took 5.666419938 seconds)
(8/32) benchmarking ("eigen", "typename(LinearAlgebra.UpperTriangular)", 256)...
done (took 5.900232165 seconds)
(9/32) benchmarking ("eigen", "typename(LinearAlgebra.SymTridiagonal)", 256)...
done (took 5.589953574 seconds)
(10/32) benchmarking ("eigen", "typename(LinearAlgebra.Diagonal)", 256)...
done (took 0.960068424 seconds)
(11/32) benchmarking ("schur", "Matrix", 256)...
done (took 5.649411057 seconds)
(12/32) benchmarking ("lu", "Matrix", 256)...
done (took 5.589925312 seconds)
(13/32) benchmarking ("svd", "typename(LinearAlgebra.Bidiagonal)", 256)...
done (took 5.661472392 seconds)
(14/32) benchmarking ("lu", "typename(LinearAlgebra.Tridiagonal)", 256)...
done (took 0.935210964 seconds)
(15/32) benchmarking ("cholesky", "Matrix", 256)...
done (took 4.292936646 seconds)
(16/32) benchmarking ("eigen", "typename(LinearAlgebra.Bidiagonal)", 1024)...
done (took 5.65713484 seconds)
(17/32) benchmarking ("eigen", "typename(LinearAlgebra.UpperTriangular)", 1024)...
done (took 5.631416995 seconds)
(18/32) benchmarking ("svd", "typename(LinearAlgebra.Diagonal)", 256)...
done (took 1.417589805 seconds)
(19/32) benchmarking ("eigen", "typename(LinearAlgebra.LowerTriangular)", 1024)...
done (took 5.596913195 seconds)
(20/32) benchmarking ("svd", "Matrix", 256)...
done (took 5.594872972 seconds)
(21/32) benchmarking ("eigen", "typename(LinearAlgebra.SymTridiagonal)", 1024)...
done (took 5.668368997 seconds)
(22/32) benchmarking ("eigen", "Matrix", 1024)...
done (took 6.59772988 seconds)
(23/32) benchmarking ("svd", "typename(LinearAlgebra.LowerTriangular)", 256)...
done (took 5.638113192 seconds)
(24/32) benchmarking ("eigen", "typename(LinearAlgebra.Bidiagonal)", 256)...
done (took 2.647477007 seconds)
(25/32) benchmarking ("lu", "typename(LinearAlgebra.Tridiagonal)", 1024)...
done (took 0.758005603 seconds)
(26/32) benchmarking ("cholesky", "Matrix", 1024)...
done (took 5.594054503 seconds)
(27/32) benchmarking ("qr", "Matrix", 1024)...
done (took 5.661409009 seconds)
(28/32) benchmarking ("svd", "typename(LinearAlgebra.UpperTriangular)", 256)...
done (took 5.636165226 seconds)
(29/32) benchmarking ("svd", "Matrix", 1024)...
done (took 6.000668539 seconds)
(30/32) benchmarking ("schur", "Matrix", 1024)...
done (took 6.257883476 seconds)
(31/32) benchmarking ("lu", "Matrix", 1024)...
done (took 5.633049943 seconds)
(32/32) benchmarking ("svd", "typename(LinearAlgebra.Bidiagonal)", 1024)...
done (took 5.660954732 seconds)
####### Workstation with Threadripper PRO 3975WX
julia> BaseBenchmarks.SUITE["linalg"];

julia> run(BaseBenchmarks.SUITE["linalg"]["factorization"], verbose=true);
(1/32) benchmarking ("eigen", "typename(LinearAlgebra.LowerTriangular)", 256)...
done (took 5.792936375 seconds)
(2/32) benchmarking ("qr", "Matrix", 256)...
done (took 5.82432053 seconds)
(3/32) benchmarking ("svd", "typename(LinearAlgebra.UpperTriangular)", 1024)...
done (took 5.846795333 seconds)
(4/32) benchmarking ("eigen", "typename(LinearAlgebra.Diagonal)", 1024)...
done (took 5.842180115 seconds)
(5/32) benchmarking ("svd", "typename(LinearAlgebra.LowerTriangular)", 1024)...
done (took 6.027253846 seconds)
(6/32) benchmarking ("svd", "typename(LinearAlgebra.Diagonal)", 1024)...
done (took 5.840626707 seconds)
(7/32) benchmarking ("eigen", "Matrix", 256)...
done (took 5.86221959 seconds)
(8/32) benchmarking ("eigen", "typename(LinearAlgebra.UpperTriangular)", 256)...
done (took 5.83575609 seconds)
(9/32) benchmarking ("eigen", "typename(LinearAlgebra.SymTridiagonal)", 256)...
done (took 5.827527666 seconds)
(10/32) benchmarking ("eigen", "typename(LinearAlgebra.Diagonal)", 256)...
done (took 1.49688383 seconds)
(11/32) benchmarking ("schur", "Matrix", 256)...
done (took 5.860205249 seconds)
(12/32) benchmarking ("lu", "Matrix", 256)...
done (took 5.826403242 seconds)
(13/32) benchmarking ("svd", "typename(LinearAlgebra.Bidiagonal)", 256)...
done (took 5.843292346 seconds)
(14/32) benchmarking ("lu", "typename(LinearAlgebra.Tridiagonal)", 256)...
done (took 1.231770398 seconds)
(15/32) benchmarking ("cholesky", "Matrix", 256)...
done (took 5.062861838 seconds)
(16/32) benchmarking ("eigen", "typename(LinearAlgebra.Bidiagonal)", 1024)...
done (took 5.842612846 seconds)
(17/32) benchmarking ("eigen", "typename(LinearAlgebra.UpperTriangular)", 1024)...
done (took 5.883940138 seconds)
(18/32) benchmarking ("svd", "typename(LinearAlgebra.Diagonal)", 256)...
done (took 2.152332864 seconds)
(19/32) benchmarking ("eigen", "typename(LinearAlgebra.LowerTriangular)", 1024)...
done (took 5.876088368 seconds)
(20/32) benchmarking ("svd", "Matrix", 256)...
done (took 5.866537029 seconds)
(21/32) benchmarking ("eigen", "typename(LinearAlgebra.SymTridiagonal)", 1024)...
done (took 5.871889849 seconds)
(22/32) benchmarking ("eigen", "Matrix", 1024)...
done (took 6.359308098 seconds)
(23/32) benchmarking ("svd", "typename(LinearAlgebra.LowerTriangular)", 256)...
done (took 5.837971065 seconds)
(24/32) benchmarking ("eigen", "typename(LinearAlgebra.Bidiagonal)", 256)...
done (took 3.166776022 seconds)
(25/32) benchmarking ("lu", "typename(LinearAlgebra.Tridiagonal)", 1024)...
done (took 1.005836625 seconds)
(26/32) benchmarking ("cholesky", "Matrix", 1024)...
done (took 5.834697383 seconds)
(27/32) benchmarking ("qr", "Matrix", 1024)...
done (took 5.864963557 seconds)
(28/32) benchmarking ("svd", "typename(LinearAlgebra.UpperTriangular)", 256)...
done (took 5.850566432 seconds)
(29/32) benchmarking ("svd", "Matrix", 1024)...
done (took 6.107092893 seconds)
(30/32) benchmarking ("schur", "Matrix", 1024)...
done (took 5.910521713 seconds)
(31/32) benchmarking ("lu", "Matrix", 1024)...
done (took 5.833513394 seconds)
(32/32) benchmarking ("svd", "typename(LinearAlgebra.Bidiagonal)", 1024)...
done (took 5.908529379 seconds)

How is it possible that the performance of a €3000 processor is comparable to that of a laptop processor?


Have you tried -t (or --threads) rather than -p?
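
For reference, the two flags control different things: -t sets the number of Julia threads, while -p adds Distributed worker processes. A quick way to check both (a small sketch, not from your runs):

using Distributed
@show Threads.nthreads()   # Julia threads, controlled by -t / JULIA_NUM_THREADS
@show nworkers()           # Distributed workers, controlled by -p / addprocs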

I have set JULIA_NUM_THREADS=auto on both machines and verified it with Threads.nthreads(), which returns 12 and 64 for the laptop and the workstation respectively.

Moreover, the "parallel" BenchmarkGroup works only with Distributed (and so with the -p option rather than the -t one). But I tested the "linalg" BenchmarkGroup and it showed the same performance as before.

Perhaps this would be better tested with fewer threads, as these tests may be memory bound. Also, there are actually only 32 physical cores, so it may perform better with the number of threads set to 32.

I don’t think the time it takes to run the benchmarks is what you want to look at. Have you looked at the actual results?


This! BenchmarkTools uses a fixed time limit for its data gathering: it keeps re-running each benchmark until it reaches a set number of samples or the 5 s budget. So in essence your question should be "How much faster did the individual benchmarks get?" or "How many more iterations am I running?".
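
For example, if the suite results from each machine are saved to disk, the individual estimates can be compared directly instead of the total run time (a minimal sketch; the file names are hypothetical):

using BenchmarkTools

laptop  = BenchmarkTools.load("laptop_parallel.json")[1]        # saved earlier with BenchmarkTools.save
station = BenchmarkTools.load("workstation_parallel.json")[1]   # hypothetical file names

# judge compares the minimum time of each benchmark and reports an
# improvement, regression, or invariant verdict per entry.
judge(minimum(station), minimum(laptop))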

It’s probably also better to start with benchmarks you are familiar with yourself instead of looking at all of BaseBenchmarks, which might target specific subsystems that have little to do with the performance of your chip.

I would recommend looking at JuliaPerf/STREAMBenchmark.jl (a version of the STREAM benchmark, which measures the sustainable memory bandwidth) and JuliaPerf/BandwidthBenchmark.jl (measures memory bandwidth using TheBandwidthBenchmark).
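
For example, something like this should get you started (assuming the packages are added; STREAMBenchmark.jl provides memory_bandwidth, benchmark, and scaling_benchmark, and UnicodePlots can draw the scaling curve in the terminal):

using STREAMBenchmark   # memory_bandwidth, benchmark, scaling_benchmark
using UnicodePlots      # lineplot, AsciiCanvas

memory_bandwidth(verbose = true)
y = scaling_benchmark()
lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s")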


Thanks for the suggestion.

I tried your code. This is the result for my laptop:

julia> memory_bandwidth(verbose=true)
╔══║ Multi-threaded:
╠══║ (12 threads)
╟─ COPY:  34791.5 MB/s
╟─ SCALE: 37382.4 MB/s
╟─ ADD:   28168.6 MB/s
╟─ TRIAD: 27789.1 MB/s
╟─────────────────────
║ Median: 31480.0 MB/s
╚═════════════════════
(median = 31480.0, minimum = 27789.1, maximum = 37382.4)

julia> benchmark()
╔══║ Single-threaded:
╟─ COPY:  36690.6 MB/s
╟─ SCALE: 38898.6 MB/s
╟─ ADD:   28170.5 MB/s
╟─ TRIAD: 29096.8 MB/s
╟─────────────────────
║ Median: 32893.7 MB/s
╚═════════════════════

╔══║ Multi-threaded:
╠══║ (12 threads)
╟─ COPY:  32033.9 MB/s
╟─ SCALE: 33561.6 MB/s
╟─ ADD:   26964.9 MB/s
╟─ TRIAD: 29481.3 MB/s
╟─────────────────────
║ Median: 30757.6 MB/s
╚═════════════════════

(single = (median = 32893.7, minimum = 28170.5, maximum = 38898.6), multi = (median = 30757.6, minimum = 26964.9, maximum = 33561.6))

julia> y = scaling_benchmark();

julia> lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s", border = :ascii, canvas = AsciiCanvas)
                            Bandwidth Scaling
               +----------------------------------------+
        60 000 |                                        |
               |                               .__      |
               |                           ..-/'  """"""|
               |                      ._-*"`            |
               |                  ..-/'                 |
               |            __.-*"`                     |
               |       .-*""                            |
   MB/s        |      .'                                |
               |     r`                                 |
               |   .*                                   |
               |  ./                                    |
               | .`                                     |
               |/`                                      |
               |                                        |
        20 000 |                                        |
               +----------------------------------------+
                1                                      6
                                 # cores

And this is the result for the workstation:

julia> memory_bandwidth(verbose=true)
╔══║ Multi-threaded:
╠══║ (64 threads)
╟─ COPY:  22000.3 MB/s
╟─ SCALE: 21805.1 MB/s
╟─ ADD:   29679.3 MB/s
╟─ TRIAD: 28922.7 MB/s
╟─────────────────────
║ Median: 25461.5 MB/s
╚═════════════════════
(median = 25461.5, minimum = 21805.1, maximum = 29679.3)

julia> benchmark()
╔══║ Single-threaded:
╟─ COPY:  133127.7 MB/s
╟─ SCALE: 153588.8 MB/s
╟─ ADD:   122452.9 MB/s
╟─ TRIAD: 161538.1 MB/s
╟─────────────────────
║ Median: 143358.2 MB/s
╚═════════════════════

╔══║ Multi-threaded:
╠══║ (64 threads)
╟─ COPY:  21736.2 MB/s
╟─ SCALE: 22339.9 MB/s
╟─ ADD:   28752.6 MB/s
╟─ TRIAD: 30396.7 MB/s
╟─────────────────────
║ Median: 25546.3 MB/s
╚═════════════════════

(single = (median = 143358.2, minimum = 122452.9, maximum = 161538.1), multi = (median = 25546.3, minimum = 21736.2, maximum = 30396.7))

julia> y = scaling_benchmark();

julia> lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s", border = :ascii, canvas = AsciiCanvas)
                             Bandwidth Scaling
                +----------------------------------------+
        400 000 |                                        |
                |                                        |
                |                       .   .            |
                |               _.._..-*\.-/\---"`       |
                |             ./ "'      `               |
                |          .r"'                          |
                |      .  /`                             |
   MB/s         |     /\./                               |
                |    .`                                  |
                | .r"`                                   |
                | /                                      |
                | `                                      |
                |                                        |
                |                                        |
              0 |                                        |
                +----------------------------------------+
                 0                                     40
                                  # cores

The memory bandwidth seems faster on the Threadripper workstation, which sounds good. I should try other CPU benchmarks, perhaps some BLAS/LAPACK routines and native multithreaded operations.
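
For instance, a rough sanity check of the multi-threaded BLAS throughput could be LinearAlgebra.peakflops (just a quick estimate, not a proper benchmark):

using LinearAlgebra

BLAS.set_num_threads(Sys.CPU_THREADS)   # let BLAS use all hardware threads
LinearAlgebra.peakflops(4000)           # estimated FLOP/s for a 4000×4000 Float64 matrix multiply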

OK, I implemented the following benchmark code:

using BenchmarkTools
using LinearAlgebra
using Distributed  # for @everywhere / @distributed (loaded automatically when Julia is started with -p)

# Some heavy function (a Lanczos-style approximation of the gamma function)
@everywhere function gamma(x)
    p = [0.99999999999980993, 676.5203681218851, -1259.1392167224028,
         771.32342877765313, -176.61502916214059, 12.507343278686905,
         -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7]
    y = x
    if y < 0.5
        return pi / (sin(pi * y) * gamma(1 - y))
    else
        y -= 1
        t = p[1]
        for i in 2:length(p)
            t += p[i] / (y + i)
        end
        z = y + length(p) - 0.5
        return sqrt(2 * pi) * z^(y + 0.5) * exp(-z) * t
    end
end


# Set global variables
array_size = 2000
vector_size = 10^5
max_time = 5

# BLAS operation
println("BLAS operation:")
A = rand(array_size, array_size)
res = A * A # warm up
@btime $A * $A seconds=max_time

# LAPACK operation
println("LAPACK operation:")
B = rand(array_size, array_size)
res = eigvals(B) # warm up
@btime eigvals($B) seconds=max_time

# Multithreading
println("Multithreading:")
function threaded_dot(x, y)
    s = Threads.Atomic{Float64}(0.0)
    @assert length(x) == length(y)
    Threads.@threads for i in eachindex(x)
        Threads.atomic_add!(s, x[i] * y[i] * sin(x[i])^2 * gamma(y[i]))
    end
    return s[]
end
x = rand(vector_size)
y = rand(vector_size)
res = threaded_dot(x, y) # warm up
@btime threaded_dot($x, $y) seconds=max_time

# Distributed computing
println("Distributed computing:")
@everywhere function distributed_dot(x, y)
    @assert length(x) == length(y)
    s = @distributed (+) for i in eachindex(x)
        x[i] * y[i] * sin(x[i])^2 * gamma(y[i])
    end
    return s
end
x = rand(vector_size)
y = rand(vector_size)
res = distributed_dot(x, y) # warm up
@btime distributed_dot($x, $y) seconds=max_time

The results on my laptop:

alberto@XPS-NIKTEN:~$ julia -p 6 -t 6 benchmark.jl
BLAS operation:
  109.859 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  3.147 s (17 allocations: 31.16 MiB)
Multithreading:
  6.553 ms (150108 allocations: 18.32 MiB)
Distributed computing:
  5.391 ms (549 allocations: 22.27 KiB)

alberto@XPS-NIKTEN:~$ julia -p 2 -t 2 benchmark.jl
BLAS operation:
  104.162 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  2.885 s (17 allocations: 31.16 MiB)
Multithreading:
  6.428 ms (150155 allocations: 18.33 MiB)
Distributed computing:
  5.414 ms (187 allocations: 7.59 KiB)

The results on my workstation:

alberto@athena-ThinkStation-P620:~$ julia -p 32 -t 32 benchmark.jl
BLAS operation:
  23.039 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  1.659 s (17 allocations: 31.16 MiB)
Multithreading:
  4.120 ms (150133 allocations: 18.32 MiB)
Distributed computing:
  13.076 ms (2888 allocations: 113.95 KiB)

alberto@athena-ThinkStation-P620:~$ julia -p 2 -t 2 benchmark.jl
BLAS operation:
  23.188 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  1.667 s (17 allocations: 31.16 MiB)
Multithreading:
  11.836 ms (150223 allocations: 18.34 MiB)
Distributed computing:
  5.159 ms (187 allocations: 7.59 KiB)

The workstation seems slightly better.
Oddly enough, changing the number of threads (-t) or of worker processes (-p) gives quite similar, or even worse, performance.

And by implementing a heavier function that includes quadrature integration, I can finally see some differences:

using BenchmarkTools
using LinearAlgebra
using Distributed  # for @everywhere / @distributed (loaded automatically when Julia is started with -p)
@everywhere using SpecialFunctions
@everywhere using QuadGK


# Set global variables
array_size = 2000
vector_size = 10^5
max_time = 20

# BLAS operation
println("BLAS operation:")
A = rand(array_size, array_size)
res = A * A # warm up
@btime $A * $A seconds=max_time

# LAPACK operation
println("LAPACK operation:")
B = rand(array_size, array_size)
res = eigvals(B) # warm up
@btime eigvals($B) seconds=max_time

# Multithreading
println("Multithreading:")
function threaded_dot(x, y)
    s = Threads.Atomic{Float64}(0.0)
    @assert length(x) == length(y)
    Threads.@threads for i in eachindex(x)
        result, err = quadgk(zeta, 1.1, x[i]+1.1)
        Threads.atomic_add!(s, x[i] * y[i] * result)
    end
    return s[]
end
x = rand(vector_size)
y = rand(vector_size)
res = threaded_dot(x, y) # warm up
@btime threaded_dot($x, $y) seconds=max_time

# Distributed computing
println("Distributed computing:")
@everywhere function distributed_dot(x, y)
    @assert length(x) == length(y)      
    s = @distributed (+) for i in eachindex(x)
        result, err = quadgk(zeta, 1.1, x[i]+1.1)
        x[i] * y[i] * result
    end
    return s
end
x = rand(vector_size)
y = rand(vector_size)
res = distributed_dot(x, y) # warm up
@btime distributed_dot($x, $y) seconds=max_time

My laptop:

(base) alberto@XPS-NIKTEN:~$ julia -p 6 -t 6 benchmark.jl 
BLAS operation:
  97.135 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  2.783 s (17 allocations: 31.16 MiB)
Multithreading:
  191.017 ms (159530 allocations: 27.99 MiB)
Distributed computing:
  205.408 ms (540 allocations: 21.24 KiB)


(base) alberto@XPS-NIKTEN:~$ julia -p 2 -t 2 benchmark.jl 
BLAS operation:
  103.821 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  2.555 s (17 allocations: 31.16 MiB)
Multithreading:
  427.550 ms (159066 allocations: 27.91 MiB)
Distributed computing:
  430.130 ms (183 allocations: 7.21 KiB)

My workstation:

(base) alberto@athena-ThinkStation-P620:~$ julia -p 32 -t 32 benchmark.jl
BLAS operation:
  22.929 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  1.650 s (17 allocations: 31.16 MiB)
Multithreading:
  27.753 ms (159578 allocations: 27.99 MiB)
Distributed computing:
  54.550 ms (2839 allocations: 108.51 KiB)


(base) alberto@athena-ThinkStation-P620:~$ julia -p 2 -t 2 benchmark.jl
BLAS operation:
  23.234 ms (2 allocations: 30.52 MiB)
LAPACK operation:
  1.665 s (17 allocations: 31.16 MiB)
Multithreading:
  411.314 ms (159312 allocations: 27.95 MiB)
Distributed computing:
  405.329 ms (184 allocations: 7.24 KiB)

The two-core performance is very similar between the two machines, which is strange. But I can see the difference when I use all the cores of the Threadripper.

If you look at single-core performance, the Threadripper is not much better, which is by design. And that explains many of the benchmark results.

From our experience, it is essential to force BLAS to use only one thread, otherwise the results are very bad. You simply add the following line at the top of your script:
BLAS.set_num_threads(1)
Then the behaviour of the parallelism is predictable and in agreement with expectations.
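
For example, a minimal sketch of this setup:

using LinearAlgebra

BLAS.set_num_threads(1)        # keep BLAS single-threaded to avoid oversubscription with Julia threads/workers
@show BLAS.get_num_threads()   # confirm the setting
@show Threads.nthreads()       # Julia-level threads (set with -t) remain available for your own @threads loops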
