Help wanted: benchmarking multi-threaded CPU performance

Putting this in Offtopic as it’s not directly a Julia question: I recently asked on here about a fairly straightforward number-crunching problem and, as usual, got some great solutions to it, the fastest of which beat my own implementation by 1,000x…

Now I just got a new laptop which I expected to run this problem much faster than my old one, but I was disappointed to find the speedup smaller than hoped. There’s a possibility of setting up a dedicated workstation for solving this type of problem regularly, and given the slightly disappointing speedup from my change in CPU, I’m trying to work out whether there are alternative CPUs that would do significantly better on this workload.

I tried looking into AWS, Azure, etc., but for the most part they don’t seem to offer workstation-type processors like i9/Xeon or Ryzen/Threadripper, so I’m asking for help here: if you’ve got a reasonably current high-performance CPU, could you run the script below (on all available threads) and post the vector of 9 numbers it produces, together with your CPU model?

Here’s my 20 × 13th Gen Intel(R) Core(TM) i7-13800H:

$ julia --threads=auto

julia> include("cpu_test.jl")
[28.7, 28.6, 27.8, 27.3, 26.1, 27.0, 26.6, 26.4, 23.8]

On that CPU, the main loop that produces the numbers takes about 30 seconds to run, so the whole script hopefully shouldn’t take too long even with installing the few small packages that it relies on.

Code below the fold:

cpu_test.jl
using Pkg
Pkg.activate(; temp=true)
Pkg.add(["Chairmarks", "Combinatorics", "StatsBase"])
using Chairmarks, Combinatorics, StatsBase, Random

function mkbitmatrix(selections)
    # One 64-bit column per selection; bit c is set iff number c was picked.
    n = length(selections)
    P = 64
    res = falses(P, n)

    for (i, c) in enumerate(selections)
        res[c, i] .= true
    end

    res
end

function matchmeXT(actual_selections, possible_selections, ::Val{p}) where p
    # out[k+1, i] counts how many actual selections share exactly k numbers
    # with possible selection i.
    out = zeros(Int32, p+1, size(possible_selections, 2))
    mask = ~((-1%UInt64) << 10)   # low 10 bits
    @inbounds Threads.@threads for i = 1:length(possible_selections.chunks)
        a = possible_selections.chunks[i]

        # Process the actuals in blocks of 1023 so the p+1 packed 10-bit
        # counters in `tmp` cannot overflow.
        for jc = 1:1023:length(actual_selections.chunks)
            tmp = 0

            for j = jc:min(jc+1023-1, length(actual_selections.chunks))
                b = actual_selections.chunks[j]
                hits = count_ones(a & b)        # shared numbers, 0..p
                tmp += 1 << ((hits*10) & 63)    # bump the 10-bit bin for `hits`
            end #j

            # Unpack the packed counters into the output matrix.
            for k = 0:p
                out[k+1, i] += (tmp >>> (k*10)) & mask
            end
        end
    end

    out
end

function get_time_one(P, p)
    Q = binomial(P, p)

    all_selections_iterator = multiset_combinations(1:P, p) # iterator over all combinations
    actual_selections = unique([sort(sample(1:P, p, replace = false)) for _ ∈ 1:Q÷10])
    actuals = mkbitmatrix(actual_selections)
    possibilities = mkbitmatrix(all_selections_iterator)

    x = @b matchmeXT(actuals, possibilities, Val(p))

    return (; P, p, Q, n_selected = length(actual_selections), t = x.time, n_comparisons = Q*length(actual_selections),
        bn_n_per_s = round(Q*length(actual_selections) / x.time / 1e9, digits = 1))
end

function get_time_many(sizes)
    [get_time_one(x, 6) for x ∈ shuffle(sizes)]
end

print(getfield.(get_time_many(20:2:36), :bn_n_per_s))
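
In case the bit-twiddling is opaque: matchmeXT packs the p+1 histogram bins into 10-bit fields of a single 64-bit accumulator, so each popcount result is tallied with one integer add. A minimal standalone illustration (the hits values here are made up for the demo):

```julia
# Packed-counter trick from matchmeXT: each `hits` value (0..6) bumps its
# own 10-bit field inside a single 64-bit accumulator.
hits_stream = (0, 2, 2, 6, 0)                       # pretend popcount results
tmp = sum(1 << (hits * 10) for hits in hits_stream)
mask = ~((-1 % UInt64) << 10)                       # low 10 bits
counts = [(tmp >>> (k * 10)) & mask for k in 0:6]
# counts == [2, 0, 2, 0, 0, 0, 1]: two draws with 0 hits, two with 2, one with 6
```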

I did a couple of tests on my Ryzen 7 5700X (8 cores, 16 threads). Obviously not a Threadripper, but it does share its architecture with the Threadripper Pro 5000 series. It may or may not matter, but my RAM runs at 3200 MT/s.
Full 16 threads:
[38.0, 34.6, 37.1, 36.6, 33.8, 36.5, 34.3, 35.0, 35.8]
12 threads:
[32.0, 31.6, 32.2, 29.7, 31.9, 31.6, 30.4, 32.2, 32.7]
8 threads:
[29.7, 35.2, 28.5, 34.4, 28.2, 35.2, 29.0, 30.3, 28.8]
4 threads:
[19.7, 19.6, 19.4, 17.6, 18.2, 19.1, 19.1, 18.8, 18.6]
2 threads:
[9.8, 9.9, 10.1, 7.7, 9.9, 9.9, 9.8, 10.0, 10.0]
1 thread:
[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.9, 5.0, 5.1]


AMD Ryzen™ 7 7840U on battery in performance mode on Ubuntu 24.04

[58.5, 58.3, 60.7, 54.3, 60.8, 60.6, 62.2, 59.6, 63.5]

RAM at 5600 MT/s.

UPDATE:
Same laptop, but grid connected:

[81.7, 72.3, 68.0, 69.6, 73.8, 68.5, 63.4, 67.0, 73.6]


On an AMD Epyc 9554P (64 physical cores):

1 thread:
[3.1, 6.8, 7.0, 7.4, 7.1, 7.1, 7.2, 7.1, 7.1]

30 threads:
[104.7, 211.3, 202.2, 191.0, 202.6, 168.3, 194.8, 202.8, 202.3]

60 threads:
[207.2, 252.4, 175.5, 92.8, 256.4, 269.9, 253.9, 253.7, 282.1]

RAM at 4800 MT/s.

EDIT:

60 threads but with @carstenbauer’s ThreadPinning.jl pinning cores:
[366.6, 255.2, 357.4, 397.2, 189.7, 395.5, 367.0, 391.0, 399.6]

That is seriously stunning. What an amazing deal. Two lines of code to set and forget for such huge gains.
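
For anyone following along, the two lines are presumably along these lines, assuming ThreadPinning.jl’s pinthreads (Linux-only; the best pinning strategy symbol may differ on your machine):

```julia
using ThreadPinning          # external package; add with Pkg.add("ThreadPinning")
pinthreads(:cores)           # pin each Julia thread to a separate physical core
```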


Thanks all, I’m surprised by how bad my CPU is!

julia> print(getfield.(get_time_many(20:2:36), :bn_n_per_s))
[33.2, 51.0, 52.0, 56.1, 48.3, 43.4, 45.6, 48.6, 49.9]
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, rocketlake)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)

Probably too small for a workstation, but here’s the Ryzen 7 7800X3D (8 cores with SMT, thus 16 threads):
[80.3, 79.3, 82.7, 76.6, 77.1, 80.3, 78.5, 82.7, 82.1]

julia> print(getfield.(get_time_many(20:2:36), :bn_n_per_s))
[71.1, 76.5, 70.7, 73.5, 68.5, 67.8, 71.6, 71.0, 74.0]
julia> versioninfo()
Julia Version 1.11.0-beta1
Commit 08e1fc0abb (2024-04-10 08:40 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)

That’s surprisingly slow compared to the Ryzen 7s in this thread, or do I just not understand how Ryzen naming conventions work?


Did you give Julia 20 threads when you ran your version? Intel CPUs from the 12th gen onward have a heterogeneous core design, and in my experience with my mobile 12th-gen CPU (i7-1260P) the E-cores are slightly worse than useless.

From a quick Google search, your chip has 6 performance cores. I bet if you ran with julia -t6 [...] you would see very similar results, if not something slightly better. I have 4 performance cores, and on my laptop julia -t4 is often the best in terms of performance. Or at the least, utilizing the additional E-cores is not better.

I don’t know what your old processor was, but if it was somewhat recent it probably had, say, four cores that weren’t much worse than the P-cores on your new processor. And if your old laptop was willing to sustain a higher power draw than your new one, that could also really reduce the performance gains you see.


Does Julia take that into account? Maybe -t auto should at least?

Don’t bully me!

I think it probably is due to clock speed and architecture improvements. My Ryzen 9 is almost 4 years old; the Ryzen 7 7800X3D is about 6 months old and has a much higher clock speed.

EDIT: Also, there are lots of external places that provide CPU benchmarks, even if they aren’t perfect. https://cpu.userbenchmark.com/

PSA: userbenchmark is worse than useless if you actually want meaningful comparisons, it’s actually banned from r/hardware at this point.


Julia does not take that into account. Feel free to create a feature request stating that -t auto should create only one thread per performance core…


Doesn’t look like that’s the case; here’s the mean of the top 3 results for varying numbers of threads:

1 => 4.2
2 => 8.0
4 => 15.9
6 => 22.0
10 => 22.1
12 => 23.9
20 => 28.9

Scaling is linear up to 4, and almost up to 6, but the extra threads do add a bit (just not as much as the first 6).

userbenchmark is active misinformation. There’s some sort of bizarre grudge against AMD from whoever is running it.

The Ryzen 9 5950X is basically two Ryzen 7 5700Xs (same architecture, twice the cores), so getting twice the performance of a 5700X makes sense. It seems the newer Ryzen architectures (7XXX on desktop, XX4X on mobile) do a lot better at this task.

@threads creates a closure.
@inbounds doesn’t penetrate closures.
You want

Threads.@threads for ...
    @inbounds begin
        ...
    end
end
Similarly, p wouldn’t be a compile-time constant within the @threads code, but you aren’t using it there at all. It’d only be a problem if you were using it in a way that demanded it be constant, like Val(p), which would be type-unstable otherwise.

I actually don’t see where you’re using it in a constant-demanding fashion outside of the @threads either, so why not just take a plain p instead of ::Val{p} to avoid overspecialization?
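
Applied to the kernel above, the suggested changes might look like this (a sketch: @inbounds moved inside the threaded loop body, and a plain p argument instead of Val{p}; the arithmetic is otherwise unchanged):

```julia
# Sketch: same kernel as matchmeXT, but @inbounds is placed inside the
# @threads body so it applies within the closure the macro creates.
function matchmeXT_fixed(actual_selections, possible_selections, p)
    out = zeros(Int32, p+1, size(possible_selections, 2))
    mask = ~((-1 % UInt64) << 10)   # low 10 bits
    Threads.@threads for i = 1:length(possible_selections.chunks)
        @inbounds begin
            a = possible_selections.chunks[i]
            for jc = 1:1023:length(actual_selections.chunks)
                tmp = 0
                for j = jc:min(jc+1023-1, length(actual_selections.chunks))
                    b = actual_selections.chunks[j]
                    hits = count_ones(a & b)
                    tmp += 1 << ((hits*10) & 63)
                end
                for k = 0:p
                    out[k+1, i] += (tmp >>> (k*10)) & mask
                end
            end
        end
    end
    out
end
```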


Results for 1, 2, 4, 8, 16, 24 and 32 threads

[4.8, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9, 4.9]
[9.4, 9.8, 9.8, 9.7, 9.7, 9.8, 9.7, 9.8, 9.7]
[19.3, 18.5, 19.5, 19.6, 19.3, 19.2, 19.0, 18.4, 19.6]
[35.7, 38.8, 39.1, 36.7, 39.0, 37.0, 39.3, 36.6, 38.8]
[28.1, 45.6, 28.3, 29.5, 43.8, 36.8, 29.0, 43.9, 46.2]
[42.3, 55.1, 54.0, 44.0, 46.1, 43.4, 40.9, 49.2, 55.0]
[57.4, 61.4, 55.5, 58.9, 63.6, 61.7, 62.5, 55.5, 58.7]

on 13th Gen i9-13900 (8 performance cores, 16 efficiency)


Here’s what I see on my desktop Ryzen 5 5600X (6-core Zen 3 processor):

❯ julia --project=. --startup=no -t1 cpu_test.jl
┌ Info: N per ns
│   all = (5.0, 4.9, 5.1, 4.9, 5.0, 5.0, 4.8, 4.9, 5.0)
│   max = 5.1
│   median = 5.0
└   nthreads = 1
❯ julia --project=. --startup=no -t2 cpu_test.jl
┌ Info: N per ns
│   all = (9.8, 10.0, 9.9, 10.0, 10.1, 10.1, 10.1, 10.0, 9.7)
│   max = 10.1
│   median = 10.0
└   nthreads = 2
❯ julia --project=. --startup=no -t4 cpu_test.jl
┌ Info: N per ns
│   all = (19.0, 19.3, 19.3, 19.7, 19.7, 19.8, 19.8, 19.8, 19.1)
│   max = 19.8
│   median = 19.7
└   nthreads = 4
❯ julia --project=. --startup=no -t6 cpu_test.jl
┌ Info: N per ns
│   all = (26.8, 27.2, 28.6, 27.0, 25.9, 28.5, 22.0, 26.6, 27.5)
│   max = 28.6
│   median = 27.0
└   nthreads = 6
❯ julia --project=. --startup=no -t12 cpu_test.jl
┌ Info: N per ns
│   all = (30.7, 29.2, 29.5, 30.5, 30.4, 30.9, 29.6, 30.8, 30.4)
│   max = 30.9
│   median = 30.4
└   nthreads = 12

and here is what I see on my new Ryzen 7 7840U laptop processor (8-core Zen 4 mobile processor):

❯ julia --project=. --startup=no -t1 cpu_test.jl
┌ Info: N per ns
│   all = (9.1, 8.7, 9.3, 9.2, 9.5, 9.2, 9.3, 9.2, 8.8)
│   max = 9.5
│   median = 9.2
└   nthreads = 1
❯ julia --project=. --startup=no -t2 cpu_test.jl
┌ Info: N per ns
│   all = (14.9, 17.8, 18.3, 18.0, 18.1, 17.3, 17.7, 17.4, 17.5)
│   max = 18.3
│   median = 17.7
└   nthreads = 2
❯ julia --project=. --startup=no -t4 cpu_test.jl
┌ Info: N per ns
│   all = (33.1, 33.0, 34.9, 34.2, 34.2, 34.4, 35.0, 34.9, 36.0)
│   max = 36.0
│   median = 34.4
└   nthreads = 4
❯ julia --project=. --startup=no -t6 cpu_test.jl
┌ Info: N per ns
│   all = (50.0, 44.3, 49.0, 51.2, 45.4, 51.2, 49.1, 50.4, 50.6)
│   max = 51.2
│   median = 50.0
└   nthreads = 6
❯ julia --project=. --startup=no -t8 cpu_test.jl
┌ Info: N per ns
│   all = (65.6, 60.4, 65.4, 65.9, 63.2, 60.2, 64.5, 53.6, 61.8)
│   max = 65.9
│   median = 63.2
└   nthreads = 8
❯ julia --project=. --startup=no -t16 cpu_test.jl
┌ Info: N per ns
│   all = (74.3, 67.7, 75.6, 75.6, 71.3, 76.3, 67.9, 73.1, 82.2)
│   max = 82.2
│   median = 74.3
└   nthreads = 16

So as you can see, going with even just a one-generation-newer processor can get you a pretty massive performance uplift. Presumably a well-cooled desktop Zen 5 processor could do even better than the numbers I showed here.
