While it is commented out in his OP, he already has pinthreads(:cores)
there. Given that Threads.nthreads() <= ncores(),
this ensures that the Julia threads are pinned to different cores.
Hey again,
I have had a look into many of your suggestions and would like to follow up.
First, instead of using exp, I now perform some simpler floating-point calculations. It might be that there was indeed a bit of thread scaling left on the table due to the call to exp. The scaling is now slightly improved, at 44.48x for 48 cores.
For reference, here is the slightly simplified code I am trying out:
Summary
using ThreadPinning
pinthreads(:cores)
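# optional sanity check (not in the original script): print the thread -> core
# mapping to confirm the pinning, e.g. with ThreadPinning's threadinfo()
# threadinfo()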
using BenchmarkTools
using LIKWID
function SolveProb!(res, N1)
    Threads.@threads for i in eachindex(res)
        x = 0.0
        for k in 1:N1
            kfl = float(k)
            for j in 1:500
                x += 1e-10 * kfl
            end
        end
        res[i] = x
    end
end
NThreadsMax = 48
res = zeros(NThreadsMax)
N1 = 100_000
BM1 = @perfmon "FLOPS_DP" SolveProb!(res,N1)
BM = @benchmark SolveProb!($res,$N1) samples = 5*Threads.nthreads() evals = 5 seconds = 5*60
##
avg(x) = sum(x)/length(x)
println(BM)
@info "BM" BM times = BM.times Threads.nthreads() TotalCPUtime = avg(BM.times)*Threads.nthreads() average = avg(BM.times) min = minimum(BM.times)
##
Still, I believe I can understand more from this example.
In particular, I tried out LIKWID.jl and MCAnalyzer.jl, although I need a bit of help in interpreting the output. IntelITT seems hard to access, since it requires compiling Julia from scratch with specific flags, which I cannot really do on my HPC cluster. Perhaps there is another way?
Using LIKWID's @perfmon macro, I get the following (here for only four cores, to not clutter this text too much):
Summary
Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┐
│ Event                                    │ Thread 1  │ Thread 2 │ Thread 3 │ Thread 4 │
├──────────────────────────────────────────┼───────────┼──────────┼──────────┼──────────┤
│ INSTR_RETIRED_ANY                        │ 2.67259e8 │ 0.0      │ 0.0      │ 0.0      │
│ CPU_CLK_UNHALTED_CORE                    │ 2.49107e8 │ 0.0      │ 0.0      │ 0.0      │
│ CPU_CLK_UNHALTED_REF                     │ 2.13958e8 │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE      │ 5.16819e7 │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
└──────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┘
┌──────────────────────┬──────────┬──────────┬──────────┬──────────┐
│ Metric               │ Thread 1 │ Thread 2 │ Thread 3 │ Thread 4 │
├──────────────────────┼──────────┼──────────┼──────────┼──────────┤
│ Runtime (RDTSC) [s]  │ 0.134637 │ 0.134637 │ 0.134637 │ 0.134637 │
│ Runtime unhalted [s] │ 0.103796 │ 0.0      │ 0.0      │ 0.0      │
│ Clock [MHz]          │ 2794.24  │ NaN      │ NaN      │ NaN      │
│ CPI                  │ 0.93208  │ NaN      │ NaN      │ NaN      │
│ DP [MFLOP/s]         │ 383.862  │ 0.0      │ 0.0      │ 0.0      │
│ AVX DP [MFLOP/s]     │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ AVX512 DP [MFLOP/s]  │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ Packed [MUOPS/s]     │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ Scalar [MUOPS/s]     │ 383.862  │ 0.0      │ 0.0      │ 0.0      │
│ Vectorization ratio  │ 0.0      │ NaN      │ NaN      │ NaN      │
└──────────────────────┴──────────┴──────────┴──────────┴──────────┘
Can anyone tell me how to interpret this? To me it looks as if the other threads are not even doing anything, which is of course nonsense, since then there would be no speedup at all.
Also, which of these metrics can I use to detect problems that hinder perfect speedup?
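One sanity check I can do myself is to compare the counters against a back-of-the-envelope FLOP count for the kernel (my own rough estimate, assuming the compiler performs neither SIMD vectorization nor any algebraic simplification of the inner loop): each inner iteration is one multiply and one add. Comparing this with the FP_ARITH_INST_RETIRED_SCALAR_DOUBLE counts, or with DP [MFLOP/s] times the RDTSC runtime, should at least tell me whether the counters saw all of the work.
# Rough FLOP estimate for one call to SolveProb!, assuming no vectorization
# and no simplification of the inner loop by the compiler.
nelem, N1, ninner = 48, 100_000, 500                  # length(res) and the loop bounds above
flops_total       = nelem * N1 * ninner * 2           # one mul + one add per inner iteration, ~4.8e9
flops_per_thread  = flops_total / Threads.nthreads()  # expected scalar-double count per thread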
In terms of MCAnalyzer, there is also not too much for me to say:
Summary
julia> analyze(SolveProb!,(Vector{Float64},Int))
warning: found a call in the input assembly sequence.
note: call instructions are not correctly modeled. Assume a latency of 100cy.
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
Iterations: 100
Instructions: 6800
Total Cycles: 5368
Total uOps: 10200
Dispatch Width: 6
uOps Per Cycle: 1.90
IPC: 1.27
Block RThroughput: 29.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
3 2 1.00 * pushq %r15
3 2 1.00 * pushq %r14
3 2 1.00 * pushq %r13
3 2 1.00 * pushq %r12
3 2 1.00 * pushq %rbx
1 1 0.25 subq $48, %rsp
1 1 0.25 movq %rsi, %r14
1 1 0.25 movq %rdi, %r15
1 0 0.17 vxorps %xmm0, %xmm0, %xmm0
2 1 1.00 * vmovaps %xmm0, (%rsp)
1 1 0.25 movabsq $23008863367056, %rbx
1 1 1.00 * movq $0, 16(%rsp)
1 5 0.50 * movq %fs:0, %rax
1 5 0.50 * movq -8(%rax), %r12
1 1 1.00 * movq $4, (%rsp)
1 5 0.50 * movq (%r12), %rax
1 1 1.00 * movq %rax, 8(%rsp)
1 1 0.25 movq %rsp, %rax
1 1 1.00 * movq %rax, (%r12)
1 5 0.50 * movq 24(%rdi), %r13
2 6 0.50 * cmpw $0, -4(%r12)
1 1 0.50 jne .LBB0_3
1 1 0.50 leaq 2096615768(%rbx), %rax
4 3 1.00 callq *%rax
1 1 0.25 testl %eax, %eax
1 1 0.50 je .LBB0_2
1 5 0.50 * movq 16(%r12), %rdi
1 1 0.25 movl $1416, %esi
1 1 0.25 movl $32, %edx
4 3 1.00 callq jl_gc_pool_alloc@PLT
1 1 1.00 * movq %rbx, -8(%rax)
1 1 1.00 * movq %r15, (%rax)
1 1 1.00 * movq %r14, 8(%rax)
1 1 1.00 * movq %r13, 16(%rax)
1 1 1.00 * movq %rax, 16(%rsp)
1 1 1.00 * movq %rax, 32(%rsp)
1 1 0.25 addq $1553984672, %rbx
1 1 1.00 * movq %rbx, 40(%rsp)
1 1 0.50 leaq 32(%rsp), %rsi
1 0 0.17 xorl %edi, %edi
1 1 0.25 movl $2, %edx
4 3 1.00 callq jl_f__call_latest@PLT
1 5 0.50 * movq 8(%rsp), %rax
1 1 1.00 * movq %rax, (%r12)
1 1 0.25 addq $48, %rsp
2 6 0.50 * popq %rbx
2 6 0.50 * popq %r12
2 6 0.50 * popq %r13
2 6 0.50 * popq %r14
2 6 0.50 * popq %r15
3 7 1.00 U retq
1 5 0.50 * movq 16(%r12), %rdi
1 1 0.25 movl $1416, %esi
1 1 0.25 movl $32, %edx
4 3 1.00 callq jl_gc_pool_alloc@PLT
1 1 1.00 * movq %rbx, -8(%rax)
1 1 1.00 * movq %r15, (%rax)
1 1 1.00 * movq %r14, 8(%rax)
1 1 1.00 * movq %r13, 16(%rax)
1 1 1.00 * movq %rax, 16(%rsp)
1 1 1.00 * movq %rax, 32(%rsp)
1 1 0.50 leaq 1614324336(%rbx), %rdi
1 1 0.25 addq $1737599200, %rbx
1 1 0.50 leaq 32(%rsp), %rsi
1 1 0.25 movl $1, %edx
1 1 0.25 movq %rbx, %rcx
4 3 1.00 callq jl_invoke@PLT
1 1 0.50 jmp .LBB0_4
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 11.34 11.65 14.33 14.68 29.00 11.65 11.36 13.99
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.95 0.01 0.33 1.00 0.03 0.02 0.66 pushq %r15
- - 0.94 0.04 0.65 0.34 1.00 0.01 0.01 0.01 pushq %r14
- - 0.01 0.64 - 0.01 1.00 0.34 0.01 0.99 pushq %r13
- - 0.33 0.33 0.33 0.34 1.00 0.33 0.01 0.33 pushq %r12
- - 0.33 0.33 0.66 - 1.00 - 0.34 0.34 pushq %rbx
- - - 0.64 - - - 0.02 0.34 - subq $48, %rsp
- - 0.01 0.01 - - - 0.32 0.66 - movq %rsi, %r14
- - 0.03 0.64 - - - 0.01 0.32 - movq %rdi, %r15
- - - - - - - - - - vxorps %xmm0, %xmm0, %xmm0
- - - - 0.01 0.01 1.00 - - 0.98 vmovaps %xmm0, (%rsp)
- - 0.34 0.03 - - - 0.31 0.32 - movabsq $23008863367056, %rbx
- - - - 0.64 0.01 1.00 - - 0.35 movq $0, 16(%rsp)
- - - - 0.35 0.65 - - - - movq %fs:0, %rax
- - - - 0.66 0.34 - - - - movq -8(%rax), %r12
- - - - 0.02 0.33 1.00 - - 0.65 movq $4, (%rsp)
- - - - 0.01 0.99 - - - - movq (%r12), %rax
- - - - 0.34 0.33 1.00 - - 0.33 movq %rax, 8(%rsp)
- - 0.63 0.02 - - - 0.34 0.01 - movq %rsp, %rax
- - - - 0.33 0.33 1.00 - - 0.34 movq %rax, (%r12)
- - - - 0.35 0.65 - - - - movq 24(%rdi), %r13
- - 0.35 0.01 0.99 0.01 - 0.32 0.32 - cmpw $0, -4(%r12)
- - 0.68 - - - - - 0.32 - jne .LBB0_3
- - - 0.33 - - - 0.67 - - leaq 2096615768(%rbx), %rax
- - 0.01 0.34 0.33 0.33 1.00 0.65 1.00 0.34 callq *%rax
- - 0.66 0.01 - - - - 0.33 - testl %eax, %eax
- - 0.68 - - - - - 0.32 - je .LBB0_2
- - - - 0.01 0.99 - - - - movq 16(%r12), %rdi
- - 0.01 0.32 - - - 0.02 0.65 - movl $1416, %esi
- - - 0.34 - - - 0.64 0.02 - movl $32, %edx
- - 1.00 0.96 0.32 0.01 1.00 0.03 0.01 0.67 callq jl_gc_pool_alloc@PLT
- - - - 0.02 0.66 1.00 - - 0.32 movq %rbx, -8(%rax)
- - - - 0.33 - 1.00 - - 0.67 movq %r15, (%rax)
- - - - - 0.65 1.00 - - 0.35 movq %r14, 8(%rax)
- - - - 0.64 0.02 1.00 - - 0.34 movq %r13, 16(%rax)
- - - - 0.02 0.34 1.00 - - 0.64 movq %rax, 16(%rsp)
- - - - 0.33 0.64 1.00 - - 0.03 movq %rax, 32(%rsp)
- - - 0.32 - - - 0.35 0.33 - addq $1553984672, %rbx
- - - - 0.64 0.03 1.00 - - 0.33 movq %rbx, 40(%rsp)
- - - 0.34 - - - 0.66 - - leaq 32(%rsp), %rsi
- - - - - - - - - - xorl %edi, %edi
- - 0.04 - - - - 0.31 0.65 - movl $2, %edx
- - 0.67 0.63 0.33 - 1.00 0.35 0.35 0.67 callq jl_f__call_latest@PLT
- - - - 0.34 0.66 - - - - movq 8(%rsp), %rax
- - - - 0.34 0.01 1.00 - - 0.65 movq %rax, (%r12)
- - 0.03 0.01 - - - 0.64 0.32 - addq $48, %rsp
- - - 0.64 0.34 0.66 - 0.02 0.34 - popq %rbx
- - - 0.35 0.33 0.67 - 0.64 0.01 - popq %r12
- - - 0.34 0.02 0.98 - 0.64 0.02 - popq %r13
- - 0.01 0.34 0.02 0.98 - 0.01 0.64 - popq %r14
- - 0.33 0.01 0.98 0.02 - 0.63 0.03 - popq %r15
- - 0.63 0.02 0.67 0.33 - 0.35 1.00 - retq
- - - - 0.98 0.02 - - - - movq 16(%r12), %rdi
- - 0.34 0.64 - - - 0.01 0.01 - movl $1416, %esi
- - 0.63 0.02 - - - 0.34 0.01 - movl $32, %edx
- - 0.96 0.63 - 0.32 1.00 0.37 0.04 0.68 callq jl_gc_pool_alloc@PLT
- - - - - - 1.00 - - 1.00 movq %rbx, -8(%rax)
- - - - 0.64 0.36 1.00 - - - movq %r15, (%rax)
- - - - 0.36 - 1.00 - - 0.64 movq %r14, 8(%rax)
- - - - 0.32 0.32 1.00 - - 0.36 movq %r13, 16(%rax)
- - - - 0.32 0.36 1.00 - - 0.32 movq %rax, 16(%rsp)
- - - - 0.35 0.32 1.00 - - 0.33 movq %rax, 32(%rsp)
- - - 0.35 - - - 0.65 - - leaq 1614324336(%rbx), %rdi
- - 0.33 0.01 - - - - 0.66 - addq $1737599200, %rbx
- - - 0.37 - - - 0.63 - - leaq 32(%rsp), %rsi
- - - 0.66 - - - 0.01 0.33 - movl $1, %edx
- - 0.63 0.02 - - - 0.35 - - movq %rbx, %rcx
- - 0.36 0.01 - 0.33 1.00 0.65 0.98 0.67 callq jl_invoke@PLT
- - 0.37 - - - - - 0.63 - jmp .LBB0_4
There are two warnings:
warning: found a call in the input assembly sequence.
note: call instructions are not correctly modeled. Assume a latency of 100cy.
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
Inspecting the LLVM IR, I think this refers to calls in which the integers are converted to floats:
L13: ; preds = %top
%16 = call i32 inttoptr (i64 23010959982824 to i32 ()*)()
; ┌ @ operators.jl:278 within `!=`
; │┌ @ promotion.jl:418 within `==` @ promotion.jl:468
%.not16 = icmp eq i32 %16, 0
but I do not know if this is the issue.
Regarding the return instruction in the input assembly sequence, I have absolutely no idea what it could mean.
All things said, I am still wondering: can anyone provide an example of code that actually scales nearly optimally with the number of cores? I think seeing one example where it really works might help immensely, since I could then slowly make changes to it and see at which points the scaling regressions occur.
There are just so many things that can affect multithreaded performance scaling. There's a heavy interplay between your code and your hardware, and some effects are purely hardware-dependent. Just off the top of my head:
- Modern processors might up- or down-clock based upon the workload. For example, some CPUs use a higher clock rate when there's only 1 active core, while others might down-clock if lots of cores are using particularly expensive instruction sets like AVX-512. This is more common on consumer-grade CPUs, but some server parts do it, too.
- Memory bandwidth and cache size effects
- False sharing… this can be particularly pernicious with two-socket configurations, but has big effects on performance (this one won't be just a few percent left on the floor); see the sketch after this list.
- Everything else your system is doing gets in the way more frequently as you use more cores, too.
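To illustrate the false-sharing point with a minimal sketch (my own toy example, unrelated to the benchmark above, and assuming 64-byte cache lines): if every thread repeatedly updates neighbouring elements of a shared array, the slots share cache lines and the cores keep invalidating each other; spacing the slots one cache line apart (or accumulating in a thread-local variable, as SolveProb! already does) removes the contention.
using Base.Threads

# A @noinline helper so the compiler cannot keep a[i] in a register across
# iterations; every call really loads from and stores to memory.
@noinline bump!(a, i) = (a[i] += 1.0; nothing)

# Contended: the threads update adjacent Float64 slots, which share cache lines.
function contended!(acc, n)
    @threads for t in 1:nthreads()
        for _ in 1:n
            bump!(acc, t)
        end
    end
end

# Padded: slots are spaced 8 Float64s (64 bytes) apart, one cache line each.
function padded!(acc, n)
    @threads for t in 1:nthreads()
        for _ in 1:n
            bump!(acc, 8t)
        end
    end
end

# first calls include compilation time; run twice for a cleaner comparison
acc1 = zeros(nthreads());     @time contended!(acc1, 10^7)
acc2 = zeros(8 * nthreads()); @time padded!(acc2, 10^7)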
Right, this is exactly the reason why I want to start from a scenario which is as ideal as it can be.
Maybe I should be explicit about the things I think I am covering by now with my example.
The “ideal” program I have come up with
- has a number of tasks which is a multiple of the number of cores, all of which should take the same time → no thread should have to wait much for the others to finish (see the timing sketch after this list).
- has a noticeable workload for each thread → the overhead of creating threads should be negligible.
- does not read a lot from memory → memory bandwidth and cache size effects should not be the limiting factor.
- does not write a lot to memory → false sharing should not be the limiting factor.
- runs on a professional HPC cluster using Slurm → the influence of any OS interference should be minimal. (I sincerely hope that they also do not do any shenanigans with the CPU clock speed, otherwise optimizing anything would be an utter nightmare.)
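As a quick check of the first point, here is a sketch I put together (not part of the benchmark above) that records how much time each thread spends inside the parallel loop, using a static schedule so that threadid() is stable for each chunk; a small spread between the slowest and fastest thread means nobody is waiting long at the end of the loop.
# Per-thread time spent inside the parallel loop (static schedule assumed).
function SolveProbTimed!(res, N1, tthread)
    Threads.@threads :static for i in eachindex(res)
        t0 = time_ns()
        x = 0.0
        for k in 1:N1
            kfl = float(k)
            for j in 1:500
                x += 1e-10 * kfl
            end
        end
        res[i] = x
        # accumulate the seconds this thread spent on its iterations
        tthread[Threads.threadid()] += (time_ns() - t0) / 1e9
    end
    return tthread
end

tthread = zeros(Threads.nthreads())
SolveProbTimed!(zeros(48), 100_000, tthread)
@show extrema(tthread)   # small spread => well-balanced load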
As I see it, all of these conditions should be fulfilled in my minimal working example. If any of them is not, I would be delighted to hear suggestions on how to better approximate the ideal case.
This should in principle mean that there is some other, even more nontrivial effect at play here; however, I am somewhat running out of ideas.
Sorry if I missed it earlier in the thread, but have you actually measured the clock speeds across cores throughout your benchmarks? It should be easy to confirm. I thought most modern CPUs would enable higher frequencies when fewer cores are utilized.
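For example, on Linux the current frequency of every core can be read from the cpufreq sysfs interface while the benchmark is running (a sketch; whether these files exist depends on how the compute node is configured):
# Current frequency of each online core in MHz, read from sysfs (values there are in kHz).
function core_frequencies()
    base = "/sys/devices/system/cpu"
    freqs = Dict{Int,Float64}()
    for dir in filter(d -> occursin(r"^cpu\d+$", d), readdir(base))
        f = joinpath(base, dir, "cpufreq", "scaling_cur_freq")
        isfile(f) || continue
        freqs[parse(Int, dir[4:end])] = parse(Int, readchomp(f)) / 1000
    end
    return freqs
end

core_frequencies()   # e.g. call this periodically from a second task or process during the benchmark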
Good that you made sure; it was not really mentioned before (at least by me).
Wow, it seems you have finally found my problem!
I have taken a closer look at LIKWID, and it indeed gives different clock speeds depending on the number of cores I assign to the calculation.
If I use 1 core, the CPU clock speed is at 3046.85 MHz, while for 48 cores it is only 2826.02 MHz. It looks like this accounts almost exactly for the speedup I am still missing: the “speedup” divided by the number of cores was 44.5/48 = 0.927, while the ratio between the clock speeds is 2826.02/3046.85 = 0.928…
Thank you so much. This is quite embarrassing, but I had no idea a cluster would change the clock speed of a job like that.
It seems like I have to change the way I am doing my scaling benchmark. Does anyone know if there is a way to fix the CPU frequency? If not, how do people usually do scaling tests of HPC code?
You might be able to turn off turbo, but in reality it's just a hard problem. Modern chips are typically power/thermal constrained, so doing more work will somewhat inevitably lower the clocks. One thing you can do is take measurements in terms of cycles, which will account for clock differences, but that will change your memory-to-CPU speed ratio, so it's not perfect.
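To illustrate the cycles idea with a small sketch (the two clock values are the ones quoted above; the runtimes are hypothetical placeholders, not measurements from this thread):
# Converting wall time into core cycles makes runs at different frequencies comparable.
cycles(runtime_s, clock_MHz) = runtime_s * clock_MHz * 1e6

t1,  f1  = 7.0,   3046.85   # hypothetical 1-core runtime [s], measured 1-core clock [MHz]
t48, f48 = 0.157, 2826.02   # hypothetical 48-core runtime [s], measured 48-core clock [MHz]

speedup_walltime = t1 / t48
speedup_cycles   = cycles(t1, f1) / cycles(t48, f48)   # frequency-corrected speedup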
You could also benchmark by giving the other cores garbage work (e.g. 1 core solving your problem, the rest factoring a large number); see the sketch below.
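For instance, something along these lines (a rough sketch with a trivial busy loop standing in for the garbage work; exact task placement is up to the scheduler, so it would be worth verifying which thread ends up running what):
using Base.Threads

# Serial version of the kernel (no internal @threads), to be timed on a single thread.
function solve_one(N1)
    x = 0.0
    for k in 1:N1
        kfl = float(k)
        for j in 1:500
            x += 1e-10 * kfl
        end
    end
    return x
end

# Garbage work that keeps a core busy until `stop` is set.
function busywork(stop::Threads.Atomic{Bool})
    s = 0.0
    while !stop[]
        for k in 1:10_000
            s += 1e-10 * k
        end
    end
    return s
end

stop = Threads.Atomic{Bool}(false)
junk = [Threads.@spawn busywork(stop) for _ in 1:nthreads()-1]   # occupy the other threads
sleep(0.5)                                                       # let the garbage work ramp up

t_loaded = @elapsed solve_one(100_000)   # one core solving the problem while the rest are loaded

stop[] = true
foreach(wait, junk)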
Hm, yeah, that might be possible, although I am not so sure how I can restrict how many threads are used for such an example. In particular, I think it would be good to find a way that does not require very significant code changes, as those might of course also affect the applicability of the benchmark in some other way.
Kind of amazing that the ratios are that exact! Also nice to know that the efficiency is actually so close to perfect.
It seems that on my cluster, this issue can be mitigated somewhat by using the Slurm option
#SBATCH --disable-turbomode
when submitting the job. This prevents the frequency from being boosted when the temperature is low enough, i.e. at small load on the node.
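To double-check from inside the job that turbo is really off (a sketch assuming a Linux node with either the intel_pstate driver or the generic cpufreq boost switch; the paths may differ on your cluster):
# no_turbo = 1 or boost = 0 means turbo/boost is disabled
for f in ("/sys/devices/system/cpu/intel_pstate/no_turbo",
          "/sys/devices/system/cpu/cpufreq/boost")
    isfile(f) && println(f, " = ", readchomp(f))
end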
The CPU frequency is still not completely stable, but the naive scaling is now nearly perfect:
1 => 1.0
2 => 1.9997469597122512
24 => 23.952572231631184
48 => 47.6049161324683
Correcting for the still slightly different CPU frequency ratios, I get as close to perfect scaling as is reasonable:
1 => 1.0
2 => 1.99889
24 => 23.8487
48 => 47.874
Thanks again to everyone for their help. I can finally get back to improving the scaling of my original code (which of course should now also be quite a bit better).
I think that README is outdated and it shouldn't be required anymore. Though admittedly, I always compile locally, since I switch versions quite often.
Oh THAT is a good one - I always forget about it!
There are ways to fix the turbo issues without sacrificing performance (e.g. see here for what I use locally), but that is definitely something you'll want to discuss with the admins of your cluster before using it. Setting the frequency governor to performance gives me a VERY stable peak frequency, at least, but you have to be really careful with the thermals of the system. Make sure that your setup can cool the CPU appropriately, to a temperature below the maximum safe one recommended by the manufacturer, before using this under sustained heavy load.
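For reference, the governor currently in use can be checked from user space (a sketch assuming the standard Linux cpufreq sysfs layout; actually changing it usually requires root, so it is something for the cluster admins):
# Governor of cpu0 as a representative core; "performance" keeps the core at
# its highest allowed frequency instead of scaling down.
readchomp("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")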
I'm surprised this was the problem, to be honest; I'd expect a professional cluster to know about and manage that setting explicitly already.
Now those numbers look much better
I believe I read somewhere that it should not be necessary from Julia 1.8 onwards. If that's the case, I will just wait until the admins update the version.
I'll talk to the admins about options to keep the frequency stable; I found that the benchmarking is still a bit erratic, especially after I included LoopVectorization as well.
It isn't required anymore for Julia >= 1.9 (see e.g. Using the Intel VTune Profiler with julia - #26 by vchuravy). I've updated the README appropriately.