While it is commented out in his OP, he already has pinthreads(:cores)
there. Given that Threads.nthreads() <= ncores(),
this ensures that the Julia threads are pinned to different cores.
Hey again,
I have had a look into many of your suggestions and would like to follow up.
First, instead of using exp, I now perform some simpler floating-point calculations. It might be that there was indeed a bit of thread scaling left on the table due to the call to exp. The scaling is now slightly improved, at 44.48x for 48 cores.
For reference, here is the slightly simplified code I am trying out:
Summary
using ThreadPinning
pinthreads(:cores)
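# optional sanity check (not in the original script): print the thread -> core
# mapping to confirm the pinning, e.g. with ThreadPinning's threadinfo()
# threadinfo()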
using BenchmarkTools
using LIKWID
function SolveProb!(res, N1)
    Threads.@threads for i in eachindex(res)
        x = 0.0
        for k in 1:N1
            kfl = float(k)
            for j in 1:500
                x += 1e-10 * kfl
            end
        end
        res[i] = x
    end
end
NThreadsMax = 48
res = zeros(NThreadsMax)
N1 = 100_000
BM1 = @perfmon "FLOPS_DP" SolveProb!(res,N1)
BM = @benchmark SolveProb!($res,$N1) samples = 5*Threads.nthreads() evals = 5 seconds = 5*60
##
avg(x) = sum(x)/length(x)
println(BM)
@info "BM" BM times = BM.times Threads.nthreads() TotalCPUtime = avg(BM.times)*Threads.nthreads() average = avg(BM.times) min = minimum(BM.times)
##
Still, I believe I can understand more from this example.
In particular, I tried out LIKWID.jl and MCAnalyzer.jl, although I need a bit of help in interpreting the output. IntelITT seems hard to access, since it requires compiling Julia from scratch with specific flags, which I cannot really do on my HPC cluster. Perhaps there is another way?
Using LIKWID's @perfmon macro, I get the following (here for only four cores, to not clutter this text too much):
Summary
Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┐
│ Event                                    │ Thread 1  │ Thread 2 │ Thread 3 │ Thread 4 │
├──────────────────────────────────────────┼───────────┼──────────┼──────────┼──────────┤
│ INSTR_RETIRED_ANY                        │ 2.67259e8 │ 0.0      │ 0.0      │ 0.0      │
│ CPU_CLK_UNHALTED_CORE                    │ 2.49107e8 │ 0.0      │ 0.0      │ 0.0      │
│ CPU_CLK_UNHALTED_REF                     │ 2.13958e8 │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE      │ 5.16819e7 │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0       │ 0.0      │ 0.0      │ 0.0      │
└──────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┘
┌──────────────────────┬──────────┬──────────┬──────────┬──────────┐
│ Metric               │ Thread 1 │ Thread 2 │ Thread 3 │ Thread 4 │
├──────────────────────┼──────────┼──────────┼──────────┼──────────┤
│ Runtime (RDTSC) [s]  │ 0.134637 │ 0.134637 │ 0.134637 │ 0.134637 │
│ Runtime unhalted [s] │ 0.103796 │ 0.0      │ 0.0      │ 0.0      │
│ Clock [MHz]          │ 2794.24  │ NaN      │ NaN      │ NaN      │
│ CPI                  │ 0.93208  │ NaN      │ NaN      │ NaN      │
│ DP [MFLOP/s]         │ 383.862  │ 0.0      │ 0.0      │ 0.0      │
│ AVX DP [MFLOP/s]     │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ AVX512 DP [MFLOP/s]  │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ Packed [MUOPS/s]     │ 0.0      │ 0.0      │ 0.0      │ 0.0      │
│ Scalar [MUOPS/s]     │ 383.862  │ 0.0      │ 0.0      │ 0.0      │
│ Vectorization ratio  │ 0.0      │ NaN      │ NaN      │ NaN      │
└──────────────────────┴──────────┴──────────┴──────────┴──────────┘
Can anyone tell me how to interpret this? To me it looks as if the other threads are not even doing anything, which is of course nonsense, since then there would be no speedup at all.
Also, which of these metrics can I use to detect problems that hinder perfect speedup?
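One sanity check I can do myself is to compare the counters against a back-of-the-envelope FLOP count for the kernel (my own rough estimate, assuming the compiler performs neither SIMD vectorization nor any algebraic simplification of the inner loop): each inner iteration is one multiply and one add. Comparing this with the FP_ARITH_INST_RETIRED_SCALAR_DOUBLE counts, or with DP [MFLOP/s] times the RDTSC runtime, should at least tell me whether the counters saw all of the work.
# Rough FLOP estimate for one call to SolveProb!, assuming no vectorization
# and no simplification of the inner loop by the compiler.
nelem, N1, ninner = 48, 100_000, 500                  # length(res) and the loop bounds above
flops_total       = nelem * N1 * ninner * 2           # one mul + one add per inner iteration, ~4.8e9
flops_per_thread  = flops_total / Threads.nthreads()  # expected scalar-double count per thread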
In terms of MCAnalyzer, there is also not too much for me to say:
Summary
julia> analyze(SolveProb!,(Vector{Float64},Int))
warning: found a call in the input assembly sequence.
note: call instructions are not correctly modeled. Assume a latency of 100cy.
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
Iterations: 100
Instructions: 6800
Total Cycles: 5368
Total uOps: 10200
Dispatch Width: 6
uOps Per Cycle: 1.90
IPC: 1.27
Block RThroughput: 29.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
3 2 1.00 * pushq %r15
3 2 1.00 * pushq %r14
3 2 1.00 * pushq %r13
3 2 1.00 * pushq %r12
3 2 1.00 * pushq %rbx
1 1 0.25 subq $48, %rsp
1 1 0.25 movq %rsi, %r14
1 1 0.25 movq %rdi, %r15
1 0 0.17 vxorps %xmm0, %xmm0, %xmm0
2 1 1.00 * vmovaps %xmm0, (%rsp)
1 1 0.25 movabsq $23008863367056, %rbx
1 1 1.00 * movq $0, 16(%rsp)
1 5 0.50 * movq %fs:0, %rax
1 5 0.50 * movq -8(%rax), %r12
1 1 1.00 * movq $4, (%rsp)
1 5 0.50 * movq (%r12), %rax
1 1 1.00 * movq %rax, 8(%rsp)
1 1 0.25 movq %rsp, %rax
1 1 1.00 * movq %rax, (%r12)
1 5 0.50 * movq 24(%rdi), %r13
2 6 0.50 * cmpw $0, -4(%r12)
1 1 0.50 jne .LBB0_3
1 1 0.50 leaq 2096615768(%rbx), %rax
4 3 1.00 callq *%rax
1 1 0.25 testl %eax, %eax
1 1 0.50 je .LBB0_2
1 5 0.50 * movq 16(%r12), %rdi
1 1 0.25 movl $1416, %esi
1 1 0.25 movl $32, %edx
4 3 1.00 callq jl_gc_pool_alloc@PLT
1 1 1.00 * movq %rbx, -8(%rax)
1 1 1.00 * movq %r15, (%rax)
1 1 1.00 * movq %r14, 8(%rax)
1 1 1.00 * movq %r13, 16(%rax)
1 1 1.00 * movq %rax, 16(%rsp)
1 1 1.00 * movq %rax, 32(%rsp)
1 1 0.25 addq $1553984672, %rbx
1 1 1.00 * movq %rbx, 40(%rsp)
1 1 0.50 leaq 32(%rsp), %rsi
1 0 0.17 xorl %edi, %edi
1 1 0.25 movl $2, %edx
4 3 1.00 callq jl_f__call_latest@PLT
1 5 0.50 * movq 8(%rsp), %rax
1 1 1.00 * movq %rax, (%r12)
1 1 0.25 addq $48, %rsp
2 6 0.50 * popq %rbx
2 6 0.50 * popq %r12
2 6 0.50 * popq %r13
2 6 0.50 * popq %r14
2 6 0.50 * popq %r15
3 7 1.00 U retq
1 5 0.50 * movq 16(%r12), %rdi
1 1 0.25 movl $1416, %esi
1 1 0.25 movl $32, %edx
4 3 1.00 callq jl_gc_pool_alloc@PLT
1 1 1.00 * movq %rbx, -8(%rax)
1 1 1.00 * movq %r15, (%rax)
1 1 1.00 * movq %r14, 8(%rax)
1 1 1.00 * movq %r13, 16(%rax)
1 1 1.00 * movq %rax, 16(%rsp)
1 1 1.00 * movq %rax, 32(%rsp)
1 1 0.50 leaq 1614324336(%rbx), %rdi
1 1 0.25 addq $1737599200, %rbx
1 1 0.50 leaq 32(%rsp), %rsi
1 1 0.25 movl $1, %edx
1 1 0.25 movq %rbx, %rcx
4 3 1.00 callq jl_invoke@PLT
1 1 0.50 jmp .LBB0_4
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 11.34 11.65 14.33 14.68 29.00 11.65 11.36 13.99
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.95 0.01 0.33 1.00 0.03 0.02 0.66 pushq %r15
- - 0.94 0.04 0.65 0.34 1.00 0.01 0.01 0.01 pushq %r14
- - 0.01 0.64 - 0.01 1.00 0.34 0.01 0.99 pushq %r13
- - 0.33 0.33 0.33 0.34 1.00 0.33 0.01 0.33 pushq %r12
- - 0.33 0.33 0.66 - 1.00 - 0.34 0.34 pushq %rbx
- - - 0.64 - - - 0.02 0.34 - subq $48, %rsp
- - 0.01 0.01 - - - 0.32 0.66 - movq %rsi, %r14
- - 0.03 0.64 - - - 0.01 0.32 - movq %rdi, %r15
- - - - - - - - - - vxorps %xmm0, %xmm0, %xmm0
- - - - 0.01 0.01 1.00 - - 0.98 vmovaps %xmm0, (%rsp)
- - 0.34 0.03 - - - 0.31 0.32 - movabsq $23008863367056, %rbx
- - - - 0.64 0.01 1.00 - - 0.35 movq $0, 16(%rsp)
- - - - 0.35 0.65 - - - - movq %fs:0, %rax
- - - - 0.66 0.34 - - - - movq -8(%rax), %r12
- - - - 0.02 0.33 1.00 - - 0.65 movq $4, (%rsp)
- - - - 0.01 0.99 - - - - movq (%r12), %rax
- - - - 0.34 0.33 1.00 - - 0.33 movq %rax, 8(%rsp)
- - 0.63 0.02 - - - 0.34 0.01 - movq %rsp, %rax
- - - - 0.33 0.33 1.00 - - 0.34 movq %rax, (%r12)
- - - - 0.35 0.65 - - - - movq 24(%rdi), %r13
- - 0.35 0.01 0.99 0.01 - 0.32 0.32 - cmpw $0, -4(%r12)
- - 0.68 - - - - - 0.32 - jne .LBB0_3
- - - 0.33 - - - 0.67 - - leaq 2096615768(%rbx), %rax
- - 0.01 0.34 0.33 0.33 1.00 0.65 1.00 0.34 callq *%rax
- - 0.66 0.01 - - - - 0.33 - testl %eax, %eax
- - 0.68 - - - - - 0.32 - je .LBB0_2
- - - - 0.01 0.99 - - - - movq 16(%r12), %rdi
- - 0.01 0.32 - - - 0.02 0.65 - movl $1416, %esi
- - - 0.34 - - - 0.64 0.02 - movl $32, %edx
- - 1.00 0.96 0.32 0.01 1.00 0.03 0.01 0.67 callq jl_gc_pool_alloc@PLT
- - - - 0.02 0.66 1.00 - - 0.32 movq %rbx, -8(%rax)
- - - - 0.33 - 1.00 - - 0.67 movq %r15, (%rax)
- - - - - 0.65 1.00 - - 0.35 movq %r14, 8(%rax)
- - - - 0.64 0.02 1.00 - - 0.34 movq %r13, 16(%rax)
- - - - 0.02 0.34 1.00 - - 0.64 movq %rax, 16(%rsp)
- - - - 0.33 0.64 1.00 - - 0.03 movq %rax, 32(%rsp)
- - - 0.32 - - - 0.35 0.33 - addq $1553984672, %rbx
- - - - 0.64 0.03 1.00 - - 0.33 movq %rbx, 40(%rsp)
- - - 0.34 - - - 0.66 - - leaq 32(%rsp), %rsi
- - - - - - - - - - xorl %edi, %edi
- - 0.04 - - - - 0.31 0.65 - movl $2, %edx
- - 0.67 0.63 0.33 - 1.00 0.35 0.35 0.67 callq jl_f__call_latest@PLT
- - - - 0.34 0.66 - - - - movq 8(%rsp), %rax
- - - - 0.34 0.01 1.00 - - 0.65 movq %rax, (%r12)
- - 0.03 0.01 - - - 0.64 0.32 - addq $48, %rsp
- - - 0.64 0.34 0.66 - 0.02 0.34 - popq %rbx
- - - 0.35 0.33 0.67 - 0.64 0.01 - popq %r12
- - - 0.34 0.02 0.98 - 0.64 0.02 - popq %r13
- - 0.01 0.34 0.02 0.98 - 0.01 0.64 - popq %r14
- - 0.33 0.01 0.98 0.02 - 0.63 0.03 - popq %r15
- - 0.63 0.02 0.67 0.33 - 0.35 1.00 - retq
- - - - 0.98 0.02 - - - - movq 16(%r12), %rdi
- - 0.34 0.64 - - - 0.01 0.01 - movl $1416, %esi
- - 0.63 0.02 - - - 0.34 0.01 - movl $32, %edx
- - 0.96 0.63 - 0.32 1.00 0.37 0.04 0.68 callq jl_gc_pool_alloc@PLT
- - - - - - 1.00 - - 1.00 movq %rbx, -8(%rax)
- - - - 0.64 0.36 1.00 - - - movq %r15, (%rax)
- - - - 0.36 - 1.00 - - 0.64 movq %r14, 8(%rax)
- - - - 0.32 0.32 1.00 - - 0.36 movq %r13, 16(%rax)
- - - - 0.32 0.36 1.00 - - 0.32 movq %rax, 16(%rsp)
- - - - 0.35 0.32 1.00 - - 0.33 movq %rax, 32(%rsp)
- - - 0.35 - - - 0.65 - - leaq 1614324336(%rbx), %rdi
- - 0.33 0.01 - - - - 0.66 - addq $1737599200, %rbx
- - - 0.37 - - - 0.63 - - leaq 32(%rsp), %rsi
- - - 0.66 - - - 0.01 0.33 - movl $1, %edx
- - 0.63 0.02 - - - 0.35 - - movq %rbx, %rcx
- - 0.36 0.01 - 0.33 1.00 0.65 0.98 0.67 callq jl_invoke@PLT
- - 0.37 - - - - - 0.63 - jmp .LBB0_4
There are two warnings:
warning: found a call in the input assembly sequence.
note: call instructions are not correctly modeled. Assume a latency of 100cy.
warning: found a return instruction in the input assembly sequence.
note: program counter updates are ignored.
Inspecting the LLVM IR, I think this refers to calls in which the integers are converted to floats:
L13: ; preds = %top
%16 = call i32 inttoptr (i64 23010959982824 to i32 ()*)()
; ┌ @ operators.jl:278 within `!=`
; │┌ @ promotion.jl:418 within `==` @ promotion.jl:468
%.not16 = icmp eq i32 %16, 0
but I do not know if this is the issue.
Regarding the return instruction in the input assembly sequence, I have absolutely no idea what it could mean.
All things said, I am still wondering: can anyone provide an example of code that actually scales nearly optimally with the number of cores? I think seeing one example where it really works might help immensely, since I could then slowly make changes to it and see at which points the scaling regressions occur.
There are just so many things that can affect multithreaded performance scaling. There's a heavy interplay between your code and your hardware, and some effects are purely hardware-dependent. Just off the top of my head:
- Modern processors might up- or down-clock based upon the workload. For example, some CPUs use a higher clock rate when there's only 1 active core, while others might down-clock if lots of cores are using particularly expensive instruction sets like AVX-512. This is more common on consumer-grade CPUs, but some server parts do it, too.
- Memory bandwidth and cache size effects
- False sharing… this can be particularly pernicious with two-socket configurations, but has big effects on performance (this one won't be just a few percent left on the floor); see the sketch after this list.
- Everything else your system is doing gets in the way more frequently as you use more cores, too.
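To illustrate the false-sharing point with a minimal sketch (my own toy example, unrelated to the benchmark above, and assuming 64-byte cache lines): if every thread repeatedly updates neighbouring elements of a shared array, the slots share cache lines and the cores keep invalidating each other; spacing the slots one cache line apart (or accumulating in a thread-local variable, as SolveProb! already does) removes the contention.
using Base.Threads

# A @noinline helper so the compiler cannot keep a[i] in a register across
# iterations; every call really loads from and stores to memory.
@noinline bump!(a, i) = (a[i] += 1.0; nothing)

# Contended: the threads update adjacent Float64 slots, which share cache lines.
function contended!(acc, n)
    @threads for t in 1:nthreads()
        for _ in 1:n
            bump!(acc, t)
        end
    end
end

# Padded: slots are spaced 8 Float64s (64 bytes) apart, one cache line each.
function padded!(acc, n)
    @threads for t in 1:nthreads()
        for _ in 1:n
            bump!(acc, 8t)
        end
    end
end

# first calls include compilation time; run twice for a cleaner comparison
acc1 = zeros(nthreads());     @time contended!(acc1, 10^7)
acc2 = zeros(8 * nthreads()); @time padded!(acc2, 10^7)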
Right, this is exactly the reason why I want to start from a scenario which is as ideal as it can be.
Maybe I should be explicit about the things I think I am covering by now with my example.
The “ideal” program I have come up with
- has a number of tasks which is a multiple of the number of cores, all of which should take the same time → no thread should have to wait much for the others to finish (see the timing sketch after this list).
- has a noticeable workload for each thread → the overhead of creating threads should be negligible.
- does not read a lot from memory → memory bandwidth and cache size effects should not be the limiting factor.
- does not write a lot to memory → false sharing should not be the limiting factor.
- runs on a professional HPC cluster using Slurm → the influence of any OS interference should be minimal. (I sincerely hope that they also do not do any shenanigans with the CPU clock speed, otherwise optimizing anything would be an utter nightmare.)
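As a quick check of the first point, here is a sketch I put together (not part of the benchmark above) that records how much time each thread spends inside the parallel loop, using a static schedule so that threadid() is stable for each chunk; a small spread between the slowest and fastest thread means nobody is waiting long at the end of the loop.
# Per-thread time spent inside the parallel loop (static schedule assumed).
function SolveProbTimed!(res, N1, tthread)
    Threads.@threads :static for i in eachindex(res)
        t0 = time_ns()
        x = 0.0
        for k in 1:N1
            kfl = float(k)
            for j in 1:500
                x += 1e-10 * kfl
            end
        end
        res[i] = x
        # accumulate the seconds this thread spent on its iterations
        tthread[Threads.threadid()] += (time_ns() - t0) / 1e9
    end
    return tthread
end

tthread = zeros(Threads.nthreads())
SolveProbTimed!(zeros(48), 100_000, tthread)
@show extrema(tthread)   # small spread => well-balanced load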
As I see it, all of these conditions should be fulfilled in my minimal working example. If any of them is not, I would be delighted to hear suggestions on how to better approximate the ideal case.
This should in principle mean that there is some other, even more nontrivial effect at play here; however, I am somewhat running out of ideas.
Sorry if I missed it earlier in the thread, but have you actually measured the clock speeds across cores throughout your benchmarks? It should be easy to confirm. I thought most modern CPUs would enable higher frequencies when fewer cores are utilized.
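For example, on Linux the current frequency of every core can be read from the cpufreq sysfs interface while the benchmark is running (a sketch; whether these files exist depends on how the compute node is configured):
# Current frequency of each online core in MHz, read from sysfs (values there are in kHz).
function core_frequencies()
    base = "/sys/devices/system/cpu"
    freqs = Dict{Int,Float64}()
    for dir in filter(d -> occursin(r"^cpu\d+$", d), readdir(base))
        f = joinpath(base, dir, "cpufreq", "scaling_cur_freq")
        isfile(f) || continue
        freqs[parse(Int, dir[4:end])] = parse(Int, readchomp(f)) / 1000
    end
    return freqs
end

core_frequencies()   # e.g. call this periodically from a second task or process during the benchmark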
Good that you made sure; it was not really mentioned before (at least by me).
Wow, it seems you have finally found my problem!
I have taken a closer look at LIKWID, and it indeed gives different clock speeds depending on the number of cores I assign to the calculation.
If I use 1 core, the CPU clock speed is at 3046.85 MHz, while for 48 cores it is only 2826.02 MHz. It looks like this accounts almost exactly for the speedup I am still missing: the “speedup” divided by the number of cores was 44.5/48 = 0.927, while the ratio between the clock speeds is 2826.02/3046.85 = 0.928…
Thank you so much. This is quite embarrassing, but I had no idea a cluster would change the clock speed of a job like that.
It seems like I have to change the way I am doing my scaling benchmark. Does anyone know if there is a way to fix the CPU frequency? If not, how do people usually do scaling tests of HPC code?
You might be able to turn off turbo, but in reality it's just a hard problem. Modern chips are typically power/thermal constrained, so doing more work will somewhat inevitably lower the clocks. One thing you can do is take measurements in terms of cycles, which will account for clock differences, but that will change your memory-to-CPU speed ratio, so it's not perfect.
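To illustrate the cycles idea with a small sketch (the two clock values are the ones quoted above; the runtimes are hypothetical placeholders, not measurements from this thread):
# Converting wall time into core cycles makes runs at different frequencies comparable.
cycles(runtime_s, clock_MHz) = runtime_s * clock_MHz * 1e6

t1,  f1  = 7.0,   3046.85   # hypothetical 1-core runtime [s], measured 1-core clock [MHz]
t48, f48 = 0.157, 2826.02   # hypothetical 48-core runtime [s], measured 48-core clock [MHz]

speedup_walltime = t1 / t48
speedup_cycles   = cycles(t1, f1) / cycles(t48, f48)   # frequency-corrected speedup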
You could also benchmark by giving the other cores garbage work (e.g. 1 core solving your problem, the rest factoring a large number); see the sketch below.
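For instance, something along these lines (a rough sketch with a trivial busy loop standing in for the garbage work; exact task placement is up to the scheduler, so it would be worth verifying which thread ends up running what):
using Base.Threads

# Serial version of the kernel (no internal @threads), to be timed on a single thread.
function solve_one(N1)
    x = 0.0
    for k in 1:N1
        kfl = float(k)
        for j in 1:500
            x += 1e-10 * kfl
        end
    end
    return x
end

# Garbage work that keeps a core busy until `stop` is set.
function busywork(stop::Threads.Atomic{Bool})
    s = 0.0
    while !stop[]
        for k in 1:10_000
            s += 1e-10 * k
        end
    end
    return s
end

stop = Threads.Atomic{Bool}(false)
junk = [Threads.@spawn busywork(stop) for _ in 1:nthreads()-1]   # occupy the other threads
sleep(0.5)                                                       # let the garbage work ramp up

t_loaded = @elapsed solve_one(100_000)   # one core solving the problem while the rest are loaded

stop[] = true
foreach(wait, junk)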
Hm, yeah, that might be possible, although I am not so sure how I can restrict how many threads are used for such an example. In particular, I think it would be good to find a way that does not require very significant code changes, as those might of course also affect the applicability of the benchmark in some other way.
Kind of amazing that the ratios are that exact! Also nice to know that the efficiency is actually so close to perfect.
It seems that on my cluster, this issue can be mitigated somewhat by using the Slurm option
#SBATCH --disable-turbomode
when submitting the job. This prevents the frequency from being boosted when the temperature is low enough, i.e. at small load on the node.
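To double-check from inside the job that turbo is really off (a sketch assuming a Linux node with either the intel_pstate driver or the generic cpufreq boost switch; the paths may differ on your cluster):
# no_turbo = 1 or boost = 0 means turbo/boost is disabled
for f in ("/sys/devices/system/cpu/intel_pstate/no_turbo",
          "/sys/devices/system/cpu/cpufreq/boost")
    isfile(f) && println(f, " = ", readchomp(f))
end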
The CPU frequency is still not completely stable, but the naive scaling is now nearly perfect:
1 => 1.0
2 => 1.9997469597122512
24 => 23.952572231631184
48 => 47.6049161324683
Correcting for the still slightly different CPU frequency ratios, I get as close to perfect scaling as is reasonable:
1 => 1.0
2 => 1.99889
24 => 23.8487
48 => 47.874
Thanks again to everyone for their help. I can finally get back to improving the scaling of my original code (which of course should now also be quite a bit better).
I think that README is outdated and it shouldn't be required anymore. Though admittedly, I always compile locally, since I switch versions quite often.
Oh THAT is a good one - I always forget about it!
There are ways to fix the turbo issues without sacrificing performance (e.g. see here for what I use locally), but that is definitely something you'll want to discuss with the admins of your cluster before using it. Setting the frequency governor to performance gives me a VERY stable peak frequency, at least, but you have to be really careful with the thermals of the system. Make sure that your setup can cool the CPU appropriately, to a temperature below the maximum safe one recommended by the manufacturer, before using this under sustained heavy load.
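For reference, the governor currently in use can be checked from user space (a sketch assuming the standard Linux cpufreq sysfs layout; actually changing it usually requires root, so it is something for the cluster admins):
# Governor of cpu0 as a representative core; "performance" keeps the core at
# its highest allowed frequency instead of scaling down.
readchomp("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")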
I'm surprised this was the problem, to be honest; I'd expect a professional cluster to know about and manage that setting explicitly already.
Now those numbers look much better
I believe I read somewhere that it should not be necessary from Julia 1.8 onwards. If that's the case, I will just wait until the admins update the version.
I'll talk to the admins about options to keep the frequency stable; I found that the benchmarking is still a bit erratic, especially after I included LoopVectorization as well.
It isn't required anymore for Julia >= 1.9 (see e.g. Using the Intel VTune Profiler with julia - #26 by vchuravy). I've updated the README appropriately.