I get this:
julia> [ t => first(res)/x for (t,x) in zip(threads, res) ]
8-element Vector{Pair{Int64, Float64}}:
1 => 1.0
2 => 1.8838128479919556
4 => 3.734344916444656
8 => 7.403696118830602
12 => 11.055179511586692
16 => 10.8725399884433
20 => 11.070044181091374
24 => 21.156950540061068
julia> res
8-element Vector{Float64}:
2.4967111179e9 # 2.497s
1.3253498725e9
6.685807481e8
3.372249587e8
2.258408482e8
2.296345767e8
2.255375929e8
1.180090256e8 # 118 ms
Since I only have 12 physical cores and there’s still a substantial speedup from SMT to be had with the virtual cores, my guess is that there’s quite a bit of waiting on the cores. You could investigate with Intel VTune, via IntelITT.jl.
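(To double-check the physical vs. virtual core split on your machine, Hwloc.jl can query the topology; a quick sketch, assuming the package is installed:)
using Hwloc
Hwloc.num_physical_cores()   # 12 on this machine, SMT siblings excluded
Sys.CPU_THREADS              # 24 logical cores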
My guess is that a lot of time in the loop is spent converting numbers from integers to other types, unnecessarily so, resulting in less-than-ideal register allocation and resource usage; achieving perfect scaling without first optimizing the single-threaded path is going to be very difficult. One problem is that e.g. the accumulation into x cannot be parallelized as written, since floating point addition is not associative: summing the terms in a different order generally gives a (slightly) different result, so the compiler is not allowed to split the sum across threads or SIMD lanes for you. That increases the portion of the algorithm that has to run serially.
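If you don’t need bit-identical results, you can opt into a different summation order yourself by giving each task a private accumulator and combining at the end. A minimal sketch (here f is just a stand-in for whatever one loop iteration computes):
# Chunked reduction: each chunk keeps a private partial sum, which changes
# the association order of the floating point additions.
function chunked_sum(f, n)
    nt = Threads.nthreads()
    partials = zeros(nt)
    Threads.@threads for chunk in 1:nt
        lo = div((chunk - 1) * n, nt) + 1
        hi = div(chunk * n, nt)
        acc = 0.0
        for i in lo:hi
            acc += f(i)          # still serial within a chunk
        end
        partials[chunk] = acc    # one write per chunk
    end
    return sum(partials)         # combine the partial sums
end
# e.g. chunked_sum(i -> exp(i / 10^6), 10^6)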
Also, there’s a call to exp in the middle of your hot loop:
[...]
movabs r14, offset exp <--------------------- load the address of exp
[...]
L624:
vmovsd qword ptr [rbp - 56], xmm0
; ││ @ example.jl:11 within `macro expansion`
; ││┌ @ int.jl:97 within `/`
; │││┌ @ float.jl:294 within `float`
; ││││┌ @ float.jl:268 within `AbstractFloat`
; │││││┌ @ float.jl:159 within `Float64`
vcvtsi2sd xmm0, xmm3, r13
mov qword ptr [rbp - 96], rbx
; │││└└└
; │││ @ int.jl:97 within `/` @ float.jl:411
vdivsd xmm0, xmm0, xmm1
; ││└
; ││┌ @ promotion.jl:411 within `*` @ float.jl:410
vmulsd xmm0, xmm0, xmm2
; ││└
call r14 ; <----------------------------------- call here
vmovsd xmm2, qword ptr [rbp - 56] # xmm2 = mem[0],zero
vmovsd xmm1, qword ptr [rbp - 184] # xmm1 = mem[0],zero
; ││ @ example.jl:12 within `macro expansion`
; ││┌ @ range.jl:891 within `iterate`
; │││┌ @ promotion.jl:499 within `==`
inc r13
dec r15
; ││└└
; ││ @ example.jl:11 within `macro expansion`
; ││┌ @ float.jl:408 within `+`
vaddsd xmm2, xmm2, xmm0
vmovsd qword ptr [rbp - 56], xmm2
vmovsd xmm2, qword ptr [rbp - 192] # xmm2 = mem[0],zero
vmovsd xmm0, qword ptr [rbp - 56] # xmm0 = mem[0],zero
; ││└
; ││ @ example.jl:12 within `macro expansion`
jne L624
That call is going to mess with register allocation too.
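(The listing above comes from the usual reflection macros, so you can do the same inspection on your own code; f and its arguments below are hypothetical:)
@code_native f(10^7, 1.0)                  # native assembly, with source annotations as shown above
@code_llvm debuginfo=:none f(10^7, 1.0)    # LLVM IR, often easier to read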
For example, just wrapping the inner loops in @fastmath (not recommended, and very likely still very far from optimal), while barely improving the scaling, drastically improves the runtime due to more opportunities for SIMD:
julia> [ t => first(res)/x for (t,x) in zip(threads, res) ]
8-element Vector{Pair{Int64, Float64}}:
1 => 1.0
2 => 1.9289688034533892
4 => 3.818534401501174
8 => 7.592153468203395
12 => 11.139237035786895
16 => 6.802131033243928
20 => 7.423319939879265
24 => 11.141187151999336
julia> res
8-element Vector{Float64}:
6.764364493e8 # 676 ms
3.506725708e8
1.77145569e8
8.90967829e7
6.07255638e7
9.94447837e7
9.11231706e7
6.07149346e7 # 60ms
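For reference, the change is literally just the macro in front of the loop. A hypothetical sketch of the loop shape (reconstructed from the assembly above, not your original code):
# @fastmath relaxes IEEE semantics, e.g. allowing the sum to be
# reassociated, which is what opens the door for SIMD here.
function kernel(n, c)
    x = 0.0
    @fastmath for i in 1:n
        x += exp(i / n * c)
    end
    return x
end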
So just from an unoptimized inner loop, you’re influencing the scaling behavior dramatically (the similar scaling at the beginning suggests to me that the default work-stealing behavior of Threads.@threads may also play a role here).
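(You can take work stealing out of the picture by pinning the schedule per loop, e.g.:)
# :static assigns each thread one fixed, contiguous chunk of the iteration
# range up front; no work stealing, at the cost of load balancing.
n = 100
acc = zeros(n)
Threads.@threads :static for i in 1:n
    acc[i] = sin(i)   # placeholder body
end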
Note that there is some inherent overhead from the scheduler in setting up the threading machinery (e.g. the default :dynamic scheduling for work stealing could be partially to blame). There may also be some false sharing going on when finally writing into the res array, as that is shared across benchmarks and not reallocated, so there may be caching oddities there. There may also be some overhead due to my machine doing other things in the background (like rendering this browser window, for example), so making sure your benchmark is the only application running is of paramount importance. I’ve seen regressions of more than 2x due to “idle” background programs interfering with caches, competing for CPU registers when the benchmarking thread is interrupted, etc.
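(If you suspect false sharing, one mitigation to try is padding the per-thread result slots onto separate cache lines; a hypothetical sketch, with work standing in for the real computation:)
# 64-byte cache lines hold 8 Float64s on typical x86, so spacing results
# 8 slots apart keeps each thread's store on its own line.
const PAD = 8
work(t) = sum(sin, 1:10_000 * t)   # placeholder workload
slots = zeros(Threads.nthreads() * PAD)
Threads.@threads :static for t in 1:Threads.nthreads()
    slots[(t - 1) * PAD + 1] = work(t)
end
results = slots[1:PAD:end]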
But as mentioned above, best to check with something like MCAnalyzer.jl, IntelITT.jl, or LIKWID.jl to see where the bottleneck in your specific code is.
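(With LIKWID.jl, for example, measuring hardware counters for a region looks roughly like this; this assumes a Linux machine with the LIKWID tool suite installed, and my_benchmark is a placeholder:)
using LIKWID
my_benchmark() = sum(sqrt, 1:10^6)   # placeholder workload
# Count events from the FLOPS_DP performance group for the wrapped call;
# available groups depend on the CPU.
metrics, events = @perfmon "FLOPS_DP" my_benchmark()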
My machine is just this:
julia> versioninfo()
Julia Version 1.10.0-DEV.263
Commit f0aed884a1* (2023-01-03 17:17 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 24 on 24 virtual cores
Environment:
JULIA_NUM_THREADS = 24