How to achieve perfect scaling with Threads (Julia 1.7.1)

Hi everyone,
I am currently investigating a scaling bottleneck in one of my codes which seems a bit elusive to me.
To get an MWE, I have tried to produce a somewhat representative example that models an ideal scenario, in which I would expect a perfect speedup with the number of threads, i.e. the code run with n cores should be nearly n times as fast as with one.
The speedup I get, though, leaves a lot to be desired in my opinion: with 48 cores I get “only” a 42x speedup.

  # cores => speedup
  1.0 => 1.0
  2.0 => 1.9172675482367358
  4.0 => 3.7349433023688423
  8.0 => 7.4664170157427385
 16.0 => 14.870344814024431
 24.0 => 21.51762804720754
 48.0 => 42.022108406703595

I have let this run on a machine with 48 physical cores (2 Sockets).
Maybe you can help me by answering one or more of the following questions:

  1. Is this perhaps as good as I can expect it to be? If so, why?
  2. Are there adjustments to improve the scaling (not the performance) of this code snippet?
  3. Could this be specific to the environment I am running my code on? Does anyone observe better results with the same code on a different architecture?
  4. Can you share a few ideas about what one could do to improve scaling and to identify issues more generally when it comes to scaling behaviour?

using BenchmarkTools
# using ThreadPinning
# pinthreads(:cores)

function SolveProb!(res, Arr)
    N1, N2, N3 = size(Arr)
    Threads.@threads for i in eachindex(res)
        x = 0.0
        for j in 1:100, k in 1:N1, l in 1:N2, m in 1:N3
            x += exp(-(k + l - m) / 100 * j)
        end
        res[i] = x
    end
end

NThreadsMax = 48

res = zeros(NThreadsMax)
Arr = rand(50,50,50)
BM = @benchmark SolveProb!($res,$Arr) samples = 5*Threads.nthreads() evals = 10
##
avg(x) = sum(x)/length(x)
println(BM)
@info "BM" BM times = BM.times Threads.nthreads() TotalCPUtime = avg(BM.times)*Threads.nthreads() average = avg(BM.times) min = minimum(BM.times)

I am looking forward to hearing your ideas!

2 Likes

42x is a pretty good speedup. You will probably get closer to the scaling you expect by increasing the work done by each thread (i.e. larger arrays). Memory bandwidth can be another bottleneck as well.
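For instance (numbers picked arbitrarily, untested sketch), something along these lines gives every task work on the order of seconds rather than tens of milliseconds:

# A bigger array and several loop iterations per thread amortise the
# scheduling/startup overhead better and smooth out load imbalance.
Arr = rand(100, 100, 100)               # 8x the elements of the 50^3 array
res = zeros(4 * Threads.nthreads())     # a few iterations per thread instead of one
@time SolveProb!(res, Arr)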

I would recommend trying to speed up the inner loop with something like LoopVectorization.jl to make sure the performance is really good before parallelization.

4 Likes

Thank you for your answer,
I did try to play around with the workload without too much improvement. In the example above each thread spends about 80 ms on its task so I think this might not be the reason for the suboptimal scaling.
While my original code makes use of LoopVectorization, I have removed it in this example, since it should not impact the scaling behaviour (and it would possibly make things more complicated). Making the inner loop's single-threaded performance faster is not my current goal.
Or is there another reason why using LoopVectorization would improve the scaling?

I don’t suspect it will help too much with the scaling, but if you are after better performance by using more cores, you will likely have more success with optimising the inner loop, rather than focusing entirely on improving scaling.

1 Like

Well my motivation for trying out this simplistic example is to find out why my more complicated code is not scaling well. My actual code makes use of the usual tricks to speed up the inner loop so this is not what I am looking for.
Essentially I am looking to find out if memory bandwidth problems or other issues were making my code scale suboptimally, so my approach was to try and construct an example that does not suffer from any of these issues.
I do not quite understand why 42x for 48 cores seems to be my limit…

1 Like

I get this:

julia> [ t => first(res)/x for (t,x) in zip(threads, res) ]
8-element Vector{Pair{Int64, Float64}}:
  1 => 1.0
  2 => 1.8838128479919556
  4 => 3.734344916444656
  8 => 7.403696118830602
 12 => 11.055179511586692
 16 => 10.8725399884433
 20 => 11.070044181091374
 24 => 21.156950540061068

julia> res
8-element Vector{Float64}:
 2.4967111179e9 # 2.497s
 1.3253498725e9
 6.685807481e8
 3.372249587e8
 2.258408482e8
 2.296345767e8
 2.255375929e8
 1.180090256e8 # 118 ms

Since I only have 12 physical cores and there’s still a substantial speedup from SMT to be had with the virtual cores, my guess is that there’s quite a bit of waiting on the cores. You could investigate with Intel VTune, via IntelITT.jl.

My guess is that a lot of time in the loop is spent converting numbers from integers to other types, unnecessarily so, resulting in less-than-ideal register allocation and resource usage - achieving perfect scaling without first optimizing the single-threaded path is going to be very difficult. One problem is that e.g. the accumulation into x cannot be parallelized, since floating point addition is not associative, increasing the portion of the algorithm that has to run serially.

Also, there’s a call to exp in the middle of your hot loop:

[...]
	movabs	r14, offset exp  <--------------------- load the address of exp
[...]
L624:
	vmovsd	qword ptr [rbp - 56], xmm0
; ││ @ example.jl:11 within `macro expansion`
; ││┌ @ int.jl:97 within `/`
; │││┌ @ float.jl:294 within `float`
; ││││┌ @ float.jl:268 within `AbstractFloat`
; │││││┌ @ float.jl:159 within `Float64`
	vcvtsi2sd	xmm0, xmm3, r13
	mov	qword ptr [rbp - 96], rbx
; │││└└└
; │││ @ int.jl:97 within `/` @ float.jl:411
	vdivsd	xmm0, xmm0, xmm1
; ││└
; ││┌ @ promotion.jl:411 within `*` @ float.jl:410
	vmulsd	xmm0, xmm0, xmm2
; ││└
	call	r14   ; <----------------------------------- call here
	vmovsd	xmm2, qword ptr [rbp - 56]      # xmm2 = mem[0],zero
	vmovsd	xmm1, qword ptr [rbp - 184]     # xmm1 = mem[0],zero
; ││ @ example.jl:12 within `macro expansion`
; ││┌ @ range.jl:891 within `iterate`
; │││┌ @ promotion.jl:499 within `==`
	inc	r13
	dec	r15
; ││└└
; ││ @ example.jl:11 within `macro expansion`
; ││┌ @ float.jl:408 within `+`
	vaddsd	xmm2, xmm2, xmm0
	vmovsd	qword ptr [rbp - 56], xmm2
	vmovsd	xmm2, qword ptr [rbp - 192]     # xmm2 = mem[0],zero
	vmovsd	xmm0, qword ptr [rbp - 56]      # xmm0 = mem[0],zero
; ││└
; ││ @ example.jl:12 within `macro expansion`
	jne	L624

Which is going to mess with register allocation too.

For example, just wrapping the inner loops in @fastmath (not recommended and very likely still very far from optimal), while barely improving the scaling, drastically improves the runtime due to more opportunities for SIMD:
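Concretely, the wrapped variant is essentially this (SolveProbFast! is just an ad-hoc name):

function SolveProbFast!(res, Arr)
    N1, N2, N3 = size(Arr)
    Threads.@threads for i in eachindex(res)
        x = 0.0
        @fastmath for j in 1:100, k in 1:N1, l in 1:N2, m in 1:N3
            x += exp(-(k + l - m) / 100 * j)
        end
        res[i] = x
    end
end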

julia> [ t => first(res)/x for (t,x) in zip(threads, res) ]
8-element Vector{Pair{Int64, Float64}}:
  1 => 1.0
  2 => 1.9289688034533892
  4 => 3.818534401501174
  8 => 7.592153468203395
 12 => 11.139237035786895
 16 => 6.802131033243928
 20 => 7.423319939879265
 24 => 11.141187151999336

julia> res
8-element Vector{Float64}:
 6.764364493e8 # 676 ms
 3.506725708e8
 1.77145569e8
 8.90967829e7
 6.07255638e7
 9.94447837e7
 9.11231706e7
 6.07149346e7 # 60ms

So just from an unoptimized inner loop, you’re influencing the scaling behavior dramatically (the similar scaling in the beginning suggests to me that it’s the default work-stealing behavior of Threads.@threads that may also play a role here).

Note that there is some inherent overhead from the scheduler in setting up the threading machinery (e.g. the default :dynamic scheduling for work stealing could be partially to blame) - there may also be some false sharing going on when finally writing into the res array, as that is shared across benchmark samples and not reallocated, so there may be caching oddities there. There may also be some overhead from my machine doing other things in the background (like rendering this browser window, for example), so making sure you are the only application running is of paramount importance. I’ve seen regressions of more than 2x due to “idle” background programs causing interference with caches, competing for CPU registers when the benchmarking thread is interrupted, etc…

But as mentioned above, best to check with something like MCAnalyzer.jl, IntelITT.jl or LIKWID.jl to see where the bottleneck in your specific code is.

My machine is just this:

julia> versioninfo()
Julia Version 1.10.0-DEV.263
Commit f0aed884a1* (2023-01-03 17:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24
8 Likes

Thank you for all these wonderful ideas, I will try experimenting with the profilers you mentioned.

Ah, I was not aware that this might be a problem for scaling… I was actually just looking for a somewhat expensive function to call that creates some artificial workload, so that the loop does not get removed by the compiler… Maybe I will find a suitable replacement. I suppose it is best to only include elementary additions and multiplications in the hot loop? This would also correspond more closely to my realistic use case.
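Maybe something along these lines would do as a replacement workload (just a rough sketch, SolveProbMul! is a made-up name): only additions and multiplications in the hot loop, and the result still escapes through res, so the loop should not be optimized away:

function SolveProbMul!(res, Arr)
    N1, N2, N3 = size(Arr)
    Threads.@threads for i in eachindex(res)
        x = 0.0
        for j in 1:100, k in 1:N1, l in 1:N2, m in 1:N3
            x = muladd(1.0000001, x, (k + l - m) * 1e-3 * j)   # only mul/add, no library calls
        end
        res[i] = x
    end
end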

I think the scheduler should be static, since the dynamic scheduler was only introduced in Julia 1.8 afaik.
The documentation for version 1.7.1 says:

The schedule argument can be used to request a particular scheduling policy. The only currently supported value is :static, which creates one task per thread and divides the
iterations equally among them. Specifying :static is an error if used from inside another @threads loop or from a thread other than 1.

False sharing is also a good point: even though the res array is not accessed in the hot loop, I’ll try to use a padded array just in case; that should get rid of the problem if there is one…
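For the padding I have something like this in mind (just a sketch; the idea is that each result then sits on its own 64-byte cache line, assuming the array data is cache-line aligned):

function SolveProbPadded!(res, Arr)
    N1, N2, N3 = size(Arr)
    pad = 8                              # 8 Float64s = 64 bytes = one cache line
    tmp = zeros(pad, length(res))        # column i holds the result for index i
    Threads.@threads for i in eachindex(res)
        x = 0.0
        for j in 1:100, k in 1:N1, l in 1:N2, m in 1:N3
            x += exp(-(k + l - m) / 100 * j)
        end
        tmp[1, i] = x                    # writes are now 64 bytes apart
    end
    res .= view(tmp, 1, :)
    return res
end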

I hope using the tools you mentioned leads me closer to the culprit.

Since you’re seeding the loop with an input of dynamic size (size(Arr)) and writing the result to something that survives both the loop and the call (res[i] = x), I’d expect this not to be folded by the compiler (though a sufficiently smart one still might; that is really unlikely here due to the floating point shenanigans as well as the type conversions).

Yes - I didn’t know your version, so I assumed you were on the latest stable release (1.8.4).

2 Likes

The accumulation is already within the parallel block and thus thread-local (still task-local with dynamic scheduling), so I don’t understand what you mean here. If you’re talking about lower level parallelism (SIMD) that shouldn’t be relevant at all for scaling (of the given code): for scaling it doesn’t matter if each thread is summing up numbers with scalar or vector instructions. (There’s also no memory access here that could, potentially, profit from vector loads or stores)

Otherwise, lots of good comments. I was going to mention static scheduling and false sharing as well.

1 Like

@Salmon, which CPU type is this? How many memory domains (NUMA) does it have?

I disagree - code that is able to SIMD efficiently is much more likely to have favourable branch prediction patterns (or no branches at all, ideally). That could very well influence scaling, depending on the CPU and how its predictor copes with different branching patterns.

Additionally, keeping arguments in the same kinds of registers by not having type conversions means fewer opportunities for slow one-by-one conversion. So in this case, I think it can be beneficial for scaling as well, as such a type conversion may be part of the serial portion (on each thread, yes, but nonetheless serial, influencing the ratio of serial to parallel execution).

Keeping things in the same kinds of registers & using the same instructions will also help keep the ops cache hot (image for Zen4), which may influence scaling as well, depending on how large the cache is, how often it needs to be evicted, etc., as well as requiring less cross-communication between INT and FP hardware (I sadly couldn’t find a die shot of Zen4 :frowning:).

Though in the end, my rudimentary test here with throwing @fastmath on the accumulation is obviously insufficient to show any difference due to that - the majority of the remaining scaling is probably from somewhere else, you’re right about that :slight_smile:

I wouldn’t expect NUMA to make a difference here, since nothing in the computation is hitting main memory either way :thinking: This workload is mostly happening in registers already (though the actual usecase that led to this thread may not be :person_shrugging:)

1 Like

Thanks for the elaboration, but I’m still not sure I get your point. What is the shared resource (shared by cores) that could lead to a reduction in scaling (for the given code)? AFAIR, on most architectures there is one branch predictor per core (am I wrong here?). Assuming that this is true, and we aren’t using SMT, I fail to see how this influences the overall scaling as a function of the number of cores. To me, it still seems that your arguments just try to increase the performance of the “serial portion”, that is, the portion that does not parallelise across cores (it may exploit instruction-level parallelism within a core, of course).

I guess this statement is where I don’t understand you:

Again assuming that threads are pinned to different cores (i.e. no SMT), whatever is happening on one thread, and hence on one core, doesn’t influence what happens on a different core and thus doesn’t matter for the scaling as a function of the number of cores. This is unless there is a shared resource in use, of course, which, again, I fail to see in your elaboration (but it’s also rather early in the morning here :smiley:).

Anyways, just to be clear, I definitely agree with the general approach to first optimise the “serial code” and then parallelise it across cores.

You’re absolutely right, I didn’t look at the code carefully enough.

The CPU is the following:
2× Intel Xeon Platinum 8168 CPU, 2× 24 cores, 2.7 GHz
So far I could not find the number of NUMA domains for this, but I’ll have a look and post it here.
It should not matter for this code, but it definitely might be useful for me along the way to optimizing my real code, where I am currently still unable to make improvements due to the large number of potential problems and knobs to turn…
Thanks for your contributions so far!

Well, there must be some shared resource that allows more threads to work better, otherwise I wouldn’t see a speedup with 24 threads on my 12-physical-core machine in the naive code (see the speedup numbers I posted above - I do have SMT enabled) :slight_smile: They still don’t match the @fastmath version with the same number of threads performance-wise, but they do allow for continued scaling beyond the point where it makes physical sense (which to me suggested an unoptimized serial path).

You are correct though that none of the things I mentioned are (or should be) actually shared across cores, but something changed with respect to shared resources when moving to @fastmath, since there the 24-thread version doesn’t beat the 12-thread version anymore (as it should be on 12 physical cores!). My guess is that it’s due to the call introducing some shared state, but pinning that down is a harder nut to crack, I think - especially since the scaling up to 12 threads is the same between the naive and the @fastmath version, which suggests the underlying issue lies elsewhere.

If you have terminal access, you can just do lscpu | grep NUMA and it should print the NUMA domains.

Since you already seem to have ThreadPinning.jl installed:

using ThreadPinning
nnuma()
nsockets()
ncores()
ncputhreads()
# etc...

:wink:

Alright, great, so my understanding was correct for the case where we disregard SMT, i.e. consider only one thread per physical core (which is what the OP was interested in, at least I thought so). And you’re concerned about the case nthreads > nphysicalcores (with SMT). SMT, of course, makes all the difference in the world for my elaboration above because now two threads are running within the same core and thus are very likely to share some resources in the core. Understanding this part, in particular the @fastmath results that you’re seeing, is indeed less trivial (and interesting :slight_smile:).

2 Likes

Yes exactly - the scaling is the same :slight_smile: Since the scaling is usually desired to achieve better performance though, I just thought it was interesting to see what actually optimizing the serial code can still achieve (the naive version still can’t beat the serially-optimized @fastmath version after all) - and that at that point, oversubscribing doesn’t help at all!

Regardless, me showing that the scaling is not optimal doesn’t really say anything at all, since I had things running in the background that could have influenced things, depending on how the kernel decided to schedule things.

Thanks for the hint. I didn’t have access to the machine before, but here it is.
A few more details, since they might be important:
I am running this on a cluster, where I submit the code above with a varying number of cores in the Slurm batch script. OS background processes should not cause an issue afaik.

# --cpus-per-task=48
┌ Info: CPU Info
│   nnuma() = 2
│   nsockets() = 2
│   ncores() = 48
└   ncputhreads() = 96

# --cpus-per-task=1
┌ Info: CPU Info
│   nnuma() = 2
│   nsockets() = 2
│   ncores() = 48
└   ncputhreads() = 96

Actually, seeing this I am slightly puzzled that it tells me ncputhreads() = 96. I did start Julia with nthreads() = 48 though. Can it be that the existence of virtual threads that I never use somehow causes issues? I don’t really see why it would, but it seems noteworthy.

Here is the script I used for benchmarking threads:

using ProgressMeter
bench_threads(prog::String, nthread::Int = Threads.nthreads()) = bench_threads(prog, 1:nthread)
function bench_threads(prog::String, threads::AbstractVector{Int})
  script = prog * "\n" * """
  using BenchmarkTools
  t = @belapsed f()
  print(t)
  """
  let cmd = `$(Base.julia_cmd()) --startup=no --project=$(Base.active_project()) -e $(script)`
    @showprogress map(threads) do nt
      io = IOBuffer()
      run(`$(cmd) -t$(nt)`, stdin, io)
      parse(Float64, String(take!(io)))
    end
  end
end

Pass in a string that defines f, and it’ll benchmark f.

The program string you pass in must not print anything to stdout, since the timing is read back from there.

I should probably make it use Distributed instead. Then I could pass in a function and transfer the result more directly than via stdout.
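Roughly what I have in mind (untested sketch; bench_threads_distributed is a made-up name, and the expression you pass must define f()):

using Distributed

function bench_threads_distributed(fdef::Expr, threads::AbstractVector{Int})
  map(threads) do nt
    pid = only(addprocs(1; exeflags = "-t$(nt)"))   # one worker with nt Julia threads
    try
      remotecall_fetch(Core.eval, pid, Main, :(using BenchmarkTools))
      remotecall_fetch(Core.eval, pid, Main, fdef)  # defines f() on the worker
      remotecall_fetch(Core.eval, pid, Main, :(@belapsed f()))
    finally
      rmprocs(pid)
    end
  end
end

# e.g. bench_threads_distributed(:(f() = sum(abs2, rand(10^6))), [1, 2, 4])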

Rest of the script
threadcounts = [1, 2, 4, 8, 16, 18];
prog1 = """
function SolveProb!(res,Arr)
  N1,N2,N3 = size(Arr)
  Threads.@threads for i in eachindex(res)
    x = 0.0
    for j in 1:100,k in 1:N1,l in 1:N2, m in 1:N3
      x += exp(-(k+l-m)/100*j)
    end
    res[i] = x
  end
end
const res = zeros($(lcm(threadcounts)));
const Arr = rand(50,50,50);
f() = SolveProb!(res,Arr)
"""
prog2 = """
function SolveProb!(res,Arr)
  N1,N2,N3 = size(Arr)
  Threads.@threads for i in eachindex(res)
    x = 0.0
      @fastmath for j in 1:100,k in 1:N1,l in 1:N2, m in 1:N3
        x += exp(-(k+l-m)/100*j)
      end
    res[i] = x
  end
end
const res = zeros($(lcm(threadcounts)));
const Arr = rand(50,50,50);
f() = SolveProb!(res,Arr)
  """
prog3 = """
using Polyester
function SolveProb!(res,Arr)
  N1,N2,N3 = size(Arr)
  @batch for i in eachindex(res)
    x = 0.0
      @fastmath for j in 1:100,k in 1:N1,l in 1:N2, m in 1:N3
        x += exp(-(k+l-m)/100*j)
      end
    res[i] = x
  end
end
const res = zeros($(lcm(threadcounts)));
const Arr = rand(50,50,50);
f() = SolveProb!(res,Arr)
"""
prog4 = """
using LoopVectorization, Polyester
function SolveProb!(res,Arr)
  N1,N2,N3 = size(Arr)
  @batch for i in eachindex(res)
    x = 0.0
    @turbo for j in 1:100,k in 1:N1,l in 1:N2, m in 1:N3
      x += exp(-(k+l-m)/100*j)
    end
    res[i] = x
  end
end
const res = zeros($(lcm(threadcounts)));
const Arr = rand(50,50,50);
f() = SolveProb!(res,Arr)
"""

times1 = bench_threads(prog1, threadcounts);
times2 = bench_threads(prog2, threadcounts);
times3 = bench_threads(prog3, threadcounts);
times4 = bench_threads(prog4, threadcounts);

printtimes(threadcounts, times1, times2, times3, times4)

using DataFrames
DataFrame(
  Threads = threadcounts,
  var"@threads" = first(times1) ./ times1,
  var"@threads @fastmath " = first(times2) ./ times2,
  var"@batch @fastmath " = first(times3) ./ times3,
  var"@batch @turbo " = first(times4) ./ times4,
)
DataFrame(
  Threads = threadcounts,
  var"@threads" = fill(1, length(times1)),
  var"@threads @fastmath " = times1 ./ times2,
  var"@batch @fastmath " = times1 ./ times3,
  var"@batch @turbo " = times1 ./ times4,
)

Results, speed up vs 1 thread:

 Row │ Threads  @threads  @threads @fastmath   @batch @fastmath   @batch @turbo  
     │ Int64    Float64   Float64              Float64            Float64        
─────┼───────────────────────────────────────────────────────────────────────────
   1 │       1   1.0                  1.0                1.0             1.0
   2 │       2   1.99934              1.99874            1.63739         1.98558
   3 │       4   4.00105              3.33114            3.27491         3.94429
   4 │       8   8.00183              6.37122            6.54904         7.84492
   5 │      16  16.0005              12.7387            13.0951         15.6891
   6 │      18  17.9939              14.324             14.7318         17.7115

Speed up vs @threads:

6×5 DataFrame
 Row │ Threads  @threads  @threads @fastmath   @batch @fastmath   @batch @turbo  
     │ Int64    Int64     Float64              Float64            Float64        
─────┼───────────────────────────────────────────────────────────────────────────
   1 │       1         1              2.70788            2.74307         22.3023
   2 │       2         1              2.70706            2.24648         22.1489
   3 │       4         1              2.25449            2.24524         21.986
   4 │       8         1              2.15606            2.24504         21.865
   5 │      16         1              2.15586            2.24497         21.8683
   6 │      18         1              2.1556             2.24578         21.9523
5 Likes

All of the ThreadPinning commands above give you static information about the system itself and nothing about how you started Julia. ncputhreads specifically tells you how many “hardware threads” (also referred to as “virtual cores”) your system has. Again, it doesn’t tell you anything about which of those are actually used (to get this information, use getcpuids() or, even better, threadinfo()).
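For example, in the same Julia session that Slurm started for you:

using ThreadPinning
threadinfo()   # overview of which hardware threads the Julia threads currently sit on
getcpuids()    # the raw CPU IDs, one per Julia thread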

Yes. If SMT (“virtual threads”) is enabled in the OS, it is entirely possible that some of Julia’s threads share a physical core. It depends on the OS and Slurm configuration and such. You could experiment with taskset when you start Julia to make sure the 48 virtual cores Julia sees are on 48 different physical cores.
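Alternatively, since ThreadPinning is already in your script (just commented out), you can pin from within Julia instead of relying on taskset; a minimal sketch:

using ThreadPinning
pinthreads(:cores)   # one Julia thread per physical core, before any SMT siblings are used
threadinfo()         # verify the resulting placement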