Scaling of @threads for "embarrassingly parallel" problem

I have a “trivially parallel” (an “embarrassingly parallel”) problem, where I’m running the following function:

function map_fidelity(potential_depth_values, separation_time_values; kwargs...)
    N = length(potential_depth_values)
    M = length(separation_time_values)
    F = zeros(N, M)
    @inbounds Threads.@threads for j = 1:M
        @inbounds for i = 1:N
            t_r = separation_time_values[j]
            V0 = potential_depth_values[i]
            F[i, j] = propagate_splitting(t_r, V0; kwargs...)
        end
    end
    return F
end

The numbers of values N and M are 22 and 13, and each call to propagate_splitting takes on the order of a second (between 0.5 and 3 seconds, depending on the values). I was expecting the runtime to scale almost perfectly linearly with the number of threads, since each outer loop body has a relatively long runtime. However, I’m measuring the following runtimes with JULIA_EXCLUSIVE=1 julia --project=. -t n benchmark.jl:

  • 1 Thread: 267 sec
  • 2 Threads: 200 sec
  • 4 Threads: 170 sec
  • 8 Threads: 152 sec

Each runtime is ± a couple of seconds. I can check with tmux that the process runs at n * 100% for n threads. There is no I/O in propagate_splitting or anything like that. The runtime with 1 thread is pretty close to the runtime for a version where I completely remove Threads.@threads.

Is there anything obvious I’m missing here that would explain why the scaling is so terrible? Any way I can get this to run in 35 sec or so with 8 threads?


We need to know what propagate_splitting is actually doing.


What about if you replace propagate_splitting with a sleep to simulate work? Any difference?

Perhaps it’s not a problem here, but in general don’t put @inbounds in front of @threads: Warn `@inbounds @threads for` pattern etc. during macro expansion by tkf · Pull Request #41893 · JuliaLang/julia · GitHub


My first instinct here is to modify your code as follows. The most important part is assigning t_r outside of the inner loop.

function map_fidelity2(potential_depth_values, separation_time_values; kwargs...)
    N = length(potential_depth_values)
    M = length(separation_time_values)
    F = zeros(N, M)
    Threads.@threads for j in eachindex(separation_time_values)
        @inbounds t_r = separation_time_values[j]
        @inbounds for i in eachindex(potential_depth_values)
            V0 = potential_depth_values[i]
            F[i, j] = propagate_splitting(t_r, V0; kwargs...)
        end
    end
    return F
end

There could be some false sharing issues here.

To address false sharing, we might try allocating an independent vector per task, and then assemble the matrix at the end.

function map_fidelity3(potential_depth_values, separation_time_values; kwargs...)
    N = length(potential_depth_values)
    M = length(separation_time_values)
    F = Vector{Vector{Float64}}(undef, M)
    Threads.@threads for j in eachindex(separation_time_values)
        @inbounds t_r = separation_time_values[j]
        F_j = zeros(N)
        @inbounds for i in eachindex(potential_depth_values)
            V0 = potential_depth_values[i]
            F_j[i] = propagate_splitting(t_r, V0; kwargs...)
        end
        F[j] = F_j
    end
    F = reduce(hcat, F)
    return F
end

I’m not an expert at addressing false sharing issues, so perhaps there is a better way.

Let’s try to setup a minimum working example (MWE) so others can actually try to execute something. As others have said, we probably need to know more about propagate_splitting. For now, I’ll just substitute sleep in:

julia> begin
           const V0 = 1.5
           const N, M = 22, 13
           propagate_splitting(t_r, V0) = (sleep(1); t_r + V0)
           
           potential_depth_values = rand(N)
           separation_time_values = rand(M)
       end;

Given how slow the innermost loop is, I doubt it’s a false sharing issue. Also, the OP was already using the correct loop ordering – different threads are accessing different columns of the matrix, so unless the matrix is very small (column length < 8?), that further reduces the likelihood of false sharing.


Just to get some more information, what processor are you using?

Threads.@threads is applied only to the M loop, which is 13 iterations.
One thread doing 6 and another doing 7 should still get much better speedup than you’re seeing.

But, once you’re using 8 threads (or more!), it’d make sense IMO to @threads :dynamic both the inner and outer loop.
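
Something along these lines, collapsing both loops into a single threaded loop over all (i, j) pairs (a sketch only; map_fidelity_flat and its internals are illustrative, not from the original code):

function map_fidelity_flat(potential_depth_values, separation_time_values; kwargs...)
    N = length(potential_depth_values)
    M = length(separation_time_values)
    F = zeros(N, M)
    inds = CartesianIndices(F)
    # thread over all N*M (i, j) pairs instead of only the M columns
    Threads.@threads :dynamic for k in 1:(N * M)
        i, j = Tuple(inds[k])
        F[i, j] = propagate_splitting(separation_time_values[j], potential_depth_values[i]; kwargs...)
    end
    return F
end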

Of course, you do note that you saw 100% utilization.

Without more information, or better yet a minimal example, it’s hard for us to do more than speculate in the dark.


Thanks, everyone, for your suggestions!

@jling:

We need to know what propagate_splitting is actually doing.

Well, not exactly trivial, but also not something crazy: it initializes a model for a quantum system and a complex vector of size 1024, and then does a whole lot of FFT/iFFT, vector-vector multiplication (or rather Diagonal-matrix-vector multiplication), and elementwise exponentiation.

I put the full code for the benchmark here: GitHub - goerz-testing/2023-01_rotating_tai_benchmark: Benchmarking of parallelization for rotating TAI

It’s “minimal”, but still somewhat complicated. You should be able to run it by just instantiating the project, as described in the README. Maybe I can do a self-contained single file that does something bogus but structurally similar to the original code.
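
For example, something roughly like this would have the same structure (purely illustrative; not the actual model, and the step count and random phases are made up):

using FFTW

function fake_propagate_splitting(t_r, V0; steps=5_000)
    Ψ = ones(ComplexF64, 1024) ./ sqrt(1024)   # stand-in for the initial state
    Uk = exp.(-1im * t_r .* rand(1024))        # stand-in "kinetic" phase factors
    Uv = exp.(-1im * V0 .* rand(1024))         # stand-in "potential" phase factors
    for _ in 1:steps
        Ψ = Uv .* ifft(Uk .* fft(Ψ))           # each step allocates several temporaries
    end
    return abs2(sum(Ψ))
end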

@Elrod:

Threads.@threads is applied only to the M loop, which is 13 iterations.
One thread doing 6 and another doing 7 should still get much better speedup than you’re seeing.

Yeah, but I’m generally OK with that. Actually, a potentially larger issue is that the different calls to propagate_splitting have significantly different runtimes, and I think potentially all the “short” runtimes end up in one thread and all the “long” runtimes in another. To avoid both of these issues, I tried setting N=M=16, and I’m also using a constant value for the input arrays, so that every call actually does the exact same thing. This is “benchmark 2”. It shows even worse scaling:

:> JULIA_EXCLUSIVE=1 julia --project=. -t 1 benchmark2.jl
153.666410 seconds (290.36 M allocations: 410.962 GiB, 5.05% gc time, 0.04% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 2 benchmark2.jl
 98.435518 seconds (309.35 M allocations: 411.444 GiB, 6.31% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 4 benchmark2.jl
 99.298586 seconds (333.82 M allocations: 412.088 GiB, 5.35% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 8 benchmark2.jl
121.498563 seconds (374.74 M allocations: 413.286 GiB, 3.96% gc time, 0.07% compilation time)

It gets slower again with more than 2 threads!

@PetrKryslUCSD:

What about if you replace propagate_splitting with a sleep to simulate work? Any difference?

Yeah, I tried that as “benchmark 5” in the test repo, and the scaling is perfect in that case.

So yes, the “problem” is what’s happening inside propagate_splitting, but my question is: why should it matter? It’s a self-contained black-box function, accessing non-overlapping data on each thread. It should all be completely independent.

The only thing I could imagine is something like saturating the memory bandwidth… but that seems like very bad Julia behavior for this kind of code. What I mean is: I’ve never had a problem like that doing similar kinds of numerics by “trivially” parallelizing with OpenMP in Fortran.

I mean, maybe I need to go in and optimize allocations. They seem a bit high in the benchmark, but not outrageous. Although the point of the question here isn’t necessarily to optimize the propagation code, at this point. This is kinda throw-away “daily research” code, and I’m just trying to reduce the time I have to wait for some calculations by a few hours by utilizing the cores in my workstation. I do wish Julia had manual memory management like Fortran, though!

@giordano:

Perhaps it’s not a problem here, but in general don’t put @inbounds in front of @threads

The @inbounds actually don’t make any difference at all. I didn’t have any @inbounds initially and then added them later just to make sure. And then I left them in under the assumption that otherwise someone would say “Try adding @inbounds” :wink:

@mkitti:

The most important part is assigning t_r outside of the inner loop

There could be some false sharing

Worth a try, but didn’t make a difference (results are in the test repo).

@jmair:

Just to get some more information, what processor are you using?

An Intel Core-i9 with 8 cores (16 hyperthreads) running Linux. See the “System Information” in the example repo for more details.

This is not what we call a “trivially parallel problem”.

Part of the issue here is that some of these downstream calls may themselves already be using parallel algorithms. For example, matrix-matrix multiplication already uses multiple threads by default. You can check how many threads OpenBLAS is set to use:

julia> LinearAlgebra.BLAS.get_num_threads()
4

Now if you set the environment variable OPENBLAS_NUM_THREADS=1 or call LinearAlgebra.BLAS.set_num_threads(1), you might see closer-to-linear scaling.

@carstenbauer has an explanation of this here:


Sorry, “embarrassingly parallel” is the technical term I was looking for. What I mean is: I could do this by running multiple programs in parallel, writing their output to disk (I might test that out, even though it adds the overhead of having to organize everything myself). There is no communication between the different threads.

And yeah, Julia’s (and Python’s, for that matter) default of allowing multiple threads in BLAS etc. is one of the worst decisions, ever, IMO. I can’t express how much it drives me up the wall, and I really hope it changes in Julia 2.0. Code should never run multithreaded unless I ask for it! Alas, I have

export NUMEXPR_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

in my .bashrc, and I do keep an eye on htop during these benchmarks to check that I’m using as many cores as I’m requesting at 100% utilization, and that there are no weird unexpected sub-threads.

Without it, how are you gonna multiply matrices together in an efficient way (syntax-wise)?

can you set LinearAlgebra.BLAS.set_num_threads(1) explicitly?

And also, it sounds like your problem might be memory-bound, not (entirely) CPU-bound, which could explain why adding threads doesn’t speed things up linearly.

default of allowing multiple threads in BLAS etc. is one of the worst decisions, ever,

Without it, how are you gonna multiply matrices together in an efficient way (syntax-wise)?

I’m not saying multi-threaded BLAS shouldn’t exist. If there’s nothing better you can do, and you’re really only running one computation at a time, you might as well use the extra parallelization. Personally, I never want it: I can always parallelize at a much higher level, and (at least in Fortran), I’ve never had much trouble getting nearly linear speedups (if I use 8 cores, it’s 8 times faster).

Automatic threading should just never be a default: more than once, I’ve accidentally oversubscribed a cluster node because I didn’t set some (undocumented!) environment variable, and BLAS or something else decided to run multiple threads on top of my program, which was already carefully crafted to exactly utilize the full cluster node. I need to have control over exactly how many processes/threads are running at all times!

can you set LinearAlgebra.BLAS.set_num_threads(1) explicitly?

Sure! I tried it, it has no effect.

And also, it sounds like your problem might be memory-bound, not (entirely) CPU-bound, which could explain why adding threads doesn’t speed things up linearly

I guess…? That’s the only thing that really comes to mind…

I did add one more benchmark where I’m parallelizing across multiple processes (with @distributed). Those results are much more promising:

  • 1 Process: 160 s
  • 2 Processes: 92 s
  • 4 Processes: 57 s
  • 8 Processes: 48 s
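
Structurally, that multi-process variant is just a pmap over the parameter grid; a minimal sketch of that shape (the worker count and the included file name are assumptions, and the repo’s actual benchmark uses @distributed rather than pmap):

using Distributed
addprocs(8)                           # assumption: one worker per physical core
@everywhere include("propagate.jl")   # hypothetical file defining propagate_splitting

function map_fidelity_pmap(potential_depth_values, separation_time_values; kwargs...)
    N = length(potential_depth_values)
    M = length(separation_time_values)
    # flatten the (t_r, V0) grid in column-major order; pmap preserves input order
    pairs = vec([(separation_time_values[j], potential_depth_values[i]) for i in 1:N, j in 1:M])
    results = pmap(p -> propagate_splitting(p...; kwargs...), pairs)
    return reshape(results, N, M)
end

One nice property of pmap is that it hands out work dynamically, so uneven runtimes per call get balanced across the workers.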

However, that result is extremely alarming: shared-memory parallelization of the same problem should always have less overhead than multi-process parallelization. So if there’s a memory bandwidth problem in the shared-memory @threads variant, it seems to me that Julia is seriously mismanaging memory access here.

In any case: how would one go about analyzing memory-bound issues? Is it just about the allocations? Is there some other way to see how memory is being accessed?

Just to boil it down to the basics: I’m essentially just trying to do a parallel map here, where I have 256 values x (tuples of two numbers, but it doesn’t really matter), and I want y = map(f, x). The f in this case is a relatively complicated function that takes about a second to run and does a fair amount of manipulating vectors, but it really shouldn’t matter: this kind of problem should always be perfectly parallelizable, almost no matter what f is.
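
In terms of Julia’s threading primitives, that’s just one task per element, e.g. (tmap here is an ad-hoc helper, not from any package):

function tmap(f, xs)
    # spawn one task per element; the scheduler distributes the tasks across threads
    tasks = map(x -> Threads.@spawn(f(x)), xs)
    return map(fetch, tasks)
end

So y = tmap(f, x) would then be the threaded equivalent of y = map(f, x).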

One caveat would be if a single call to f saturates memory or I/O bandwidth, but I don’t really see how that would be happening here. Besides, I think that would be reflected in a reduced CPU utilization that would be visible in htop, no? Plus, it seems fine in the multi-process setup, which should be worse, if anything. So, very strange… :person_shrugging:

No, it wouldn’t be visible in htop. Stalled cores waiting on memory would still be at 100%.

I would recommend LIKWID, however.


I think the GC can get in the way of parallel threads: if your code is allocating a lot, then all the threads have to stop and wait for the GC (someone who knows more can correct me if I’m mistaken).

Since you’ve switched to distributed and seen a much bigger improvement, this makes me think the GC is what’s bottlenecking your performance.


The GC is not multi-threaded, but it is effectively parallelized when you use multiple processes (each process has its own GC). This makes sense and just means you need to reduce allocations in your workload.


I didn’t bring up the GC because it didn’t look like a major problem in this earlier comment:

Specifically:

:> JULIA_EXCLUSIVE=1 julia --project=. -t 1 benchmark2.jl
153.666410 seconds (290.36 M allocations: 410.962 GiB, 5.05% gc time, 0.04% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 2 benchmark2.jl
 98.435518 seconds (309.35 M allocations: 411.444 GiB, 6.31% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 4 benchmark2.jl
 99.298586 seconds (333.82 M allocations: 412.088 GiB, 5.35% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 8 benchmark2.jl
121.498563 seconds (374.74 M allocations: 413.286 GiB, 3.96% gc time, 0.07% compilation time)

Only 4% GC time with 8 threads.
I’ve often seen >80% gc time in multithreaded workloads when using 18 or more threads.
This is far away from those really problematic cases.


you need to reduce allocations in your workload

Indeed, that’s what turned out to solve the problem.

There was just one routine that was sloppily written, which I was able to rewrite with an in-place version.
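
For illustration, the change amounts to replacing out-of-place steps of the form Ψ = Uv .* ifft(Uk .* fft(Ψ)) with precomputed in-place FFT plans and in-place broadcasts (a sketch, not the actual routine from the repo; all names here are illustrative):

using FFTW

# The plans and phase vectors are created once, outside the hot loop,
# and the state Ψ is then updated in place on every step.
function split_step!(Ψ, pfft!, pifft!, Uk, Uv)
    pfft! * Ψ      # forward FFT, in place
    Ψ .*= Uk       # "kinetic" phase factor, elementwise in place
    pifft! * Ψ     # inverse FFT, in place
    Ψ .*= Uv       # "potential" phase factor, elementwise in place
    return Ψ
end

# setup, once:  pfft! = plan_fft!(Ψ);  pifft! = plan_ifft!(Ψ)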

A big shoutout to TimerOutputs.jl, which was extremely helpful for the profiling!
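
For anyone who hasn’t used it: the pattern is just to wrap suspect sections in @timeit and print the accumulated table at the end (the labels below are made up):

using TimerOutputs

const to = TimerOutput()

function step_with_timing(x)
    # @timeit returns the value of the timed expression while recording
    # time and allocations under the given label
    y = @timeit to "phase factors" exp.(x)
    return @timeit to "reduction" sum(abs2, y)
end

step_with_timing(rand(1024))
print_timer(to)   # the printed table includes allocation counts per label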

With the reduced allocations, I’m now seeing perfectly linear scaling:

:> JULIA_EXCLUSIVE=1 julia --project=. -t 1 benchmark2.jl
  0.210143 seconds (33.80 k allocations: 1.917 MiB)
  0.221251 seconds (33.88 k allocations: 1.922 MiB, 1.29% compilation time)
 54.201312 seconds (8.73 M allocations: 494.694 MiB, 0.10% gc time, 0.07% compilation time)

:> JULIA_EXCLUSIVE=1 julia --project=. -t 2 benchmark2.jl
  0.212420 seconds (33.80 k allocations: 1.917 MiB)
  0.225637 seconds (33.88 k allocations: 1.922 MiB, 1.24% compilation time)
 27.806800 seconds (8.73 M allocations: 494.716 MiB, 0.24% gc time, 0.17% compilation time)

:> JULIA_EXCLUSIVE=1 julia --project=. -t 4 benchmark2.jl
  0.213204 seconds (33.80 k allocations: 1.917 MiB)
  0.215185 seconds (33.88 k allocations: 1.922 MiB, 1.33% compilation time)
 14.431739 seconds (8.73 M allocations: 494.777 MiB, 0.41% gc time, 0.34% compilation time)

:> JULIA_EXCLUSIVE=1 julia --project=. -t 8 benchmark2.jl
  0.207609 seconds (33.80 k allocations: 1.917 MiB)
  0.212293 seconds (33.88 k allocations: 1.922 MiB, 1.33% compilation time)
  8.424910 seconds (8.73 M allocations: 494.711 MiB, 0.48% compilation time)

So, problem solved!

The whole issue does leave a bad aftertaste, though: I really feel that Julia’s garbage collector should handle this situation much better. At this point, garbage collection is the number one worry I have about Julia’s efficacy for this kind of work. Maybe in future versions (in addition to improving the GC under parallel threads), Julia could gain a built-in @deallocate for manual memory management, which would allow turning off garbage collection for large sections of a program, although I understand that this inevitably opens the door to potential segfaults.
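
(The closest thing that exists today is temporarily disabling collection, which only defers the problem rather than freeing anything manually; a sketch:)

GC.enable(false)        # pause garbage collection around a critical section
try
    # ... allocation-heavy hot loop ...
finally
    GC.enable(true)     # re-enable collection
    GC.gc()             # and collect whatever piled up
end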

Anyway, thanks again, everyone. This has been very instructive!


So, as you can see, the allocations turned out to be the problem?

I wonder if the GC % is deceiving, because it shows what percentage of the time was spent in GC, but it doesn’t account for the delay the GC caused by blocking other work from running.

I suspect that because the GC blocks further work from running, the total time is increased, and so the GC % doesn’t look as bad as it really is.
