Embarrassingly parallel multi-threading doesn't scale

A simple piece of embarrassingly parallel code, run with 4 threads, only halves its runtime.
Here is a super simplified case:

:~$ julia -t4

julia> N = Int(1e8)
100000000

julia> @elapsed for i in 1:N
       rand(100)
       end
46.246816922

julia> @elapsed Threads.@threads for i in 1:N
       rand(100)
       end
24.132724014

julia> Threads.nthreads()
4

Speed-up is only ~2! How is this possible!?
I’ve tried different numbers of threads and different values of N.
I’ve tried in Jupyter notebooks and with other embarrassingly parallel tasks, but I keep getting the same thing.
I’ve also tried @belapsed and @btime to make sure this wasn’t a one-time thing.
I have a ThinkPad T14 with Ubuntu 20.04 and an AMD Ryzen 7 with 8 cores (16 threads with simultaneous multithreading).

I thought maybe it reports 4 threads but is actually running on just 2 cores with two hyperthreads each, but I don’t know how to test that.
Any ideas?
Thanks

We had a similar question recently: Loosing performance with `Threads.@threads` for loop - #6 by Kevin_Kleinbeck

Maybe you are oversaturating your machine?

The rand(100) call is probably too fast, and it also requires a lot of memory allocations, so the whole computation is likely bound by memory access. These very simple benchmarks usually do not measure anything meaningful in terms of threading performance. In particular, in real applications you should try to avoid allocations as much as possible, exactly to avoid concurrency on memory access. For example, this is probably already better:

julia> using BenchmarkTools  # for @btime

julia> N = 10^5 #  :-) 

julia> @btime for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  31.868 ms (0 allocations: 0 bytes)

julia> @btime Threads.@threads for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  10.256 ms (21 allocations: 1.84 KiB)

(Also, rand() is a tricky function to include in these benchmarks, because it may have to access a global shared variable to produce the sequence of numbers.)
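If that contention turns out to matter, one workaround is to give each chunk of work its own RNG and pass it to rand explicitly. A minimal sketch (the function name and the seeding by chunk index are just illustrative):

julia> using Random

julia> function rng_per_chunk(N)
           nt = Threads.nthreads()
           Threads.@threads for t in 1:nt
               rng = MersenneTwister(t)  # private RNG per chunk; seeding by t is illustrative
               for i in 1:cld(N, nt), j in 1:100
                   rand(rng)  # no shared RNG state between threads
               end
           end
       end
rng_per_chunk (generic function with 1 method)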

5 Likes

Hi Leandro,

does the thread scheduling in Julia automatically consider the number of available cores? And even if so, if you start 10,000 threads at the same time, isn’t there some kind of thread-switching overhead to expect?

These things are lowered roughly in the same way:

julia> using BenchmarkTools

julia> N = 10^5
100000

julia> @btime Threads.@threads for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  10.329 ms (21 allocations: 1.86 KiB)

julia> @btime Threads.@threads for i in 1:4
           for k in 1:$N÷4
               for j in 1:100
                   rand()
               end
           end
       end
  10.338 ms (21 allocations: 1.86 KiB)

1 Like

No, you need to start Julia with the number of threads you want to use in the session: Multi-Threading · The Julia Language. There is also an option to automatically use the number of threads available on the system (-t auto).
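For example (what nthreads() then reports depends on the machine):

:~$ julia -t auto

julia> Threads.nthreads()  # the number of threads Julia started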

2 Likes

Thanks for the replies.
OK, this is helpful, but a bit disappointing, since in actual real cases I use rand() a lot and access memory in every loop iteration to save the results.

Maybe I’d be better off with Distributed.jl for embarrassingly parallel tasks; then there’s no memory concurrency.

I don’t think that was the question.

If you launch Julia with 4 threads (julia -t4), will it run on 4 different cores (assuming your machine has 4 or more cores), or might it run on two cores with double threading?
That would explain why the runtime is only halved.

To get good scaling, the work done on each thread must be enough to compensate for the overhead associated with threading. Concerning the memory accesses, you can allocate separate arrays for each thread to store results, as in the sketch below. If you have an actual application code that you want to scale better, maybe you can ask for help with that specifically; the optimal solution certainly depends on the calculations involved.
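For instance, here is a minimal sketch where each iteration writes only to its own slot of a preallocated results array, so no state is shared between threads (the function name and the summing workload are just illustrative):

julia> function sums_threaded(N)
           results = Vector{Float64}(undef, N)
           Threads.@threads for i in 1:N
               s = 0.0
               for j in 1:100
                   s += rand()
               end
               results[i] = s  # each iteration writes only to its own index
           end
           return results
       end
sums_threaded (generic function with 1 method)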

3 Likes

First of all, don’t benchmark loops in the global scope; it’s really bad for Julia’s performance model. Now, regarding the runtime performance: a lot of this is because of allocation-heavy code. You should try to reduce allocations in your multithreaded functions by reusing buffers, and if you can’t, you should consider using multiprocessing instead.

julia> f(N) = for i in 1:N
           rand(100)
       end
f (generic function with 1 method)

julia> g(N) = Threads.@threads for i in 1:N
           rand(100)
       end
g (generic function with 1 method)

julia> using Distributed; addprocs(3)
3-element Vector{Int64}:
 2
 3
 4

julia> h(N) = @sync @distributed for i in 1:N
           rand(100)
       end
h (generic function with 1 method)

julia> N = Int(1e8)
100000000

julia> @elapsed f(N)
23.049963313

julia> @elapsed g(N)
18.569370371

julia> @elapsed h(N)
9.839405454

If I understand correctly, every time the GC runs it currently has to pause all the threads, which slows down the multithreaded code a lot. If you show us something more realistic, it’s possible we can help you eliminate your allocations.
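(To see how much of the runtime that is, I believe the gc-time fraction printed by @time is a reasonable proxy:)

julia> @time g(N);  # the printed "% gc time" is the share of time spent in garbage collection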

3 Likes

For example, here’s how I’d cut out the allocations from your example:

julia> using BenchmarkTools  # for @belapsed

julia> using Random: rand!

julia> function i(N)
           buffers = [Vector{Float64}(undef, 100) for _ ∈ 1:Threads.nthreads()]
           Threads.@threads for i in 1:N
               rand!(buffers[Threads.threadid()])
           end
       end
i (generic function with 1 method)

julia> @belapsed i($N)
3.908675333

Here, I preallocate one vector per thread, and then inside the loop, I call rand! on that preallocated array, mutating it in place with random numbers instead of creating a new array.

4 Likes

I’d recommend avoiding the buffers[Threads.threadid()] pattern, since it’s very hard to know when it’s OK to use it: FAQ · FLoops

4 Likes

I actually thought of that as I wrote it, but didn’t have time to chase down the reference and the recommended alternative. Thanks for the link.

2 Likes

Shouldn’t this example be fine, though, at least if using Threads.@threads :static?
Because buffers was allocated inside i, no external tasks will have access to it.
I’m not sure whether the static scheduler really guarantees no task migration, though.
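That is, the variant I have in mind would be something like this (a sketch; it assumes a Julia version where @threads accepts a schedule argument, and reuses rand! from above):

julia> function i_static(N)  # hypothetical :static variant of i above
           buffers = [Vector{Float64}(undef, 100) for _ ∈ 1:Threads.nthreads()]
           Threads.@threads :static for k in 1:N
               rand!(buffers[Threads.threadid()])
           end
       end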

It would probably be better to just @spawn a task per thread, and have those tasks reuse a local buffer that they initialize themselves. Each of them could then also explicitly manage its local RNG if necessary/desired.
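A minimal sketch of that pattern (the function name, chunking, and seeding are all illustrative):

julia> using Random

julia> function spawn_chunks(N)
           nt = Threads.nthreads()
           tasks = map(1:nt) do t
               Threads.@spawn begin
                   buf = Vector{Float64}(undef, 100)  # task-local buffer, initialized by the task itself
                   rng = MersenneTwister(t)           # task-local RNG; seeding by t is illustrative
                   for _ in 1:cld(N, nt)
                       rand!(rng, buf)                # fill the local buffer in place
                   end
               end
           end
           foreach(wait, tasks)                       # wait for all spawned tasks to finish
       end
spawn_chunks (generic function with 1 method)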

AFAIK, it does. (@vchuravy)

@threads does currently set the tasks to sticky, but it would be good to know whether this is an implementation detail or semantically guaranteed. The docstring doesn’t mention it.

2 Likes

My understanding is that it is an implementation detail for @threads but guaranteed for @threads :static. But I agree, it would be great to make this official in the docs / docstring.

Yeah, I agree. But code posted on Discourse is unlikely to be used as-is. It’ll be modified “a little bit” by people with various levels of knowledge. It’s possible that people swap Threads.@threads for with something else (e.g., a direct @spawn) and add (say) some print-based debug statements.

So, I think recommending the buffers[threadid()] pattern needs an explanation of why it’s OK here and when it can break. But that’s tedious, so, for simplicity, by default I prefer to discourage it and recommend a simple, generic approach. It’s sometimes useful ATM, but I think it’s a very implementation-dependent, expert-only pattern.

6 Likes