Embarrassingly parallel multi-threading doesn't scale

A simple piece of embarrassingly parallel code, run with 4 threads, only halves its runtime.
Here is a super simplified case:

:~$ julia -t4

julia> N = Int(1e8)
100000000

julia> @elapsed for i in 1:N
       rand(100)
       end
46.246816922

julia> @elapsed Threads.@threads for i in 1:N
       rand(100)
       end
24.132724014

julia> Threads.nthreads()
4

Speed-up is only ~2! How is this possible!?
I’ve tried different numbers of threads and different values of N.
I’ve tried in Jupyter notebooks and with other embarrassingly parallel tasks, but I keep getting the same thing.
I’ve also tried @belapsed and @btime to make sure this wasn’t a one-time thing.
I have a ThinkPad T14 with Ubuntu 20.04 and an AMD Ryzen 7 with 8 cores (16 threads with simultaneous multithreading).

I thought maybe it reports 4 threads but is actually running on just 2 cores with two hyperthreads each, but I don’t know how to test that.
Any ideas?
Thanks

We had a similar question recently: Loosing performance with `Threads.@threads` for loop - #6 by Kevin_Kleinbeck

Maybe you are oversaturating your machine?

The rand(100) call is probably too fast, and it also requires a lot of memory allocations, so the whole computation is likely bound by memory access. These very simple benchmarks usually do not measure anything meaningful in terms of threading performance. In particular, in real applications you should try to avoid allocations as much as possible, exactly to avoid concurrency on memory access. For example, this is probably already better:

julia> using BenchmarkTools  # for @btime

julia> N = 10^5 #  :-) 

julia> @btime for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  31.868 ms (0 allocations: 0 bytes)

julia> @btime Threads.@threads for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  10.256 ms (21 allocations: 1.84 KiB)

(Also, rand() is a tricky function to include in these benchmarks, because it may have to access a global shared variable to produce the sequence of numbers.)
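If that contention turns out to matter, one workaround is to give each chunk of work its own RNG and pass it to rand explicitly. A minimal sketch (the function name and the seeding by chunk index are just illustrative):

julia> using Random

julia> function rng_per_chunk(N)
           nt = Threads.nthreads()
           Threads.@threads for t in 1:nt
               rng = MersenneTwister(t)  # private RNG per chunk; seeding by t is illustrative
               for i in 1:cld(N, nt), j in 1:100
                   rand(rng)  # no shared RNG state between threads
               end
           end
       end
rng_per_chunk (generic function with 1 method)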

5 Likes

Hi Leandro,

does the thread scheduling in Julia automatically consider the number of available cores? And even if so, if you start 10,000 threads at the same time, isn’t there some kind of thread-switching overhead to expect?

These things are lowered roughly in the same way:

julia> using BenchmarkTools

julia> N = 10^5
100000

julia> @btime Threads.@threads for i in 1:$N
           for j in 1:100
               rand()
           end
       end
  10.329 ms (21 allocations: 1.86 KiB)

julia> @btime Threads.@threads for i in 1:4
           for k in 1:$N÷4
               for j in 1:100
                   rand()
               end
           end
       end
  10.338 ms (21 allocations: 1.86 KiB)

1 Like

No, you need to start Julia with the number of threads you want to use in the session: Multi-Threading · The Julia Language. There is also an option to automatically use the number of threads available on the system (-t auto).
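For example (what nthreads() then reports depends on the machine):

:~$ julia -t auto

julia> Threads.nthreads()  # the number of threads Julia started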

2 Likes

Thanks for the replies.
OK, this is helpful, but a bit disappointing, since in actual real cases I use rand() a lot and access memory in every loop iteration to save the results.

Maybe I’d be better off with Distributed.jl for embarrassingly parallel tasks; then there’s no memory concurrency.

I don’t think that was the question.

If you launch Julia with 4 threads (julia -t4), will it run on 4 different cores (assuming your machine has 4 or more cores), or might it run on two cores with double threading?
That would explain why the runtime is only halved.

To get good scaling, the work done on each thread must be enough to compensate for the overhead associated with threading. Concerning the memory accesses, you can allocate separate arrays for each thread to store results, as in the sketch below. If you have an actual application code that you want to scale better, maybe you can ask for help with that specifically; the optimal solution certainly depends on the calculations involved.
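For instance, here is a minimal sketch where each iteration writes only to its own slot of a preallocated results array, so no state is shared between threads (the function name and the summing workload are just illustrative):

julia> function sums_threaded(N)
           results = Vector{Float64}(undef, N)
           Threads.@threads for i in 1:N
               s = 0.0
               for j in 1:100
                   s += rand()
               end
               results[i] = s  # each iteration writes only to its own index
           end
           return results
       end
sums_threaded (generic function with 1 method)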

3 Likes

First of all, don’t benchmark loops in the global scope; it’s really bad for Julia’s performance model. Now, regarding the runtime performance: a lot of this is because of allocation-heavy code. You should try to reduce allocations in your multithreaded functions by reusing buffers, and if you can’t, you should consider using multiprocessing instead.

julia> f(N) = for i in 1:N
           rand(100)
       end
f (generic function with 1 method)

julia> g(N) = Threads.@threads for i in 1:N
           rand(100)
       end
g (generic function with 1 method)

julia> using Distributed; addprocs(3)
3-element Vector{Int64}:
 2
 3
 4

julia> h(N) = @sync @distributed for i in 1:N
           rand(100)
       end
h (generic function with 1 method)

julia> N = Int(1e8)
100000000

julia> @elapsed f(N)
23.049963313

julia> @elapsed g(N)
18.569370371

julia> @elapsed h(N)
9.839405454

If I understand correctly, every time the GC runs it currently has to pause all the threads, which slows down the multithreaded code a lot. If you show us something more realistic, it’s possible we can help you eliminate your allocations.
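(To see how much of the runtime that is, I believe the gc-time fraction printed by @time is a reasonable proxy:)

julia> @time g(N);  # the printed "% gc time" is the share of time spent in garbage collection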

3 Likes

For example, here’s how I’d cut out the allocations from your example:

julia> using BenchmarkTools  # for @belapsed

julia> using Random: rand!

julia> function i(N)
           buffers = [Vector{Float64}(undef, 100) for _ ∈ 1:Threads.nthreads()]
           Threads.@threads for i in 1:N
               rand!(buffers[Threads.threadid()])
           end
       end
i (generic function with 1 method)

julia> @belapsed i($N)
3.908675333

Here, I preallocate one vector per thread, and then inside the loop, I call rand! on that preallocated array, mutating it in place with random numbers instead of creating a new array.

4 Likes

I’d recommend avoiding the buffers[Threads.threadid()] pattern, since it’s very hard to know when it’s OK to use it: FAQ · FLoops

4 Likes

I actually thought of that as I wrote it, but didn’t have time to chase down the reference and the recommended alternative. Thanks for the link.

2 Likes

Shouldn’t this example be fine, though, at least if using Threads.@threads :static?
Because buffers was allocated inside i, no external tasks will have access to it.
I’m not sure whether the static scheduler really guarantees no task migration, though.
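That is, the variant I have in mind would be something like this (a sketch; it assumes a Julia version where @threads accepts a schedule argument, and reuses rand! from above):

julia> function i_static(N)  # hypothetical :static variant of i above
           buffers = [Vector{Float64}(undef, 100) for _ ∈ 1:Threads.nthreads()]
           Threads.@threads :static for k in 1:N
               rand!(buffers[Threads.threadid()])
           end
       end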

It would probably be better to just @spawn a task per thread, and have those tasks reuse a local buffer that they initialize themselves. Each of them could then also explicitly manage its local RNG if necessary/desired.
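A minimal sketch of that pattern (the function name, chunking, and seeding are all illustrative):

julia> using Random

julia> function spawn_chunks(N)
           nt = Threads.nthreads()
           tasks = map(1:nt) do t
               Threads.@spawn begin
                   buf = Vector{Float64}(undef, 100)  # task-local buffer, initialized by the task itself
                   rng = MersenneTwister(t)           # task-local RNG; seeding by t is illustrative
                   for _ in 1:cld(N, nt)
                       rand!(rng, buf)                # fill the local buffer in place
                   end
               end
           end
           foreach(wait, tasks)                       # wait for all spawned tasks to finish
       end
spawn_chunks (generic function with 1 method)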

AFAIK, it does. (@vchuravy)

@threads does currently set the tasks to sticky, but it would be good to know whether this is an implementation detail or semantically guaranteed. The docstring doesn’t mention it.

2 Likes

My understanding is that it is an implementation detail for @threads but guaranteed for @threads :static. But I agree, it would be great to make this official in the docs / docstring.

Yeah, I agree. But code posted on Discourse is unlikely to be used as-is. It’ll be modified “a little bit” by people with various levels of knowledge. It’s possible that people swap Threads.@threads for with something else (e.g., a direct @spawn) and add (say) some print-based debug statements.

So, I think recommending the buffers[threadid()] pattern needs an explanation of why it’s OK here and when it can break. But that’s tedious, so, for simplicity, by default I prefer to discourage it and recommend a simple, generic approach. It’s sometimes useful ATM, but I think it’s a very implementation-dependent, expert-only pattern.

6 Likes