Julia multithreading is running slower than serial, can someone please explain why...?

I am trying to parallelize an algorithm that looks like this: I initialize a vector of vectors (50000 vectors of 1000 elements each), and in the loop I move a random element from each odd-numbered vector into the adjacent even-numbered vector at a random position.

For example, if

A[1] = [1,2,3]
A[2] = [4,5,6]

after running we get

A[1] = [1,3]
A[2] = [4,5,6,2]

Here is the code:

using FLoops, FoldsThreads, ThreadsX, StatsBase, BenchmarkTools

function main()

    num ::Int = 50000
    A = [[sample(collect(1:num), 1000; replace=false)] for i in 1:num]

    @floop for i in 1:2:num
        truck1 = i
        truck2 = i+1 

        truck1_stops = copy(A[truck1])
        truck2_stops = copy(A[truck2])

        temp_index1 = rand(collect(1:length(truck1_stops)))
        temp_index2 = rand(collect(1:length(truck2_stops)))

        insert!(truck2_stops, temp_index2, truck1_stops[temp_index1])

        splice!(truck1_stops, temp_index1)

        A[truck1] = copy(truck1_stops)
        A[truck2] = copy(truck2_stops)
    end

end

@btime main()

Threads.nthreads()

Output of the parallel version is

 18.844 s (725452 allocations: 19.91 GiB)

Output of the serial version is

9.394 s (725410 allocations: 19.91 GiB)

Please let me know why my parallel code is taking longer than my serial code.

The situation is worse in my actual code: the parallel version takes at least 10x longer than the serial one, and I can’t figure out why.

Thank you in advance…!

Hey there!

So I looked a bit at your code, and it has a lot of memory allocations that may prevent threads from working well in parallel. Before using multithreading, it’s always a good idea to squeeze every inch of performance out of the serial code. A great source for that is Julia’s performance tips.

For reference, here is the performance of your code on my computer:

julia> @benchmark yourmain()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.947 s …    3.277 s  ┊ GC (min … max):  8.43% … 14.76%
 Time  (median):     3.112 s               ┊ GC (median):    11.76%
 Time  (mean ± σ):   3.112 s ± 232.697 ms  ┊ GC (mean ± σ):  11.76% ±  4.47%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.95 s         Histogram: frequency by time         3.28 s <

 Memory estimate: 19.91 GiB, allocs estimate: 725338.

julia> @benchmark yourmain_parallel()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.027 s …   3.068 s  ┊ GC (min … max): 4.16% … 11.03%
 Time  (median):     3.047 s              ┊ GC (median):    7.62%
 Time  (mean ± σ):   3.047 s ± 28.451 ms  ┊ GC (mean ± σ):  7.62% ±  4.86%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.03 s         Histogram: frequency by time        3.07 s <

 Memory estimate: 19.91 GiB, allocs estimate: 725427.

From what I can see, most of the allocations in there (calls to copy or collect) are actually not needed. Besides, it seems that your A does not have the right format (remove the [...] around sample?). Here’s the first version I tried:

function main(num=50000)
    stops = [sample(1:num, 1000; replace=false) for _ in 1:num]
    for i in 1:2:num
        t1, t2 = i, i + 1
        t1_stops, t2_stops = stops[t1], stops[t2]
        j1, j2 = rand(1:length(t1_stops)), rand(1:length(t2_stops))
        insert!(t2_stops, j2, t1_stops[j1])
        splice!(t1_stops, j1)
    end
end

Let’s see how it compares to yours (every time, the _parallel counterpart just has a @floop macro in front of the for loop):

julia> @benchmark main()
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.329 s …   1.534 s  ┊ GC (min … max): 4.05% … 9.71%
 Time  (median):     1.500 s              ┊ GC (median):    9.72%
 Time  (mean ± σ):   1.466 s ± 94.484 ms  ┊ GC (mean ± σ):  8.43% ± 2.84%

  █                                       █             █ █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█ ▁
  1.33 s         Histogram: frequency by time        1.53 s <

 Memory estimate: 1.67 GiB, allocs estimate: 425322.

julia> @benchmark main_parallel()
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.163 s …    1.401 s  ┊ GC (min … max):  1.34% … 14.10%
 Time  (median):     1.390 s               ┊ GC (median):    12.37%
 Time  (mean ± σ):   1.336 s ± 115.509 ms  ┊ GC (mean ± σ):  10.42% ±  5.87%

  █                                                     █ ██
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁██ ▁
  1.16 s         Histogram: frequency by time          1.4 s <

 Memory estimate: 1.67 GiB, allocs estimate: 425427.

Not bad, right? Still, the speedup obtained by multithreading is not great, so I decided to profile my function. It turns out that most of the time is spent initializing the big array, which we do not parallelize. So I wrote another version which modifies an existing array instead.

function init(num=50000)
    stops = [sample(1:num, 1000; replace=false) for _ in 1:num]
    return stops
end

function main!(stops)
    for i in 1:2:length(stops)
        t1, t2 = i, i + 1
        t1_stops, t2_stops = stops[t1], stops[t2]
        j1, j2 = rand(1:length(t1_stops)), rand(1:length(t2_stops))
        insert!(t2_stops, j2, t1_stops[j1])
        splice!(t1_stops, j1)
    end
end

Let’s run another benchmark!

julia> @benchmark main!(stops) setup=(stops=init())
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  36.784 ms … 73.134 ms  ┊ GC (min … max):  0.00% … 48.15%
 Time  (median):     44.647 ms              ┊ GC (median):    14.84%
 Time  (mean ± σ):   49.803 ms ± 17.156 ms  ┊ GC (mean ± σ):  24.33% ± 23.17%

  █                        ▁                                ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  36.8 ms         Histogram: frequency by time        73.1 ms <

 Memory estimate: 411.99 MiB, allocs estimate: 25000.

julia> @benchmark main_parallel!(stops) setup=(stops=init())
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  29.238 ms … 157.372 ms  ┊ GC (min … max):  0.00% … 80.41%
 Time  (median):     29.898 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   61.602 ms ±  63.848 ms  ┊ GC (mean ± σ):  51.35% ± 40.20%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  29.2 ms         Histogram: frequency by time          157 ms <

 Memory estimate: 412.00 MiB, allocs estimate: 25111.

Unsurprisingly this is much faster, but there is still no benefit from multithreading. That is probably because the array insertions and deletions still move and allocate too much memory. My best advice would be to take advantage of the fact that the array sizes are bounded, and circumvent the allocations that way.
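
To make that advice concrete, here is a minimal sketch of what I have in mind (init_fixed, move_stop! and main_fixed! are made-up names, and I use Threads.@threads rather than @floop just to keep it self-contained): keep all the stops in one preallocated matrix together with a vector of current lengths, so that moving a stop only shifts entries in place and never resizes anything. Note that cap = len + 1 is enough here only because each even-numbered truck receives at most one new stop.

using StatsBase

function init_fixed(num=50_000, len=1_000, cap=len + 1)
    stops = Matrix{Int}(undef, cap, num)   # column j holds the stops of truck j
    lens = fill(len, num)                  # current number of stops per truck
    for j in 1:num
        stops[1:len, j] .= sample(1:num, len; replace=false)
    end
    return stops, lens
end

function move_stop!(stops, lens, t1, t2)
    j1 = rand(1:lens[t1])                  # stop to remove from truck t1
    j2 = rand(1:(lens[t2] + 1))            # position to insert into truck t2
    x = stops[j1, t1]
    for k in j1:(lens[t1] - 1)             # delete: shift the tail of t1 left
        stops[k, t1] = stops[k + 1, t1]
    end
    lens[t1] -= 1
    for k in lens[t2]:-1:j2                # insert: shift the tail of t2 right
        stops[k + 1, t2] = stops[k, t2]
    end
    stops[j2, t2] = x
    lens[t2] += 1
    return nothing
end

function main_fixed!(stops, lens)
    Threads.@threads for i in 1:2:size(stops, 2)
        move_stop!(stops, lens, i, i + 1)  # each pair (i, i+1) is touched by exactly one task
    end
end

Each move still shifts on the order of len entries, just like insert! and splice!, but nothing is allocated inside the loop, which is usually what lets the threads actually scale.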

All of this was run with 12 threads.

I see this collect(1:length(x)) pattern again and again.
Where do people keep getting the idea that it is anything other than terrible?

Can we improve the documentation of something, somewhere?
It’s confusing how or why this keeps happening.


I would assume they just don’t realize that you can use a unit range directly? And they probably also don’t realize there’s a single function (eachindex) that provides every index.

I think it’s a natural line to write if you’re starting out and missing some pieces.


I think a fair bit of poking around and reading has to happen, but something must go horribly wrong in that process, to result in collect.

If they were totally naive, they’d just write 1:length(truck1_stops), perhaps expecting it to return a vector like in R.
They have to actually inspect the result and notice that it is something different, but not notice that this different thing is still an AbstractVector that supports almost anything they’d want to do with a Vector, except mutating it. And yet they’d have to find out about collect, somehow?

It’s like they all reach the same local suboptimum, a point worse than either ignorance or knowledge, and somehow stay there long enough to write code and even worry about its performance.
When minimizing objective functions, you don’t normally get stuck on local maxima…


If you have in your head β€œI need an array, I’m not even going to try a not-array because why would that work, how do I get an array”, you end up there.


Why would they think they need an array?

All this is highly non-obvious, and the interactive nature of Julia allows one to inspect the result, something that does not happen in other languages.

I don’t know about every case, but one might be if you’re coming from a stricter-typed language without Julia’s flexibility. Maybe they saw an example where an array was used but just didn’t see an example with a unit range. Idk, people have all kinds of baked-in assumptions and knowledge gaps. And I think one of the missing pieces is realizing that Julia is pretty permissive and most of the time you can use ranges or generators just as easily as instantiating an array.

Obviously you’re right in that people keep making this mistake.

I think one of the missing pieces is realizing that Julia is pretty permissive and most of the time you can use ranges or generators just as easily as instantiating an array.

I’d also probably want to emphasize that ranges are AbstractArrays.
I don’t think I’ve seen people calling collect on an SVector to iterate or call rand on it.
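
For example, a few REPL lines showing that a range already behaves like a read-only vector (the rand output will of course vary):

julia> r = 1:5
1:5

julia> r isa AbstractVector{Int}
true

julia> r[2], length(r), sum(r)
(2, 5, 15)

julia> rand(r)       # sample directly from the range, no collect needed
3

julia> r .+ 1        # broadcasting works too and returns another range
2:6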

I think that won’t mean anything in this context. If the purpose is to improve the docs, some examples and simple explanations of the behavior are necessary. (Although I’m unsure whether one would stop committing the error just because a docs page explains it; the exploratory nature of programming may simply take us there. In my own learning path I committed all kinds of errors, and only after that did I realize what the docs meant.)


Dear @gdalle, thank you for your answer. I am a beginner in Julia, and my original code has copy operations which I can’t avoid; sadly, I couldn’t express those lines in this example. :slightly_smiling_face:

Dear @Elrod, I use collect because the square-bracket version doesn’t always do what I expect.
For example, A = collect(1:5) .+ 1 works, but A = [1:5] .+ 1 doesn’t. I’m a beginner in Julia, please excuse me if my questions are elementary :slightly_smiling_face:


You can simply do (1:5) .+ 1, treating the range object as a normal vector, as mentioned above. Note that the parentheses are important here! Otherwise you will get 1:(5 .+ 1), which is not what you want.
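
A quick REPL check of the two spellings:

julia> (1:5) .+ 1
2:6

julia> 1:5 .+ 1    # parses as 1:(5 .+ 1)
1:6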

That’s nice @liuyxpp, I’ll write it this way from now on.

I cannot think of unavoidable copy operations in an inner loop. You should always be able to pre-allocate your arrays outside of the loop.
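
As a rough sketch of the pattern I mean (process_all! and buf are made-up names, not from your code), assuming the loop only needs a scratch copy to read from and modify:

function process_all!(A::Vector{Vector{Int}})
    buf = Vector{Int}(undef, maximum(length, A))   # allocated once, large enough for every vector
    for v in A
        n = length(v)
        copyto!(buf, 1, v, 1, n)                   # overwrite the first n slots in place
        # ... work on view(buf, 1:n) instead of on copy(v) ...
    end
    return nothing
end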

Can you share an example where you cannot avoid copying an array?