Julia multithreading is running slower than serial - can someone please explain why…?

I am trying to parallelize an algorithm that looks like the one below. (I couldn't express my code clearly in my previous post.)

The following algorithm takes a random element from an odd column and inserts it at a random position in the adjacent even column.

using BenchmarkTools, Random, StatsBase, Distributions

using FLoops, FoldsThreads, ThreadsX

function test()

    rows, coloumns = 990, 99990

    #########Initializing Matrix A####################################

    A = zeros(Int64, rows, coloumns)

    ########Filling only 75% of rows##################################
    entries :: Vector{Int} = sample(1:floor(Int,0.75*rows), coloumns; replace=true)

    for i in axes(A,2)
        A[1:entries[i],i] .= shuffle(1:entries[i])
    end

    #########Declaring Dummy Variables####################################

    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)

    # println(dummy_A')

    @floop for i in 1:2:coloumns-1
        index1 :: Int = rand(1:dummy_entries[i])
        index2 :: Int = rand(1:dummy_entries[i+1])

        temp :: Int = dummy_A[index1,i]

        ###############Removing an element from the ith column#########################

        for j in index1:dummy_entries[i]
            dummy_A[j,i] = dummy_A[j+1,i]
        end

        dummy_entries[i] = dummy_entries[i] - 1

        ###############Adding an element to the (i+1)th column#########################

        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1,i+1] = dummy_A[j,i+1]
        end

        dummy_A[index2,i+1] = temp

        dummy_entries[i+1] = dummy_entries[i+1] + 1

        ###############Updating only if the condition is satisfied#########################
        if rand() <= 0.5

            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i],i] .= dummy_A[1:entries[i],i]
            A[1:entries[i+1],i+1] .= dummy_A[1:entries[i+1],i+1]
        end
    end

    # println(dummy_A')
end

@benchmark begin
    test()
end

Results for the Serial Version

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  784.253 ms … 877.906 ms  ┊ GC (min … max):  2.06% … 10.24%
 Time  (median):     871.206 ms               ┊ GC (median):    10.31%
 Time  (mean ± σ):   856.444 ms ±  36.009 ms  ┊ GC (mean ± σ):   9.12% ±  3.41%

  █                                               █       ██ ██
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁██ ▁
  784 ms           Histogram: frequency by time          878 ms <

 Memory estimate: 1.91 GiB, allocs estimate: 149748.

Results for Parallel Version with 10 threads

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  749.625 ms …    1.022 s  ┊ GC (min … max):  2.24% … 25.19%
 Time  (median):     886.070 ms               ┊ GC (median):    16.99%
 Time  (mean ± σ):   888.100 ms ± 114.955 ms  ┊ GC (mean ± σ):  16.82% ± 10.66%

  █        █     █                             █        █     █
  █▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁█ ▁
  750 ms           Histogram: frequency by time          1.02 s <

 Memory estimate: 1.91 GiB, allocs estimate: 149580.

Can someone please let me know where I am making a mistake…?

The large memory estimate in your benchmark is likely your problem: you rarely gain much additional performance by threading code that allocates a lot of memory. Here are some lines that allocate memory that you could consider improving.

  • sample(1:floor(Int,0.75*rows), coloumns; replace=true)
  • shuffle(1:entries[i])
  • all deepcopy calls allocate
  • A[1:entries[i],i] .= dummy_A[1:entries[i],i]: consider adding @view annotations here to avoid allocating the slices

Many functions for random sampling have a version that operates on pre-allocated memory. By making use of those you could likely reduce your allocations dramatically.
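
For instance, here is a minimal sketch (not the poster's code; the sizes just mirror the example above) using sample! from StatsBase and shuffle! from Random, both of which fill existing buffers instead of allocating a new array on every call:

using Random, StatsBase

entries = Vector{Int}(undef, 99990)     # buffer allocated once, reused thereafter
sample!(1:742, entries; replace=true)   # fills `entries` in place; 742 = floor(Int, 0.75 * 990)

col = collect(1:742)                    # reusable column buffer
shuffle!(col)                           # permutes `col` in place, no new array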

Dear @baggepinnen, anything outside the loop is not a concern for me. Inside the loop, can you find any issue…?

  • Using @view(A[1:entries[i],i]) .= @view(dummy_A[1:entries[i],i]) helped to decrease the allocations considerably.

Serial output

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  766.348 ms … 806.889 ms  ┊ GC (min … max): 2.03% … 9.57%
 Time  (median):     790.882 ms               ┊ GC (median):    7.53%
 Time  (mean ± σ):   790.327 ms ±  14.373 ms  ┊ GC (mean ± σ):  7.09% ± 2.35%

  █                 █            █    █             █   █     █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█ ▁
  766 ms           Histogram: frequency by time          807 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100002.

Parallel output

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  725.253 ms … 995.283 ms  ┊ GC (min … max):  2.11% … 27.68%
 Time  (median):     858.366 ms               ┊ GC (median):    15.76%
 Time  (mean ± σ):   864.141 ms ± 125.954 ms  ┊ GC (mean ± σ):  16.56% ± 11.46%

  █       ██                                         █      █ █
  █▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁█ ▁
  725 ms           Histogram: frequency by time          995 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100113.
  • We can notice that the parallel output is still slower than serial. Can you please let me know what the problem could be…?
  • Regarding random-sampling functions that operate on pre-allocated memory, can you please give me an example…?

I’m not sure I understand your code, but it seems to me that you’re doing very little work in each loop iteration, so the parallel overhead might simply be larger than the work actually done in each iteration?

say

dummy_A[:,1] = [1 2 3 4 5 0 0 0 0 0]'
dummy_A[:,2] = [5 4 3 2 1 0 0 0 0 0]'

I randomly select a non-zero element from column 1 and place it at a random place in column 2. The final result would be the following if the condition rand() <= 0.5 is satisfied

dummy_A[:,1] = [1 2 3 5 0 0 0 0 0 0]' # element 4 is selected
dummy_A[:,2] = [5 4 4 3 2 1 0 0 0 0]' # 4 is placed at index 3

I am not really convinced that the parallel overhead is simply too large, @nilshg.

anything outside the loop is not a concern for me

All that code is being included in your timing, though. If you don’t want to time it, you’ll need to put it outside of the function and define your function with arguments (including entries and A), e.g. as follows. (I’ve not run this.)

using FLoops, StatsBase, Random, BenchmarkTools
function test(entries, A, dummy_entries, dummy_A, coloumns, ex)
    @floop ex for i in 1:2:coloumns-1
        index1 = rand(1:dummy_entries[i])
        index2 = rand(1:dummy_entries[i+1])
        temp = dummy_A[index1, i]
        for j in index1:dummy_entries[i]
            dummy_A[j, i] = dummy_A[j+1, i]
        end
        dummy_entries[i] = dummy_entries[i] - 1
        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1, i+1] = dummy_A[j, i+1]
        end
        dummy_A[index2, i+1] = temp
        dummy_entries[i+1] = dummy_entries[i+1] + 1
        if rand() <= 0.5
            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i], i] .= dummy_A[1:entries[i], i]
            A[1:entries[i+1], i+1] .= dummy_A[1:entries[i+1], i+1]
        end
    end
end
rows, coloumns = 990, 99990
A = zeros(Int64, rows, coloumns)
entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
for i in axes(A, 2)
    A[1:entries[i], i] .= shuffle(1:entries[i])
end
dummy_A = deepcopy(A)
dummy_entries = deepcopy(entries)

test(entries, A, dummy_entries, dummy_A, coloumns, ThreadedEx())

I imagine most of what you’re seeing in your timing is this initialisation outside of the loop:

function test()
    rows, coloumns = 990, 99990
    A = zeros(Int64, rows, coloumns)
    entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
    for i in axes(A, 2)
        A[1:entries[i], i] .= shuffle(1:entries[i])
    end
    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)
end 

julia> @benchmark test()
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  830.651 ms … 948.136 ms  ┊ GC (min … max):  2.56% … 20.25%
 Time  (median):     928.466 ms               ┊ GC (median):    20.84%
 Time  (mean ± σ):   912.143 ms ±  42.364 ms  ┊ GC (mean ± σ):  16.69% ±  7.64%

  ▁                                     ▁           █  ▁      ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁█ ▁
  831 ms           Histogram: frequency by time          948 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100002.

Yes, @DanielVandH, I didn’t know how to use @benchmark like this

function main()
    @benchmark begin
        test()
    end
end

or any other way to profile from inside a function definition.

In any case, my parallel version should take less time than the serial version regardless of how we use @benchmark, right…?

But unless I misread your posts, it is taking less time? You show a 35-40 ms difference in favour of the parallel version. I think Daniel’s point is that if most of your 750 ms total runtime is actually taken up by the initialisation, then parallelising the loop isn’t going to change the overall runtime much.

Your code is indeed faster with the parallel version - as @nilshg points out, your benchmarks actually show this. Here’s one way to benchmark just the loop itself, which makes the speedup much clearer once the initialisation is excluded.

using FLoops, StatsBase, Random, BenchmarkTools
function test(data, coloumns, ex)
    A, entries, dummy_A, dummy_entries = data
    @floop ex for i in 1:2:coloumns-1
        index1 = rand(1:dummy_entries[i])
        index2 = rand(1:dummy_entries[i+1])
        temp = dummy_A[index1, i]
        for j in index1:dummy_entries[i]
            dummy_A[j, i] = dummy_A[j+1, i]
        end
        dummy_entries[i] = dummy_entries[i] - 1
        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1, i+1] = dummy_A[j, i+1]
        end
        dummy_A[index2, i+1] = temp
        dummy_entries[i+1] = dummy_entries[i+1] + 1
        if rand() <= 0.5
            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i], i] .= dummy_A[1:entries[i], i]
            A[1:entries[i+1], i+1] .= dummy_A[1:entries[i+1], i+1]
        end
    end
end
function get_data()
    rows, coloumns = 990, 99990
    A = zeros(Int64, rows, coloumns)
    entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
    for i in axes(A, 2)
        A[1:entries[i], i] .= shuffle(1:entries[i])
    end
    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)
    return A, entries, dummy_A, dummy_entries
end

coloumns = 99990
par_time = @benchmark test(data, $coloumns, $ThreadedEx()) setup = (data = get_data()) evals = 1
ser_time = @benchmark test(data, $coloumns, $SequentialEx()) setup = (data = get_data()) evals = 1

julia> par_time
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  14.980 ms … 20.065 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.672 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.255 ms ±  1.553 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁                ▁▁█                 ▁ ▁                  ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  15 ms           Histogram: frequency by time        20.1 ms <

 Memory estimate: 162.54 MiB, allocs estimate: 642452.

julia> ser_time
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  72.562 ms … 125.290 ms  ┊ GC (min … max):  0.00% … 40.32%
 Time  (median):     89.215 ms               ┊ GC (median):    25.48%
 Time  (mean ± σ):   92.077 ms ±  15.997 ms  ┊ GC (mean ± σ):  25.40% ± 11.91%

  ▁               ▁ █ ▁▁                                     ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  72.6 ms         Histogram: frequency by time          125 ms <

 Memory estimate: 162.49 MiB, allocs estimate: 641492.

I understand why allocations make for slow code in general. But how does that interact with threading? (Is memory handling shared between threads, or something?)

The garbage collector can be triggered by a memory allocation in any one thread, and when it kicks in, every thread is paused for the duration. So, as you add more threads, a larger and larger fraction of your time will be spent garbage collecting.
The solution to this is to implement a multithreaded GC (on the Julia dev side) and to reduce allocations (on the user side).
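
To see the effect directly, here is a rough sketch (my illustration, not code from this thread) of an allocation-heavy threaded loop. Run it with e.g. julia --threads=10: the "% gc time" reported by @time tends to grow with the thread count, because any iteration's allocation can trigger a stop-the-world collection that pauses all threads.

using Base.Threads

function alloc_heavy(n)
    total = Atomic{Float64}(0.0)
    @threads for _ in 1:n
        v = rand(1000)              # fresh ~8 KiB allocation on every iteration
        atomic_add!(total, sum(v))  # atomic accumulator avoids a data race
    end
    return total[]
end

@time alloc_heavy(10^5)  # compare the GC percentage at 1 vs 10 threads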

There may well be other reasons too that I don’t know about.

Dear @DanielVandH and @nilshg, it seems like you are right. My actual code is somewhat like this:

function test(arguments)
    @floop for i in 1:n
        .
        .
        .
    end
end

function main()
    .
    .

    test(arguments)
end

main()

Whenever I use @floop in the test() function, my code gets 3x slower. I am unable to figure out why: the algorithm is very similar to the one explained above, there are very few allocations happening inside the loop, and there is no data race, yet the code still gets slower.

There could be many reasons, but just note that multithreading by itself is not guaranteed to give you faster code - threads take time to get started. Try making sure you get the most performance out of your serial code first and then timing how much a single iteration of your loop actually takes. If the loop is quick, I doubt multithreading will help.
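
One quick way (my suggestion, not something from the thread) to gauge that baseline cost is to time spawning and fetching a single trivial task; if one iteration of your loop is much cheaper than this, threading individual iterations cannot pay off:

using BenchmarkTools

@btime fetch(Threads.@spawn 1 + 1)  # scheduling overhead, typically a few microseconds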

@jakobnissen already gave a very good answer to this. You can often see the effect of the shared GC by comparing the same code when run using @threads and when it’s run using Distributed. In the latter case, Julia will use additional workers, each of which is its own Julia instance and thus has its own GC, and this can often outperform threads in the presence of lots of allocations.
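
As a minimal sketch of the process-based alternative (the work function here is hypothetical): each Distributed worker is a separate process with its own GC, so allocation-heavy work is not serialized through a shared stop-the-world collector.

using Distributed
addprocs(4)                                              # four worker processes

@everywhere work(n) = sum(sum(rand(1000)) for _ in 1:n)  # allocates heavily

results = pmap(_ -> work(10^4), 1:nworkers())            # each worker collects garbage independently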

Aside from the overhead of launching and synchronising threads, running a loop in parallel can also change the order in which the loop is executed, changing the order of memory reads/writes, which can break any automatic SIMD that the compiler does. This can usually be remedied by splitting the loop into contiguous chunks using Iterators.partition or a package like ChunkSplitters.jl, as sketched below.
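
For instance, here is an illustrative sketch (sum_chunked is my own example, not from the thread) of chunking with Iterators.partition, so that each task works through a contiguous block and keeps the cache-friendly, SIMD-able access pattern within its chunk:

using Base.Threads

function sum_chunked(x; nchunks = nthreads())
    chunks = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = map(chunks) do idx
        @spawn sum(@view x[idx])   # contiguous reads within each chunk
    end
    return sum(fetch.(tasks))
end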

Additionally, if the elements that you write to and read from in an array are too close together, you may also suffer from false sharing, which can drastically slow down your code when using multiple threads.
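
A classic illustration (my sketch; it assumes 64-byte cache lines, i.e. 8 Int64s per line): adjacent per-thread counters share a cache line that ping-pongs between cores, while padded counters each get their own line.

using Base.Threads

function count_adjacent!(counts, n)   # counts has length nthreads()
    @threads :static for t in 1:nthreads()
        for _ in 1:n
            counts[t] += 1            # neighbouring slots share a cache line: slow
        end
    end
end

function count_padded!(counts, n)     # counts has length 8 * nthreads()
    @threads :static for t in 1:nthreads()
        for _ in 1:n
            counts[8t] += 1           # one cache line per thread: fast
        end
    end
end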

In short, there are many possible causes of performance degradation with multithreading, and it’s really hard to know what is causing the problem without looking at the real code and experimenting.
