Julia multithreading is running slower than serial - can someone please explain why…?

I am trying to parallelize an algorithm that looks like the one below. (I couldn't express my code clearly in my previous post.)

The following algorithm takes a random element from an odd column and inserts it at a random position in the adjacent even column.

using BenchmarkTools, Random, StatsBase, Distributions

using FLoops, FoldsThreads, ThreadsX

function test()

    rows, coloumns = 990, 99990

    #########Initializing Matrix A####################################

    A = zeros(Int64, rows, coloumns)

    ########Filling only 75% of rows##################################
    entries :: Vector{Int} = sample(1:floor(Int,0.75*rows), coloumns; replace=true)

    for i in axes(A,2)
        A[1:entries[i],i] .= shuffle(1:entries[i])
    end

    #########Declaring Dummy Variables####################################

    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)

    # println(dummy_A')

    @floop for i in 1:2:coloumns-1
        index1 :: Int = rand(1:dummy_entries[i])
        index2 :: Int = rand(1:dummy_entries[i+1])

        temp :: Int = dummy_A[index1,i]

        ###############Removing an element from the ith column#########################

        for j in index1:dummy_entries[i]
            dummy_A[j,i] = dummy_A[j+1,i]
        end

        dummy_entries[i] = dummy_entries[i] - 1

        ###############Adding an element to the (i+1)th column#########################

        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1,i+1] = dummy_A[j,i+1]
        end

        dummy_A[index2,i+1] = temp

        dummy_entries[i+1] = dummy_entries[i+1] + 1

        ###############Updating only if the condition is satisfied#########################
        if rand() <= 0.5

            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i],i] .= dummy_A[1:entries[i],i]
            A[1:entries[i+1],i+1] .= dummy_A[1:entries[i+1],i+1]
        end
    end

    # println(dummy_A')
end

@benchmark begin
    test()
end

Results for the Serial Version

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  784.253 ms … 877.906 ms  ┊ GC (min … max):  2.06% … 10.24%
 Time  (median):     871.206 ms               ┊ GC (median):    10.31%
 Time  (mean ± σ):   856.444 ms ±  36.009 ms  ┊ GC (mean ± σ):   9.12% ±  3.41%

  █                                               █       ██ ██
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁██ ▁
  784 ms           Histogram: frequency by time          878 ms <

 Memory estimate: 1.91 GiB, allocs estimate: 149748.

Results for Parallel Version with 10 threads

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  749.625 ms …    1.022 s  ┊ GC (min … max):  2.24% … 25.19%
 Time  (median):     886.070 ms               ┊ GC (median):    16.99%
 Time  (mean ± σ):   888.100 ms ± 114.955 ms  ┊ GC (mean ± σ):  16.82% ± 10.66%

  █        █     █                             █        █     █
  █▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁█ ▁
  750 ms           Histogram: frequency by time          1.02 s <

 Memory estimate: 1.91 GiB, allocs estimate: 149580.

Can someone please let me know where I am making a mistake…?

The large memory estimate in your benchmark is likely your problem: you rarely gain much additional performance by threading code that allocates a lot of memory. Here are some lines that allocate memory that you could consider improving.

  • sample(1:floor(Int,0.75*rows), coloumns; replace=true)
  • shuffle(1:entries[i])
  • all deepcopy calls allocate
  • A[1:entries[i],i] .= dummy_A[1:entries[i],i]: consider adding @view annotations here to avoid allocating the slices

Many functions for random sampling have a version that operates on pre-allocated memory. By making use of those you could likely reduce your allocations dramatically.
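
For instance, here is a minimal sketch (not the poster's code; the sizes just mirror the example above) using sample! from StatsBase and shuffle! from Random, both of which fill existing buffers instead of allocating a new array on every call:

using Random, StatsBase

entries = Vector{Int}(undef, 99990)     # buffer allocated once, reused thereafter
sample!(1:742, entries; replace=true)   # fills `entries` in place; 742 = floor(Int, 0.75 * 990)

col = collect(1:742)                    # reusable column buffer
shuffle!(col)                           # permutes `col` in place, no new array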

Dear @baggepinnen, anything outside the loop is not a concern for me. Inside the loop, can you find any issue…?

  • Using @view(A[1:entries[i],i]) .= @view(dummy_A[1:entries[i],i]) helped to decrease the allocations considerably.

Serial output

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  766.348 ms … 806.889 ms  ┊ GC (min … max): 2.03% … 9.57%
 Time  (median):     790.882 ms               ┊ GC (median):    7.53%
 Time  (mean ± σ):   790.327 ms ±  14.373 ms  ┊ GC (mean ± σ):  7.09% ± 2.35%

  █                 █            █    █             █   █     █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█ ▁
  766 ms           Histogram: frequency by time          807 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100002.

Parallel output

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  725.253 ms … 995.283 ms  ┊ GC (min … max):  2.11% … 27.68%
 Time  (median):     858.366 ms               ┊ GC (median):    15.76%
 Time  (mean ± σ):   864.141 ms ± 125.954 ms  ┊ GC (mean ± σ):  16.56% ± 11.46%

  █       ██                                         █      █ █
  █▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁█ ▁
  725 ms           Histogram: frequency by time          995 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100113.
  • We can notice that the parallel output is still slower than serial. Can you please let me know what the problem could be…?
  • Regarding random-sampling functions that operate on pre-allocated memory, can you please give me an example…?

I’m not sure I understand your code, but it seems to me that you’re doing very little work in each loop iteration, so the parallel overhead might simply be larger than the work actually done in each iteration?

say

dummy_A[:,1] = [1 2 3 4 5 0 0 0 0 0]'
dummy_A[:,2] = [5 4 3 2 1 0 0 0 0 0]'

I randomly select a non-zero element from column 1 and place it at a random place in column 2. The final result would be the following if the condition rand() <= 0.5 is satisfied

dummy_A[:,1] = [1 2 3 5 0 0 0 0 0 0]' # element 4 is selected
dummy_A[:,2] = [5 4 4 3 2 1 0 0 0 0]' # 4 is placed at index 3

I am not really convinced that the parallel overhead is simply too large, @nilshg.

anything outside the loop is not a concern for me

All that code is being included in your timing, though. If you don’t want to time it, you’ll need to put it outside of the function and define your function with arguments (including entries and A), e.g. as follows. (I’ve not run this.)

using FLoops, StatsBase, Random, BenchmarkTools
function test(entries, A, dummy_entries, dummy_A, coloumns, ex)
    @floop ex for i in 1:2:coloumns-1
        index1 = rand(1:dummy_entries[i])
        index2 = rand(1:dummy_entries[i+1])
        temp = dummy_A[index1, i]
        for j in index1:dummy_entries[i]
            dummy_A[j, i] = dummy_A[j+1, i]
        end
        dummy_entries[i] = dummy_entries[i] - 1
        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1, i+1] = dummy_A[j, i+1]
        end
        dummy_A[index2, i+1] = temp
        dummy_entries[i+1] = dummy_entries[i+1] + 1
        if rand() <= 0.5
            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i], i] .= dummy_A[1:entries[i], i]
            A[1:entries[i+1], i+1] .= dummy_A[1:entries[i+1], i+1]
        end
    end
end
rows, coloumns = 990, 99990
A = zeros(Int64, rows, coloumns)
entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
for i in axes(A, 2)
    A[1:entries[i], i] .= shuffle(1:entries[i])
end
dummy_A = deepcopy(A)
dummy_entries = deepcopy(entries)

test(entries, A, dummy_entries, dummy_A, coloumns, ThreadedEx())

I imagine most of what you’re seeing in your timing is this initialisation outside of the loop:

function test()
    rows, coloumns = 990, 99990
    A = zeros(Int64, rows, coloumns)
    entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
    for i in axes(A, 2)
        A[1:entries[i], i] .= shuffle(1:entries[i])
    end
    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)
end 

julia> @benchmark test()
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  830.651 ms … 948.136 ms  ┊ GC (min … max):  2.56% … 20.25%
 Time  (median):     928.466 ms               ┊ GC (median):    20.84%
 Time  (mean ± σ):   912.143 ms ±  42.364 ms  ┊ GC (mean ± σ):  16.69% ±  7.64%

  ▁                                     ▁           █  ▁      ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁█ ▁
  831 ms           Histogram: frequency by time          948 ms <

 Memory estimate: 1.77 GiB, allocs estimate: 100002.

Yes, @DanielVandH, I didn’t know how to use @benchmark like this

function main()
    @benchmark begin
        test()
    end
end

or any other way to profile from inside a function definition.

In any case, my parallel version should take less time than the serial version regardless of how we use @benchmark, right…?

But unless I misread your posts, it is taking less time? You show a 35-40 ms difference in favour of the parallel version. I think Daniel’s point is that if most of your 750 ms total runtime is actually taken up by the initialisation, then parallelising the loop isn’t going to change the overall runtime much.

Your code is indeed faster with the parallel version - as @nilshg points out, your benchmarks actually show this. Here’s one way to benchmark just the loop itself, which makes the speedup much clearer once the initialisation is excluded.

using FLoops, StatsBase, Random, BenchmarkTools
function test(data, coloumns, ex)
    A, entries, dummy_A, dummy_entries = data
    @floop ex for i in 1:2:coloumns-1
        index1 = rand(1:dummy_entries[i])
        index2 = rand(1:dummy_entries[i+1])
        temp = dummy_A[index1, i]
        for j in index1:dummy_entries[i]
            dummy_A[j, i] = dummy_A[j+1, i]
        end
        dummy_entries[i] = dummy_entries[i] - 1
        for j in dummy_entries[i+1]:-1:index2
            dummy_A[j+1, i+1] = dummy_A[j, i+1]
        end
        dummy_A[index2, i+1] = temp
        dummy_entries[i+1] = dummy_entries[i+1] + 1
        if rand() <= 0.5
            entries[i], entries[i+1] = dummy_entries[i], dummy_entries[i+1]
            A[1:entries[i], i] .= dummy_A[1:entries[i], i]
            A[1:entries[i+1], i+1] .= dummy_A[1:entries[i+1], i+1]
        end
    end
end
function get_data()
    rows, coloumns = 990, 99990
    A = zeros(Int64, rows, coloumns)
    entries = sample(1:floor(Int, 0.75 * rows), coloumns; replace=true)
    for i in axes(A, 2)
        A[1:entries[i], i] .= shuffle(1:entries[i])
    end
    dummy_A = deepcopy(A)
    dummy_entries = deepcopy(entries)
    return A, entries, dummy_A, dummy_entries
end

coloumns = 99990
par_time = @benchmark test(data, $coloumns, $ThreadedEx()) setup = (data = get_data()) evals = 1
ser_time = @benchmark test(data, $coloumns, $SequentialEx()) setup = (data = get_data()) evals = 1

julia> par_time
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  14.980 ms … 20.065 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.672 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.255 ms ±  1.553 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁                ▁▁█                 ▁ ▁                  ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  15 ms           Histogram: frequency by time        20.1 ms <

 Memory estimate: 162.54 MiB, allocs estimate: 642452.

julia> ser_time
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  72.562 ms … 125.290 ms  ┊ GC (min … max):  0.00% … 40.32%
 Time  (median):     89.215 ms               ┊ GC (median):    25.48%
 Time  (mean ± σ):   92.077 ms ±  15.997 ms  ┊ GC (mean ± σ):  25.40% ± 11.91%

  ▁               ▁ █ ▁▁                                     ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  72.6 ms         Histogram: frequency by time          125 ms <

 Memory estimate: 162.49 MiB, allocs estimate: 641492.

I understand why allocations make for slow code in general. But how does that interact with threading? (Is memory handling shared between threads, or something?)

The garbage collector can be triggered by a memory allocation in any one thread, and when it kicks in, every thread is paused for the duration. So, as you add more threads, a larger and larger fraction of your time will be spent garbage collecting.
The solution to this is to implement a multithreaded GC (on the Julia dev side) and to reduce allocations (on the user side).
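
To see the effect directly, here is a rough sketch (my illustration, not code from this thread) of an allocation-heavy threaded loop. Run it with e.g. julia --threads=10: the "% gc time" reported by @time tends to grow with the thread count, because any iteration's allocation can trigger a stop-the-world collection that pauses all threads.

using Base.Threads

function alloc_heavy(n)
    total = Atomic{Float64}(0.0)
    @threads for _ in 1:n
        v = rand(1000)              # fresh ~8 KiB allocation on every iteration
        atomic_add!(total, sum(v))  # atomic accumulator avoids a data race
    end
    return total[]
end

@time alloc_heavy(10^5)  # compare the GC percentage at 1 vs 10 threads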

There may well be other reasons too that I don’t know about.

Dear @DanielVandH and @nilshg, it seems like you are right. My actual code is somewhat like this:

function test(arguments)
    @floop for i in 1:n
        .
        .
        .
    end
end

function main()
    .
    .

    test(arguments)
end

main()

Whenever I use @floop in the test() function, my code gets 3x slower. I am unable to figure out why: the algorithm is very similar to the one explained above, there are very few allocations happening inside the loop, and there is no data race, yet the code still gets slower.

There could be many reasons, but just note that multithreading by itself is not guaranteed to give you faster code - threads take time to get started. Try making sure you get the most performance out of your serial code first and then timing how much a single iteration of your loop actually takes. If the loop is quick, I doubt multithreading will help.
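
One quick way (my suggestion, not something from the thread) to gauge that baseline cost is to time spawning and fetching a single trivial task; if one iteration of your loop is much cheaper than this, threading individual iterations cannot pay off:

using BenchmarkTools

@btime fetch(Threads.@spawn 1 + 1)  # scheduling overhead, typically a few microseconds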

@jakobnissen already gave a very good answer to this. You can often see the effect of the shared GC by comparing the same code when run using @threads and when it’s run using Distributed. In the latter case, Julia will use additional workers, each of which is its own Julia instance and thus has its own GC, and this can often outperform threads in the presence of lots of allocations.
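
As a minimal sketch of the process-based alternative (the work function here is hypothetical): each Distributed worker is a separate process with its own GC, so allocation-heavy work is not serialized through a shared stop-the-world collector.

using Distributed
addprocs(4)                                              # four worker processes

@everywhere work(n) = sum(sum(rand(1000)) for _ in 1:n)  # allocates heavily

results = pmap(_ -> work(10^4), 1:nworkers())            # each worker collects garbage independently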

Aside from the overhead of launching and synchronising threads, running a loop in parallel can also change the order in which the loop is executed, changing the order of memory reads/writes, which can break any automatic SIMD that the compiler does. This can usually be remedied by splitting the loop into contiguous chunks using Iterators.partition or a package like ChunkSplitters.jl, as sketched below.
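
For instance, here is an illustrative sketch (sum_chunked is my own example, not from the thread) of chunking with Iterators.partition, so that each task works through a contiguous block and keeps the cache-friendly, SIMD-able access pattern within its chunk:

using Base.Threads

function sum_chunked(x; nchunks = nthreads())
    chunks = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = map(chunks) do idx
        @spawn sum(@view x[idx])   # contiguous reads within each chunk
    end
    return sum(fetch.(tasks))
end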

Additionally, if the elements that you write to and read from in an array are too close together, you may also suffer from false sharing, which can drastically slow down your code when using multiple threads.
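
A classic illustration (my sketch; it assumes 64-byte cache lines, i.e. 8 Int64s per line): adjacent per-thread counters share a cache line that ping-pongs between cores, while padded counters each get their own line.

using Base.Threads

function count_adjacent!(counts, n)   # counts has length nthreads()
    @threads :static for t in 1:nthreads()
        for _ in 1:n
            counts[t] += 1            # neighbouring slots share a cache line: slow
        end
    end
end

function count_padded!(counts, n)     # counts has length 8 * nthreads()
    @threads :static for t in 1:nthreads()
        for _ in 1:n
            counts[8t] += 1           # one cache line per thread: fast
        end
    end
end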

In short, there are many possible causes of performance degradation with multithreading, and it’s really hard to know what is causing the problem without looking at the real code and experimenting.
