Using and understanding multi-threading

So, I am trying to understand how multi-threading works. I started Julia with julia -t 16 on a 32-core node where I had reserved 8 cores, then I ran this example from the documentation; these are the lines:

function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    return s
end

function sum_multi_good(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    chunk_sums = fetch.(tasks)
    return sum_single(chunk_sums)
end
then I ran the benchmarks (after loading BenchmarkTools) and here are the results:

julia> @benchmark sum_single(1:1_000_000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.246 ns … 13.705 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.271 ns ±  0.252 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇                          ▂                            ▁ ▁
  ██▄▃▁▅▇▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▄▅▁██▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  3.25 ns      Histogram: log(frequency) by time      3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum_multi_good(1:1_000_000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.700 μs …  4.115 ms  ┊ GC (min … max): 0.00% … 96.91%
 Time  (median):     34.866 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.832 μs ± 57.699 μs  ┊ GC (mean ± σ):  1.98% ±  1.37%

  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▆████████▇▆▇▇▆▆▆▅▅▅▅▅▄▄▄▄▃▄▄▅▆▅▅▅▅▄▄▃▃▃▂▂▂▂ ▃
  11.7 μs         Histogram: frequency by time        62.3 μs <

 Memory estimate: 8.89 KiB, allocs estimate: 106.

Why is the parallel version so much slower than the single-threaded one? The ratio is huge (11.7e3 / 3.25, roughly 3600x)!

In general I don’t seem to get significant gains from multithreading. I am using it to evaluate the stiffness matrices of individual elements in a finite element program before assembling the structural stiffness matrix. Each element evaluation is quite expensive but independent of the others, so it seems like pretty much the best-case scenario for parallel computing, yet I still don’t seem to get significant time savings.
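Roughly, the pattern I use looks like this (simplified, with made-up names, and a dummy computation standing in for the real, much more expensive stiffness evaluation):

```julia
# Stand-in for my expensive per-element work (the real one is far costlier).
function element_stiffness(e)
    k = zeros(4, 4)
    for i in 1:4, j in 1:4
        k[i, j] = e + i * j  # dummy computation for illustration
    end
    return k
end

# Evaluate all element stiffness matrices in parallel; each iteration
# is independent and writes to its own slot, so @threads is safe here.
function all_stiffness(n)
    Ke = Vector{Matrix{Float64}}(undef, n)
    Threads.@threads for e in 1:n
        Ke[e] = element_stiffness(e)
    end
    return Ke
end
```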

many thanks in advance!

The compiler is simply too smart for your simple example!
To see this, you can have a look at @code_llvm sum_single(1:1_000_000) (or @code_native if you prefer) to inspect how the compiler has digested your call, and you will notice that there is no loop remaining… That’s because LLVM has a built-in optimization for this particular kind of computation: it recognizes the summation over a range and replaces the loop with a closed-form expression computed from the first and last elements. Obviously, that is much cheaper than spawning tasks to do parallel work, so however fast sum_multi_good can go, sum_single will always be faster here.

By the way, notice that the timing for sum_single is on the order of a nanosecond, which is about the precision limit of @benchmark (benchmarking 2*7 would probably yield a similar timing). This also confirms that your sum_single call is barely doing any computation. You could also check that the timing does not depend on the length of the input range.
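To make the closed-form idea concrete, here is a small sketch (closed_form is a made-up helper, not something the compiler exposes): the sum of 1:n is just n*(n+1)/2, which is the kind of loop-free arithmetic LLVM ends up emitting.

```julia
# Made-up helper illustrating the closed-form result that LLVM derives
# for a sum over a range: no loop, just arithmetic on the endpoints.
closed_form(n) = n * (n + 1) ÷ 2

closed_form(1_000_000) == sum(1:1_000_000)  # true
```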

To bypass this cleverness of the compiler, you should give it an input that it cannot optimize away. The easiest is probably to feed sum_single with an array instead of a range (note the $ interpolation below, so that the array is treated as a pre-computed value and collect is not part of the benchmark). This is what I get using 10 threads:

julia> @benchmark sum_single($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  356.141 μs …  1.310 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.795 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   368.972 μs ± 18.412 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██████▆▅▄▄▄▅▅▄▄▄▄▄▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  356 μs          Histogram: frequency by time          431 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum_multi_good($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.893 μs …  6.037 ms  ┊ GC (min … max): 0.00% … 94.62%
 Time  (median):     54.895 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.975 μs ± 78.316 μs  ┊ GC (mean ± σ):  1.81% ±  1.34%

  ▁▁▂▂▂▃▄▆▇███████████████████▇▇▇▅▅▅▅▄▅▄▄▄▃▄▃▃▃▂▃▂▂▂▂▂▂▂▂▁▂▁▁▁ ▄
  40.9 μs         Histogram: frequency by time        83.7 μs <

 Memory estimate: 11.33 KiB, allocs estimate: 130.

Just a last note on performance: your current code is type unstable, because fetch is inherently type unstable (the compiler cannot infer the return type of a Task), so you should annotate the result to give the type information to the compiler. In this instance, you could write

T = eltype(a)
chunk_sums = T[fetch(t) for t in tasks]

instead of

chunk_sums = fetch.(tasks)

for example.
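Putting it together, a complete type-stable variant might look like this (sum_multi_stable is just a name I made up; it reuses sum_single from your post):

```julia
function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    return s
end

function sum_multi_stable(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    # Typed comprehension instead of fetch.(tasks): the element type is
    # now known to the compiler, avoiding the type instability of fetch.
    T = eltype(a)
    chunk_sums = T[fetch(t) for t in tasks]
    return sum_single(chunk_sums)
end
```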