Using and understanding multi-threading

So, I am trying to understand how multi-threading works. I started Julia with julia -t 16 on a 32-core node where I had reserved 8 cores, then I ran this example from the documentation; these are the lines:

function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    return s
end

function sum_multi_good(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    chunk_sums = fetch.(tasks)
    return sum_single(chunk_sums)
end
then I ran the benchmarks (after loading BenchmarkTools) and here are the results:

julia> @benchmark sum_single(1:1_000_000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.246 ns … 13.705 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.271 ns ±  0.252 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇                          ▂                            ▁ ▁
  ██▄▃▁▅▇▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▄▅▁██▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  3.25 ns      Histogram: log(frequency) by time      3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum_multi_good(1:1_000_000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.700 μs …  4.115 ms  ┊ GC (min … max): 0.00% … 96.91%
 Time  (median):     34.866 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.832 μs ± 57.699 μs  ┊ GC (mean ± σ):  1.98% ±  1.37%

  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▆████████▇▆▇▇▆▆▆▅▅▅▅▅▄▄▄▄▃▄▄▅▆▅▅▅▅▄▄▃▃▃▂▂▂▂ ▃
  11.7 μs         Histogram: frequency by time        62.3 μs <

 Memory estimate: 8.89 KiB, allocs estimate: 106.

Why is the parallel version so much slower than the single-threaded one? The ratio is huge (11.7e3 / 3.25, roughly 3600x)!

In general I don’t seem to get significant gains from multithreading. I am using it to evaluate the stiffness matrices of individual elements in a finite element program before assembling the structural stiffness matrix. Each element evaluation is quite expensive but independent of the others, so it seems like pretty much the best-case scenario for parallel computing, yet I still don’t seem to get significant time savings.
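Roughly, the pattern I use looks like this (simplified, with made-up names, and a dummy computation standing in for the real, much more expensive stiffness evaluation):

```julia
# Stand-in for my expensive per-element work (the real one is far costlier).
function element_stiffness(e)
    k = zeros(4, 4)
    for i in 1:4, j in 1:4
        k[i, j] = e + i * j  # dummy computation for illustration
    end
    return k
end

# Evaluate all element stiffness matrices in parallel; each iteration
# is independent and writes to its own slot, so @threads is safe here.
function all_stiffness(n)
    Ke = Vector{Matrix{Float64}}(undef, n)
    Threads.@threads for e in 1:n
        Ke[e] = element_stiffness(e)
    end
    return Ke
end
```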

many thanks in advance!

The compiler is simply too smart for your simple example!
To see this, you can have a look at @code_llvm sum_single(1:1_000_000) (or @code_native if you prefer) to inspect how the compiler has digested your call, and you will notice that there is no loop remaining… That’s because LLVM has a built-in optimization for this particular kind of computation: it recognizes the summation over a range and replaces the loop with a closed-form expression computed from the first and last elements. Obviously, that is much cheaper than spawning tasks to do parallel work, so however fast sum_multi_good can go, sum_single will always be faster here.

By the way, notice that the timing for sum_single is on the order of a nanosecond, which is about the precision limit of @benchmark (benchmarking 2*7 would probably yield a similar timing). This also confirms that your sum_single call is barely doing any computation. You could also check that the timing does not depend on the length of the input range.
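To make the closed-form idea concrete, here is a small sketch (closed_form is a made-up helper, not something the compiler exposes): the sum of 1:n is just n*(n+1)/2, which is the kind of loop-free arithmetic LLVM ends up emitting.

```julia
# Made-up helper illustrating the closed-form result that LLVM derives
# for a sum over a range: no loop, just arithmetic on the endpoints.
closed_form(n) = n * (n + 1) ÷ 2

closed_form(1_000_000) == sum(1:1_000_000)  # true
```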

To bypass this cleverness of the compiler, you should give it an input that it cannot optimize away. The easiest is probably to feed sum_single with an array instead of a range (note the $ interpolation below, so that the array is treated as a pre-computed value and collect is not part of the benchmark). This is what I get using 10 threads:

julia> @benchmark sum_single($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  356.141 μs …  1.310 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.795 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   368.972 μs ± 18.412 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██████▆▅▄▄▄▅▅▄▄▄▄▄▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  356 μs          Histogram: frequency by time          431 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum_multi_good($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.893 μs …  6.037 ms  ┊ GC (min … max): 0.00% … 94.62%
 Time  (median):     54.895 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.975 μs ± 78.316 μs  ┊ GC (mean ± σ):  1.81% ±  1.34%

  ▁▁▂▂▂▃▄▆▇███████████████████▇▇▇▅▅▅▅▄▅▄▄▄▃▄▃▃▃▂▃▂▂▂▂▂▂▂▂▁▂▁▁▁ ▄
  40.9 μs         Histogram: frequency by time        83.7 μs <

 Memory estimate: 11.33 KiB, allocs estimate: 130.

Just a last note on performance: your current code is type unstable, because fetch is inherently type unstable (the compiler cannot infer the return type of a Task), so you should annotate the result to give the type information to the compiler. In this instance, you could write

T = eltype(a)
chunk_sums = T[fetch(t) for t in tasks]

instead of

chunk_sums = fetch.(tasks)

for example.
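Putting it together, a complete type-stable variant might look like this (sum_multi_stable is just a name I made up; it reuses sum_single from your post):

```julia
function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    return s
end

function sum_multi_stable(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    # Typed comprehension instead of fetch.(tasks): the element type is
    # now known to the compiler, avoiding the type instability of fetch.
    T = eltype(a)
    chunk_sums = T[fetch(t) for t in tasks]
    return sum_single(chunk_sums)
end
```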