So, I am trying to understand how multithreading works. I started Julia as julia -t 16 on a 32-core node where I had reserved 8 cores, and then I ran the example from the documentation; these are the lines:
function sum_single(a)
    s = 0
    for i in a
        s += i
    end
    s
end

function sum_multi_good(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    chunk_sums = fetch.(tasks)
    return sum_single(chunk_sums)
end
Then I benchmarked both functions (after loading BenchmarkTools).
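For reference, benchmarking them looks like this (using the same input range as the docs example; the exact invocations are assumed):

using BenchmarkTools

a = 1:1_000_000
@benchmark sum_single($a)
@benchmark sum_multi_good($a)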
Why is the parallel version so much slower than the single-threaded one? (11.7e3 vs 3.25!)
In general I don't seem to get significant gains from multithreading. I am using it to evaluate the stiffness matrices of individual elements in a finite element program before assembling the structural stiffness matrix. Each element evaluation is quite expensive, but the elements are independent of each other, so it looks like a best-case scenario for parallel computing; still, I don't seem to get significant time savings.
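Schematically, the element loop is this pattern (element_stiffness, mesh, and n_elements are placeholders, not my real names):

using Base.Threads

# placeholder names: element_stiffness, mesh, n_elements stand in for the real code
Ke = Vector{Matrix{Float64}}(undef, n_elements)
@threads for e in 1:n_elements
    # each element's stiffness matrix is expensive but independent of the others
    Ke[e] = element_stiffness(mesh, e)
end
# the Ke are then assembled into the global stiffness matrix serially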
The compiler is simply too smart for your simple example!
To see this, you can have a look at @code_llvm sum_single(1:1_000_000) (or @code_native if you prefer) to inspect how the compiler has digested your call, and you will notice that there is no loop remaining. That's because LLVM has a built-in optimization for this particular kind of computation: it just looks at the first and last elements of the range and computes the result in closed form. Obviously, that is much cheaper than allocating tasks to do parallel work, so however fast sum_multi_good can go, sum_single will always be faster here. By the way, you can notice that the timing for sum_single is around a nanosecond, which is of the order of the precision of @benchmark (benchmarking 2*7 would probably yield a similar timing); this also confirms that your sum_single call is barely doing any computation. You could also check that the timing does not depend on the length of the input range.
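For example, in the REPL (the exact IR will vary with your Julia and LLVM versions):

using InteractiveUtils   # provides @code_llvm (loaded automatically in the REPL)
using BenchmarkTools

@code_llvm sum_single(1:1_000_000)   # no loop in the IR: the sum is computed in closed form

# the timing is length-independent, another sign that nothing is actually summed:
@btime sum_single(1:1_000)
@btime sum_single(1:1_000_000_000)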
To bypass this cleverness of the compiler, you should input something that it cannot optimize away. The easiest is probably to feed sum_single with an array, instead of a range. This is what I get using 10 threads:
julia> @benchmark sum_single($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  356.141 μs … 1.310 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     363.795 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   368.972 μs ± 18.412 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
  356 μs          Histogram: frequency by time          431 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum_multi_good($(collect(1:1_000_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.893 μs … 6.037 ms   ┊ GC (min … max): 0.00% … 94.62%
 Time  (median):     54.895 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.975 μs ± 78.316 μs  ┊ GC (mean ± σ):  1.81% ± 1.34%
 [histogram omitted]
  40.9 μs          Histogram: frequency by time          83.7 μs <
 Memory estimate: 11.33 KiB, allocs estimate: 130.
With the array input, the multithreaded version is now a clear win: the median time drops from about 364 μs to 55 μs, roughly a 6.6× speedup on 10 threads.

Just a last note on performance: your current code is type unstable, because fetch is inherently type unstable (it's impossible for the compiler to guess the return type of a Task), so you should annotate the call to give the type to the compiler. In this instance, you could write
T = eltype(a)                              # element type of the input
chunk_sums = T[fetch(t) for t in tasks]    # typed comprehension pins the result type
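Spliced into the function, this gives the following (a sketch; sum_multi_stable is just a name to distinguish it from the original):

function sum_multi_stable(a)
    chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
    tasks = map(chunks) do chunk
        Threads.@spawn sum_single(chunk)
    end
    T = eltype(a)
    chunk_sums = T[fetch(t) for t in tasks]   # annotated: plain fetch.(tasks) is type unstable
    return sum_single(chunk_sums)
end

An equivalent alternative is to assert the type at each call, e.g. fetch(t)::T.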