It seems that compilation time is not properly computed when a function is run multi-threaded, apparently increasing with the number of threads. For example, take the standard sum example of the docs:
julia> function sum_single(a)
s = 0
for i in a
s += i
end
s
end
sum_single (generic function with 1 method)
julia> function sum_multi_good(a)
chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
tasks = map(chunks) do chunk
Threads.@spawn sum_single(chunk)
end
chunk_sums = fetch.(tasks)
return sum_single(chunk_sums)
end
sum_multi_good (generic function with 1 method)
If I start Julia with 4 threads, I get:
julia> @time sum_multi_good(1:4_000_000)
0.040086 seconds (49.33 k allocations: 3.509 MiB, 148.52% compilation time)
8000002000000
julia> @time sum_multi_good(1:4_000_000)
0.000185 seconds (34 allocations: 2.516 KiB)
8000002000000
Now if I start with 8 threads, but increase the workload by 2, I get:
julia> @time sum_multi_good(1:8_000_000)
0.040237 seconds (49.86 k allocations: 3.537 MiB, 335.67% compilation time)
32000004000000
julia> @time sum_multi_good(1:8_000_000)
0.000079 seconds (58 allocations: 4.641 KiB)
32000004000000
And now with 16 threads:
julia> @time sum_multi_good(1:16_000_000)
0.040438 seconds (50.94 k allocations: 3.592 MiB, 484.09% compilation time)
128000008000000
julia> @time sum_multi_good(1:16_000_000)
0.000252 seconds (106 allocations: 8.891 KiB)
128000008000000
Note how the excution time is similar for the first run in all cases, but reported compilation time is increasing.
With 32 threads I get:
julia> @time sum_multi_good(1:32_000_000)
0.105830 seconds (53.08 k allocations: 3.703 MiB, 1804.13% compilation time)
512000016000000
(my computer does not have such many cores).
These compilation times do not seem to make sense. Do they?