Why does a non-allocating function seem to allocate when being called in a multi-threaded map?

I am measuring allocations using TimerOutputs. Here is a MWE:

julia> Threads.nthreads()
2

julia> using TimerOutputs, Folds

julia> f!(M) = begin
       t = TimerOutput()
       @timeit t "update" M .= 1;
       println(t)
       M
       end;

julia> Ms = [zeros(2000, 2000) for i in 1:2Threads.nthreads()];

julia> @time f!(Ms[1]); # non-allocating, as expected
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:     3.56ms /  99.4%           1.55KiB /   0.0%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 update         1   3.53ms  100.0%  3.53ms     0.00B     - %    0.00B
 ────────────────────────────────────────────────────────────────────
  0.004002 seconds (252 allocations: 18.156 KiB)

julia> g(Ms) = Folds.map(f!, Ms);

julia> g(Ms);
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:     6.31ms /  99.7%           3.09KiB /  50.0%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 update         1   6.29ms  100.0%  6.29ms   1.55KiB  100.0%  1.55KiB
 ────────────────────────────────────────────────────────────────────
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:     6.92ms /  92.2%           20.0KiB /  30.3%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 update         1   6.38ms  100.0%  6.38ms   6.06KiB  100.0%  6.06KiB
 ────────────────────────────────────────────────────────────────────
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:     6.08ms /  89.2%           18.0KiB /  33.5%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 update         1   5.42ms  100.0%  5.42ms   6.05KiB  100.0%  6.05KiB
 ────────────────────────────────────────────────────────────────────
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations      
                   ───────────────────────   ────────────────────────
 Tot / % measured:     6.69ms /  99.6%           18.4KiB /  91.6%    

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 update         1   6.67ms  100.0%  6.67ms   16.9KiB  100.0%  16.9KiB
 ────────────────────────────────────────────────────────────────────

Where are these allocations coming from? They also seem random, and change from one call to another, which makes me wonder if these are real at all.

Threading primitives have to allocate as well, e.g. the spawned tasks that are used for multithreading.

I’m also not sure if @time and TimerOutputs are suitable for stable allocation tracking, do you see the same behavior with a custom threaded loop and BenchmarkTools?

Indeed, the issue seems to disappear using BenchmarkTools. Perhaps it’s an intrinsic limitation of TimerOutputs.

julia> f!(M) = @btime $M .= 1 evals=1;

julia> function g(Ms)
           Threads.@threads for i in eachindex(Ms)
               M = Ms[i]
               f!(M)
           end
           Ms
       end;

julia> M = [zeros(2000, 2000) for i in 1:2Threads.nthreads()];

julia> g(M);
  3.216 ms (0 allocations: 0 bytes)
  3.265 ms (0 allocations: 0 bytes)
  3.169 ms (0 allocations: 0 bytes)
  2.902 ms (0 allocations: 0 bytes)

julia> using Folds

julia> g2(M) = Folds.map(f!, M);

julia> g2(M);
  3.202 ms (0 allocations: 0 bytes)
  3.156 ms (0 allocations: 0 bytes)
  2.844 ms (0 allocations: 0 bytes)
  2.925 ms (0 allocations: 0 bytes)
1 Like