Why `mean` with `dims` argument is so slow?

Consider:

using BenchmarkTools, Statistics
A = randn(4,3,2);
@btime mean($A; dims=1);
#  90.184 ns (1 allocation: 128 bytes)
@btime mean($A);
#  9.891 ns (0 allocations: 0 bytes)

That’s a 10x factor difference! Is there a way to improve over this? A faster way to compute mean along a dimension?

Consider:

julia> A = randn(400,300,200);
julia> using BenchmarkTools
julia> @btime mean($A; dims=1);
  9.917 ms (2 allocations: 468.83 KiB)
julia> @btime mean($A);
  9.615 ms (0 allocations: 0 bytes)

Nanosseconds measures are bullshit.

Perhaps it would be a bit more accurate to say that nanosecond measurements are very delicate because you can very easily end up measuring something other than what you intended to measure. I believe this is what is happening in the benchmark reported by the OP: mean(A) returns a scalar and so it is allocation-free, while mean(A, dims=1) returns a vector and so it must perform at least one allocation, and this allocation ends up dominating the overall runtime.

Well, it’s an 80 ns difference. You can avoid some of it by pre-allocating the output in the shape you want:

julia> using BenchmarkTools, Statistics

julia> A = randn(4,3,2);

julia> @btime mean($A; dims=1);
  101.318 ns (1 allocation: 128 bytes)

julia> @btime mean($A);
  10.427 ns (0 allocations: 0 bytes)

julia> m = zeros(1, 3, 2);

julia> @btime mean!($m, $A);
  62.747 ns (0 allocations: 0 bytes)

Oh, I should have tested on larger matrices!