How to avoid unnecessary memory allocation for inplace estimation of `mean`?

I’m doing some micro-optimization of my code, and the current bottleneck is a place where I call mean millions of times on small 2d arrays, which leads to a huge number of allocations. Simply re-implementing mean removes them all. A minimal example is the following:

using Statistics
using BenchmarkTools

function base_mean(xs::Array{Matrix{Float64}, 1})
    tm = [0.0, 0.0]
    for x in xs
        tm .= vec(mean(x, dims=1))  # column means as a 1×2 array, flattened into tm
    end
    return tm
end

function custom_mean(xs::Array{Matrix{Float64}, 1})
    tm = [0.0, 0.0]
    for x in xs
        tm .= 0.0
        for j in 1:size(x, 2)    # accumulate each column sum by hand
            for i in 1:size(x, 1)
                tm[j] += x[i,j]
            end
            tm[j] /= size(x, 1)  # divide once per column to get the mean
        end
    end
    return tm
end

t_xs = [rand(20, 2) for i in 1:10000];
println(@benchmark base_mean(t_xs))
println(@benchmark custom_mean(t_xs))

Results for base_mean:

BenchmarkTools.Trial: 
  memory estimate:  1.68 MiB
  allocs estimate:  30001
  --------------
  minimum time:     1.491 ms (0.00% GC)
  median time:      1.529 ms (0.00% GC)
  mean time:        2.193 ms (29.34% GC)
  maximum time:     18.634 ms (91.35% GC)
  --------------
  samples:          2273
  evals/sample:     1

For custom_mean:

BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     347.485 μs (0.00% GC)
  median time:      352.084 μs (0.00% GC)
  mean time:        356.737 μs (0.00% GC)
  maximum time:     20.283 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

So it’s 5 times faster and uses no memory. According to the manual, .= should fuse and write in place without extra allocations. My hypothesis was that the problem is that mean returns a 2d array, which I then convert to 1d with vec. But first, keeping the mean 2d doesn’t solve the problem (benchmark below), and second, I need 1d…

function base_mean_2d(xs::Array{Matrix{Float64}, 1})
    tm = [0.0 0.0]
    for x in xs
        tm .= mean(x, dims=1)
    end
    return tm
end

@benchmark base_mean_2d(t_xs)
BenchmarkTools.Trial: 
  memory estimate:  937.59 KiB
  allocs estimate:  10001
  --------------
  minimum time:     1.046 ms (0.00% GC)
  median time:      1.058 ms (0.00% GC)
  mean time:        1.393 ms (22.87% GC)
  maximum time:     22.613 ms (94.87% GC)
  --------------
  samples:          3583
  evals/sample:     1

You’re looking for mean!.

julia> @benchmark custom_mean($t_xs)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     735.905 μs (0.00% GC)
  median time:      741.241 μs (0.00% GC)
  mean time:        743.940 μs (0.00% GC)
  maximum time:     2.455 ms (0.00% GC)
  --------------
  samples:          6687
  evals/sample:     1

julia> @benchmark base_mean($t_xs)
BenchmarkTools.Trial:
  memory estimate:  1.68 MiB
  allocs estimate:  30001
  --------------
  minimum time:     1.491 ms (0.00% GC)
  median time:      1.513 ms (0.00% GC)
  mean time:        1.630 ms (6.48% GC)
  maximum time:     9.847 ms (84.13% GC)
  --------------
  samples:          3061
  evals/sample:     1

julia> function base_mean2(xs)
           tm = [0.0 0.0]
           for x ∈ xs
               mean!(tm, x)
           end
           tm
       end
base_mean2 (generic function with 1 method)

julia> @benchmark base_mean2($t_xs)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     616.917 μs (0.00% GC)
  median time:      627.587 μs (0.00% GC)
  mean time:        629.694 μs (0.00% GC)
  maximum time:     2.118 ms (0.00% GC)
  --------------
  samples:          7848
  evals/sample:     1

The problem with the broadcasted approach is that mean allocates the return array and then copies its contents over into tm. The in-place version, mean!, lets you pass in a preallocated output so that it can write into it directly.
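As a minimal sketch of the difference (exact allocation counts vary by Julia version):

julia> using Statistics

julia> x = rand(20, 2); tm = zeros(1, 2);

julia> tm .= mean(x, dims=1);   # RHS materializes a fresh 1×2 array; broadcast then copies it into tm

julia> mean!(tm, x);            # reduces over dims=1 directly into the preallocated tm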


Thanks, Elrod! That solves the problem for the 2d case, but for the 1d case it’s not that great:

function base_mean3(xs::Array{Matrix{Float64}, 1})
    tm = [0.0, 0.0]
    for x in xs
        mean!(tm, x')  # the lazy transpose wrapper is heap-allocated on every iteration
    end
    return tm
end

@benchmark base_mean3(t_xs)
BenchmarkTools.Trial: 
  memory estimate:  156.34 KiB
  allocs estimate:  10001
  --------------
  minimum time:     901.816 μs (0.00% GC)
  median time:      909.589 μs (0.00% GC)
  mean time:        974.607 μs (5.16% GC)
  maximum time:     20.935 ms (0.00% GC)
  --------------
  samples:          5092
  evals/sample:     1

So it still allocates some memory (much less, though), and it’s about 3 times slower than the custom implementation (indeed, it’s also slower in the 2d case).

It stops allocating memory only if I pre-transpose all the arrays and remove the transposition from the function, and even then it’s 2 times slower than custom_mean. Not to mention that it would require changing the data storage format in the whole program.

function base_mean4(xs::Array{Matrix{Float64}, 1})
    tm = [0.0, 0.0]
    for x in xs
        mean!(tm, x)
    end
    return tm
end

t_xs_tr = [rand(2, 20) for i in 1:10000];

@benchmark base_mean4(t_xs_tr)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     733.780 μs (0.00% GC)
  median time:      736.829 μs (0.00% GC)
  mean time:        743.638 μs (0.00% GC)
  maximum time:     4.102 ms (0.00% GC)
  --------------
  samples:          6689
  evals/sample:     1

It’s unfortunate that transposing allocates memory when the variable is heap-allocated and escapes (the same problem as with views).
Perhaps instead of transposing each x in the loop, you can transpose tm before the loop?
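Something along these lines (a sketch; base_mean5 is a made-up name, and it relies on mean! accepting the lazy Transpose destination, which should work via the generic sum! fallback):

using Statistics

function base_mean5(xs::Array{Matrix{Float64}, 1})
    tm = [0.0, 0.0]
    tm_row = transpose(tm)  # lazy 1×2 wrapper sharing memory with tm, built once
    for x in xs
        mean!(tm_row, x)    # reduces over dims=1, writing through the wrapper into tm
    end
    return tm
end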

Alternatively, you could define tm like I did:

julia> tm = [0.0 0.0]
1×2 Array{Float64,2}:
 0.0  0.0

In my full program, tm is a field of a class that is conceptually 1d, so changing it to 2d would be really confusing. And as for transposing x before the loop, please see the benchmark above (base_mean4): it’s still slower.

tm is a field of a class that is conceptually 1d

Could you transpose tm before the loop?

Nope, I need to store the result exactly in these classes. But in terms of the final implementation of this particular case, I’m fine with the custom mean (or even with base_mean4). I just got overwhelmed by the large memory allocations for small operations and wondered whether there is a general pattern for avoiding them. The example with mean was the simplest part of my code I could extract, so I posted it hoping to get some insights…

In particular, when I replaced the mean estimation in my code and ran the whole simulation on 10 threads, it allocated 30% less memory, but the GC time stayed the same and performance increased only slightly.
Results of @time before:

565.496121 seconds (2.96 G allocations: 253.154 GiB, 86.91% gc time)

and after:

549.733371 seconds (947.10 M allocations: 174.102 GiB, 89.40% gc time)

Depending on the problem, you could have a look at https://github.com/JuliaArrays/StaticArrays.jl, since you mentioned that you are working with “small” allocations. If you know the size of your data structures in advance, this can help speed things up.
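For example, a hypothetical sketch where each 20×2 matrix is instead stored as a vector of 20 SVector{2} rows (t_xs_static and static_mean are made-up names):

using StaticArrays, Statistics

t_xs_static = [rand(SVector{2, Float64}, 20) for _ in 1:10000];

function static_mean(xs::Vector{Vector{SVector{2, Float64}}})
    tm = @SVector zeros(2)
    for x in xs
        tm = mean(x)  # sums the SVectors and divides; the result stays on the stack
    end
    return tm
end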


This suggests that mean wasn’t your primary bottleneck time-wise. The tip from @Elrod about using the mutating version mean! can probably be applied in other places in your code as well. Judging by the large percentage of time spent doing GC, you’re creating a lot of intermediate objects and throwing them away just as fast as you’re creating them. I don’t know if that’s part of the nature of your problem, but if you can reduce the GC time to, say, 20% by reusing memory, you’d have a run time of ~73 s (see the sketch after the summary).

Summary

549.733371 s × (1 − 0.894) ≈ 58.27 s of non-GC work
58.27 s / 0.8 ≈ 72.8 s total if GC took only 20%

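As a generic illustration of that reuse pattern (a sketch with made-up names; it assumes the per-matrix results are actually needed, unlike in the simplified example):

using Statistics

function all_means!(out::Vector{Matrix{Float64}}, xs)
    for (k, x) in enumerate(xs)
        mean!(out[k], x)  # fills each preallocated 1×2 buffer in place
    end
    return out
end

out = [zeros(1, 2) for _ in 1:length(t_xs)];  # scratch created once, reused every run
all_means!(out, t_xs);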

It is a bit distracting that your function throws away all the results except the last one, so that you can replace the whole thing with

mean(last(t_xs); dims=1)

I realize that this is probably just a side effect of simplifying the problem for demonstration purposes, but it makes it hard to look at your code and identify other possible optimizations.

Could you clarify whether you are just trying to figure out how to optimize mean(::Matrix; dims=1), or whether there is some more structure to this particular sub-problem?


Anytime I see a mention of “small” matrices, my first thought goes to StaticArrays. Have you tried using those instead? Maybe they would not have an advantage here because you are just looping through them, but it could be worth a shot.


@tbeason, @laborg, thanks so much! StaticArrays did the trick! Now all the small operations that I didn’t expect to consume memory actually don’t consume any, which helped me find several places where large arrays are created and lead to allocations. So I still have a lot of allocations, but at least now I understand where they come from.

522.473471 seconds (842.00 M allocations: 137.613 GiB, 88.45% gc time)

What I don’t understand is why reducing the amount of allocated memory by ~50% still didn’t change the GC time. Could it be a problem with @time on multi-threaded code? And is there maybe a way to tune the GC not to be so aggressive?

Yeah, otherwise I’d need to create an output array of means and store the results there, which would make the example longer. My main goal was to understand how to make tiny operations efficient, and I was indeed thinking that in C++ I’d simply use the stack with zero heap allocation for this. I didn’t know such things were possible in Julia, so StaticArrays looks like the perfect answer to my question. Thanks!
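For the record, a hypothetical sketch of how the “conceptually 1d” field could become an SVector (Tracker and update_mean! are made-up names):

using StaticArrays, Statistics

mutable struct Tracker
    tm::SVector{2, Float64}  # immutable value stored inline; updated by rebinding
end

function update_mean!(tr::Tracker, x::Vector{SVector{2, Float64}})
    tr.tm = mean(x)  # stack-computed SVector assigned into the field, no heap garbage
    return tr
end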