Extra allocation when trying to reuse pre-allocated vector

v-i-s-h · June 26, 2020, 3:47pm

Hi,

I am trying to generate data for logistic regression y (which is {0, 1}) from X and β. I already allocate the space for y and then wrote function to computing the same in two different ways. I then use TimerOutputs.jl to analyze the time and allocations.

Approach 1: Here the linear combination is first computed as z and is then used for generating samples.

function generate1!(y, X, β)
    @timeit "X * β" z = X * β 
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
    return y
end

Since, in the above approach, z is an auxiliary variable with new allocation (?), I thought of reusing y to compute the linear combination thinking it will save some allocations and hence time.
Approach 2: Here y is reused to store the intermediate result of X * β

function generate2!(y, X, β)
    @timeit "X * β" y .= X * β  # reuse already allocated "y"
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
    return y
end

Now I test it using the following code.

# Test data
X = randn(1000, 50)
β = rand(size(X, 2))
y = Vector(undef, size(X, 1))

# do precompile - both the test functions as well as timers
reset_timer!()
@timeit "" generate1!(y, X, β);
@timeit "" generate2!(y, X, β);

# Actual test
reset_timer!()
for i = 1:100
@timeit "generate1!()" generate1!(y, X, β);
@timeit "generate2!()" generate2!(y, X, β);
end
print_timer()

OUTPUT: After running, the second approach takes more time and more allocations.

 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        1.09s / 3.77%           83.5MiB / 8.28%    

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 generate1!()      100   11.3ms  27.5%   113μs   1.55MiB  22.4%  15.9KiB
   z = X * β       100   2.72ms  6.62%  27.2μs    794KiB  11.2%  7.94KiB
   p               100   5.16ms  12.6%  51.6μs    794KiB  11.2%  7.94KiB
   y               100   3.17ms  7.72%  31.7μs     0.00B  0.00%    0.00B
 generate2!()      100   29.8ms  72.5%   298μs   5.37MiB  77.6%  55.0KiB
   y .= X * β      100   5.05ms  12.3%  50.5μs   2.30MiB  33.3%  23.6KiB   <<< Extra allocation even on reuse of "y"
   p               100   20.3ms  49.6%   203μs   3.06MiB  44.2%  31.3KiB   <<< Extra allocation; don't know why
   y               100   4.17ms  10.1%  41.7μs   7.81KiB  0.11%    80.0B   <<< Extra allocation; don't know why

 ───────────────────────────────────────────────────────────────────────

Why this is happening?

Ronis_BR · June 26, 2020, 4:20pm

Because of the matrix multiplication. Use the in-place mul! to avoid that.

EDIT: The mul! is defined inside the LinearAlgebra module.

rdeits · June 26, 2020, 4:30pm

This operation is not fully in-place. The * is not a broadcasted function, so this computes a new vector to hold the result of X * beta and then copies that vector into y. Matrix-vector multiplication is not an elementwise operation, so broadcasting isn’t really enough to save you from allocations here.

There’s a further problem which is that your y is a Vector{Any}. The performance cost of Vector{Any} completely dominates everything else in this example. You want Vector{Float64}(undef, size(X, 1)).

Fixing that, and using mul!, we can see the difference:

julia> function f1(y, A, B)
         y .= A * B
       end
f1 (generic function with 1 method)

julia> function f2(y, A, B)
         mul!(y, A, B)
       end
f2 (generic function with 1 method)

julia> @btime f1(y_any, $X, $β) setup=(y_any = Vector(undef, size($X, 1)));
  17.286 μs (1001 allocations: 23.56 KiB)

julia> @btime f1(y_concrete, $X, $β) setup=(y_concrete = Vector{Float64}(undef, size($X, 1)));
  6.816 μs (1 allocation: 7.94 KiB)

julia> @btime f2(y_concrete, $X, $β) setup=(y_concrete = Vector{Float64}(undef, size($X, 1)));
  5.798 μs (0 allocations: 0 bytes)

v-i-s-h · June 26, 2020, 4:47pm

Hi @Ronis_BR

Using mul!() increased the allocations
Code

function generate3!(y, X, β)
    @timeit "mul!(y, X, β)" mul!(y, X, β)  # reuse allocated "y"
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
end

Results

 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          1.61s / 33.3%            240MiB / 67.8%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 generate3!()         100    491ms  91.9%  4.91ms    156MiB  95.7%  1.56MiB
   mul!(y, X, β)      100    465ms  87.0%  4.65ms    153MiB  93.9%  1.53MiB   <<< here!!
   p                  100   20.1ms  3.76%   201μs   3.06MiB  1.88%  31.3KiB
   y                  100   5.64ms  1.06%  56.4μs   7.81KiB  0.00%    80.0B
 generate2!()         100   29.5ms  5.51%   295μs   5.37MiB  3.30%  55.0KiB
   p                  100   19.1ms  3.57%   191μs   3.06MiB  1.88%  31.3KiB
   y .= X * β         100   5.89ms  1.10%  58.9μs   2.30MiB  1.42%  23.6KiB
   y                  100   4.15ms  0.78%  41.5μs   7.81KiB  0.00%    80.0B
 generate1!()         100   13.7ms  2.56%   137μs   1.55MiB  0.96%  15.9KiB
   z = X * β          100   5.44ms  1.02%  54.4μs    794KiB  0.48%  7.94KiB
   p                  100   4.74ms  0.89%  47.4μs    794KiB  0.48%  7.94KiB
   y                  100   3.17ms  0.59%  31.7μs     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────

rdeits · June 26, 2020, 4:49pm

It’s because your y is a Vector{Any}. I’ve got more detail on this in my post above.

v-i-s-h · June 26, 2020, 4:54pm

Hi @rdeits, Thanks a lot!. This helped.

y = Vector{eltype(X)}(undef, size(X, 1))

New results

 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          1.16s / 2.61%           81.0MiB / 4.79%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 generate2!()         100   10.4ms  34.1%   104μs   1.55MiB  40.0%  15.9KiB
   p                  100   5.20ms  17.1%  52.0μs    794KiB  20.0%  7.94KiB
   y                  100   3.07ms  10.1%  30.7μs     0.00B  0.00%    0.00B
   y .= X * β         100   1.92ms  6.32%  19.2μs    794KiB  20.0%  7.94KiB
 generate1!()         100   10.3ms  34.0%   103μs   1.55MiB  40.0%  15.9KiB
   p                  100   5.21ms  17.2%  52.1μs    794KiB  20.0%  7.94KiB
   y                  100   2.97ms  9.77%  29.7μs     0.00B  0.00%    0.00B
   z = X * β          100   1.95ms  6.41%  19.5μs    794KiB  20.0%  7.94KiB
 generate3!()         100   9.69ms  31.9%  96.9μs    796KiB  20.0%  7.96KiB   <<< Best!!
   p                  100   5.12ms  16.8%  51.2μs    794KiB  20.0%  7.94KiB
   y                  100   3.10ms  10.2%  31.0μs     0.00B  0.00%    0.00B
   mul!(y, X, β)      100   1.28ms  4.22%  12.8μs     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────

jondeuce · June 27, 2020, 9:13am

Note that p = logistic.(y) also allocates a new vector p. Replacing p = logistic.(y); y .= rand.(Bernoulli.(p)) with y .= rand.(Bernoulli.(logistic.(y))) would remove more allocations, and likely give a bit more speedup.

Topic		Replies	Views
Help to reduce number of allocations General Usage	23	5489	January 6, 2019
Common allocation mistakes Performance memory-allocation	47	7133	August 21, 2023
Too many allocations in functions call General Usage	6	216	January 25, 2025
Allocation of new variables vs modification of existing variables New to Julia	11	2275	January 12, 2017
Allocation when applying in place multiplication to matrix of SVector Performance question	2	190	March 15, 2023

Extra allocation when trying to reuse pre-allocated vector

Related topics