Extra allocation when trying to reuse pre-allocated vector

Hi,

I am trying to generate data for logistic regression y (which is {0, 1}) from X and β. I already allocate the space for y and then wrote function to computing the same in two different ways. I then use TimerOutputs.jl to analyze the time and allocations.

Approach 1: Here the linear combination is first computed as z and is then used for generating samples.

function generate1!(y, X, β)
    @timeit "X * β" z = X * β 
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
    return y
end

Since, in the above approach, z is an auxiliary variable with new allocation (?), I thought of reusing y to compute the linear combination thinking it will save some allocations and hence time.
Approach 2: Here y is reused to store the intermediate result of X * β

function generate2!(y, X, β)
    @timeit "X * β" y .= X * β  # reuse already allocated "y"
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
    return y
end

Now I test it using the following code.

# Test data
X = randn(1000, 50)
β = rand(size(X, 2))
y = Vector(undef, size(X, 1))

# do precompile - both the test functions as well as timers
reset_timer!()
@timeit "" generate1!(y, X, β);
@timeit "" generate2!(y, X, β);

# Actual test
reset_timer!()
for i = 1:100
@timeit "generate1!()" generate1!(y, X, β);
@timeit "generate2!()" generate2!(y, X, β);
end
print_timer()

OUTPUT: After running, the second approach takes more time and more allocations.

 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        1.09s / 3.77%           83.5MiB / 8.28%    

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 generate1!()      100   11.3ms  27.5%   113μs   1.55MiB  22.4%  15.9KiB
   z = X * β       100   2.72ms  6.62%  27.2μs    794KiB  11.2%  7.94KiB
   p               100   5.16ms  12.6%  51.6μs    794KiB  11.2%  7.94KiB
   y               100   3.17ms  7.72%  31.7μs     0.00B  0.00%    0.00B
 generate2!()      100   29.8ms  72.5%   298μs   5.37MiB  77.6%  55.0KiB
   y .= X * β      100   5.05ms  12.3%  50.5μs   2.30MiB  33.3%  23.6KiB   <<< Extra allocation even on reuse of "y"
   p               100   20.3ms  49.6%   203μs   3.06MiB  44.2%  31.3KiB   <<< Extra allocation; don't know why
   y               100   4.17ms  10.1%  41.7μs   7.81KiB  0.11%    80.0B   <<< Extra allocation; don't know why

 ───────────────────────────────────────────────────────────────────────

Why this is happening?

Because of the matrix multiplication. Use the in-place mul! to avoid that.

EDIT: The mul! is defined inside the LinearAlgebra module.

1 Like

This operation is not fully in-place. The * is not a broadcasted function, so this computes a new vector to hold the result of X * beta and then copies that vector into y. Matrix-vector multiplication is not an elementwise operation, so broadcasting isn’t really enough to save you from allocations here.

There’s a further problem which is that your y is a Vector{Any}. The performance cost of Vector{Any} completely dominates everything else in this example. You want Vector{Float64}(undef, size(X, 1)).

Fixing that, and using mul!, we can see the difference:

julia> function f1(y, A, B)
         y .= A * B
       end
f1 (generic function with 1 method)

julia> function f2(y, A, B)
         mul!(y, A, B)
       end
f2 (generic function with 1 method)

julia> @btime f1(y_any, $X, $β) setup=(y_any = Vector(undef, size($X, 1)));
  17.286 μs (1001 allocations: 23.56 KiB)

julia> @btime f1(y_concrete, $X, $β) setup=(y_concrete = Vector{Float64}(undef, size($X, 1)));
  6.816 μs (1 allocation: 7.94 KiB)

julia> @btime f2(y_concrete, $X, $β) setup=(y_concrete = Vector{Float64}(undef, size($X, 1)));
  5.798 μs (0 allocations: 0 bytes)
3 Likes

Hi @Ronis_BR

Using mul!() increased the allocations :open_mouth:
Code

function generate3!(y, X, β)
    @timeit "mul!(y, X, β)" mul!(y, X, β)  # reuse allocated "y"
    @timeit "p" p = logistic.(y)
    @timeit "y" y .= rand.(Bernoulli.(p))
end

Results

 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          1.61s / 33.3%            240MiB / 67.8%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 generate3!()         100    491ms  91.9%  4.91ms    156MiB  95.7%  1.56MiB
   mul!(y, X, β)      100    465ms  87.0%  4.65ms    153MiB  93.9%  1.53MiB   <<< here!!
   p                  100   20.1ms  3.76%   201μs   3.06MiB  1.88%  31.3KiB
   y                  100   5.64ms  1.06%  56.4μs   7.81KiB  0.00%    80.0B
 generate2!()         100   29.5ms  5.51%   295μs   5.37MiB  3.30%  55.0KiB
   p                  100   19.1ms  3.57%   191μs   3.06MiB  1.88%  31.3KiB
   y .= X * β         100   5.89ms  1.10%  58.9μs   2.30MiB  1.42%  23.6KiB
   y                  100   4.15ms  0.78%  41.5μs   7.81KiB  0.00%    80.0B
 generate1!()         100   13.7ms  2.56%   137μs   1.55MiB  0.96%  15.9KiB
   z = X * β          100   5.44ms  1.02%  54.4μs    794KiB  0.48%  7.94KiB
   p                  100   4.74ms  0.89%  47.4μs    794KiB  0.48%  7.94KiB
   y                  100   3.17ms  0.59%  31.7μs     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────

It’s because your y is a Vector{Any}. I’ve got more detail on this in my post above.

2 Likes

Hi @rdeits, Thanks a lot!. This helped.

y = Vector{eltype(X)}(undef, size(X, 1))

New results

 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          1.16s / 2.61%           81.0MiB / 4.79%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 generate2!()         100   10.4ms  34.1%   104μs   1.55MiB  40.0%  15.9KiB
   p                  100   5.20ms  17.1%  52.0μs    794KiB  20.0%  7.94KiB
   y                  100   3.07ms  10.1%  30.7μs     0.00B  0.00%    0.00B
   y .= X * β         100   1.92ms  6.32%  19.2μs    794KiB  20.0%  7.94KiB
 generate1!()         100   10.3ms  34.0%   103μs   1.55MiB  40.0%  15.9KiB
   p                  100   5.21ms  17.2%  52.1μs    794KiB  20.0%  7.94KiB
   y                  100   2.97ms  9.77%  29.7μs     0.00B  0.00%    0.00B
   z = X * β          100   1.95ms  6.41%  19.5μs    794KiB  20.0%  7.94KiB
 generate3!()         100   9.69ms  31.9%  96.9μs    796KiB  20.0%  7.96KiB   <<< Best!!
   p                  100   5.12ms  16.8%  51.2μs    794KiB  20.0%  7.94KiB
   y                  100   3.10ms  10.2%  31.0μs     0.00B  0.00%    0.00B
   mul!(y, X, β)      100   1.28ms  4.22%  12.8μs     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────
1 Like

Note that p = logistic.(y) also allocates a new vector p. Replacing p = logistic.(y); y .= rand.(Bernoulli.(p)) with y .= rand.(Bernoulli.(logistic.(y))) would remove more allocations, and likely give a bit more speedup.