 # Extra allocation when trying to reuse pre-allocated vector

Hi,

I am trying to generate data for logistic regression `y` (which is {0, 1}) from `X` and `β`. I already allocate the space for `y ` and then wrote function to computing the same in two different ways. I then use `TimerOutputs.jl` to analyze the time and allocations.

Approach 1: Here the linear combination is first computed as `z` and is then used for generating samples.

``````function generate1!(y, X, β)
@timeit "X * β" z = X * β
@timeit "p" p = logistic.(y)
@timeit "y" y .= rand.(Bernoulli.(p))
return y
end
``````

Since, in the above approach, `z` is an auxiliary variable with new allocation (?), I thought of reusing `y` to compute the linear combination thinking it will save some allocations and hence time.
Approach 2: Here `y` is reused to store the intermediate result of `X * β`

``````function generate2!(y, X, β)
@timeit "X * β" y .= X * β  # reuse already allocated "y"
@timeit "p" p = logistic.(y)
@timeit "y" y .= rand.(Bernoulli.(p))
return y
end
``````

Now I test it using the following code.

``````# Test data
X = randn(1000, 50)
β = rand(size(X, 2))
y = Vector(undef, size(X, 1))

# do precompile - both the test functions as well as timers
reset_timer!()
@timeit "" generate1!(y, X, β);
@timeit "" generate2!(y, X, β);

# Actual test
reset_timer!()
for i = 1:100
@timeit "generate1!()" generate1!(y, X, β);
@timeit "generate2!()" generate2!(y, X, β);
end
print_timer()
``````

OUTPUT: After running, the second approach takes more time and more allocations.

`````` ───────────────────────────────────────────────────────────────────────
Time                   Allocations
──────────────────────   ───────────────────────
Tot / % measured:        1.09s / 3.77%           83.5MiB / 8.28%

Section        ncalls     time   %tot     avg     alloc   %tot      avg
───────────────────────────────────────────────────────────────────────
generate1!()      100   11.3ms  27.5%   113μs   1.55MiB  22.4%  15.9KiB
z = X * β       100   2.72ms  6.62%  27.2μs    794KiB  11.2%  7.94KiB
p               100   5.16ms  12.6%  51.6μs    794KiB  11.2%  7.94KiB
y               100   3.17ms  7.72%  31.7μs     0.00B  0.00%    0.00B
generate2!()      100   29.8ms  72.5%   298μs   5.37MiB  77.6%  55.0KiB
y .= X * β      100   5.05ms  12.3%  50.5μs   2.30MiB  33.3%  23.6KiB   <<< Extra allocation even on reuse of "y"
p               100   20.3ms  49.6%   203μs   3.06MiB  44.2%  31.3KiB   <<< Extra allocation; don't know why
y               100   4.17ms  10.1%  41.7μs   7.81KiB  0.11%    80.0B   <<< Extra allocation; don't know why

───────────────────────────────────────────────────────────────────────
``````

Why this is happening?

Because of the matrix multiplication. Use the in-place `mul!` to avoid that.

EDIT: The `mul!` is defined inside the `LinearAlgebra` module.

1 Like

This operation is not fully in-place. The `*` is not a broadcasted function, so this computes a new vector to hold the result of X * beta and then copies that vector into `y`. Matrix-vector multiplication is not an elementwise operation, so broadcasting isn’t really enough to save you from allocations here.

There’s a further problem which is that your `y` is a `Vector{Any}`. The performance cost of `Vector{Any}` completely dominates everything else in this example. You want `Vector{Float64}(undef, size(X, 1))`.

Fixing that, and using `mul!`, we can see the difference:

``````julia> function f1(y, A, B)
y .= A * B
end
f1 (generic function with 1 method)

julia> function f2(y, A, B)
mul!(y, A, B)
end
f2 (generic function with 1 method)

julia> @btime f1(y_any, \$X, \$β) setup=(y_any = Vector(undef, size(\$X, 1)));
17.286 μs (1001 allocations: 23.56 KiB)

julia> @btime f1(y_concrete, \$X, \$β) setup=(y_concrete = Vector{Float64}(undef, size(\$X, 1)));
6.816 μs (1 allocation: 7.94 KiB)

julia> @btime f2(y_concrete, \$X, \$β) setup=(y_concrete = Vector{Float64}(undef, size(\$X, 1)));
5.798 μs (0 allocations: 0 bytes)
``````
3 Likes

Using `mul!()` increased the allocations Code

``````function generate3!(y, X, β)
@timeit "mul!(y, X, β)" mul!(y, X, β)  # reuse allocated "y"
@timeit "p" p = logistic.(y)
@timeit "y" y .= rand.(Bernoulli.(p))
end
``````

Results

`````` ──────────────────────────────────────────────────────────────────────────
Time                   Allocations
──────────────────────   ───────────────────────
Tot / % measured:          1.61s / 33.3%            240MiB / 67.8%

Section           ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────
generate3!()         100    491ms  91.9%  4.91ms    156MiB  95.7%  1.56MiB
mul!(y, X, β)      100    465ms  87.0%  4.65ms    153MiB  93.9%  1.53MiB   <<< here!!
p                  100   20.1ms  3.76%   201μs   3.06MiB  1.88%  31.3KiB
y                  100   5.64ms  1.06%  56.4μs   7.81KiB  0.00%    80.0B
generate2!()         100   29.5ms  5.51%   295μs   5.37MiB  3.30%  55.0KiB
p                  100   19.1ms  3.57%   191μs   3.06MiB  1.88%  31.3KiB
y .= X * β         100   5.89ms  1.10%  58.9μs   2.30MiB  1.42%  23.6KiB
y                  100   4.15ms  0.78%  41.5μs   7.81KiB  0.00%    80.0B
generate1!()         100   13.7ms  2.56%   137μs   1.55MiB  0.96%  15.9KiB
z = X * β          100   5.44ms  1.02%  54.4μs    794KiB  0.48%  7.94KiB
p                  100   4.74ms  0.89%  47.4μs    794KiB  0.48%  7.94KiB
y                  100   3.17ms  0.59%  31.7μs     0.00B  0.00%    0.00B
──────────────────────────────────────────────────────────────────────────
``````

It’s because your `y` is a `Vector{Any}`. I’ve got more detail on this in my post above.

2 Likes

Hi @rdeits, Thanks a lot!. This helped.

``````y = Vector{eltype(X)}(undef, size(X, 1))
``````

New results

`````` ──────────────────────────────────────────────────────────────────────────
Time                   Allocations
──────────────────────   ───────────────────────
Tot / % measured:          1.16s / 2.61%           81.0MiB / 4.79%

Section           ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────
generate2!()         100   10.4ms  34.1%   104μs   1.55MiB  40.0%  15.9KiB
p                  100   5.20ms  17.1%  52.0μs    794KiB  20.0%  7.94KiB
y                  100   3.07ms  10.1%  30.7μs     0.00B  0.00%    0.00B
y .= X * β         100   1.92ms  6.32%  19.2μs    794KiB  20.0%  7.94KiB
generate1!()         100   10.3ms  34.0%   103μs   1.55MiB  40.0%  15.9KiB
p                  100   5.21ms  17.2%  52.1μs    794KiB  20.0%  7.94KiB
y                  100   2.97ms  9.77%  29.7μs     0.00B  0.00%    0.00B
z = X * β          100   1.95ms  6.41%  19.5μs    794KiB  20.0%  7.94KiB
generate3!()         100   9.69ms  31.9%  96.9μs    796KiB  20.0%  7.96KiB   <<< Best!!
p                  100   5.12ms  16.8%  51.2μs    794KiB  20.0%  7.94KiB
y                  100   3.10ms  10.2%  31.0μs     0.00B  0.00%    0.00B
mul!(y, X, β)      100   1.28ms  4.22%  12.8μs     0.00B  0.00%    0.00B
──────────────────────────────────────────────────────────────────────────
``````
1 Like

Note that `p = logistic.(y)` also allocates a new vector `p`. Replacing `p = logistic.(y); y .= rand.(Bernoulli.(p))` with `y .= rand.(Bernoulli.(logistic.(y)))` would remove more allocations, and likely give a bit more speedup.