Memory allocation, left-division operator

Why does the left-division operator allocate comparatively so much memory?

# Data
x = [ones(1000) rand(1000)]
y = rand(1000)

# Benchmark I (left-division)
@btime ($x \ $y)
  4.089 μs (29 allocations: 91.67 KiB)
2-element Vector{Float64}:
 0.4780493038672541
 0.03690626005679365

# Benchmark II (OLS formula)
@btime (inv($x'*$x)*$x'*$y)
  2.143 μs (8 allocations: 1.56 KiB)
2-element Vector{Float64}:
 0.478049303867256
 0.03690626005679073

# Benchmark III (using mul!())
xtimesx = zeros(size(x, 2), size(x, 2))
xtimesy = zeros(size(x, 2))
@btime inv(mul!($xtimesx, $x', $x))*mul!($xtimesy, $x', $y)
  2.201 μs (5 allocations: 1.38 KiB)
2-element Vector{Float64}:
 0.478049303867256
 0.03690626005679073

Is there a way to get the result without any allocations? I am aware of this discussion, but I could not find anything there specifically about allocations.


Because it does pivoted QR, which is more expensive (but more stable) than the normal-equations approach (what you are calling the “OLS formula”). In particular, when minimizing \Vert Ax - b \Vert, the QR method needs to allocate storage comparable in size to the original matrix A, whereas the normal-equations approach needs allocations only proportional to the size of the much smaller matrix A^T A.
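You can check that this is what backslash is doing for a tall matrix with a quick comparison (my sketch, not from the original post):

using LinearAlgebra

A = rand(1000, 2); b = rand(1000)

# For a rectangular A, backslash dispatches to a column-pivoted QR solve,
# so these two should agree:
qr(A, ColumnNorm()) \ b ≈ A \ b  # true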

Note, by the way, that inv(x'*x) * x' * y is suboptimal in multiple respects, beyond the potential inaccuracy compared to QR if x is badly conditioned. First, you don’t really need to invert the matrix: you can use (x'x) \ ... instead, or even cholesky(Hermitian(x'x)) \ ... to tell Julia explicitly to use a Cholesky factorization, since x'x is SPD. Second, beware that multiplication in Julia is left-associative, so inv(x'*x) * x' * y is really computing (inv(x'*x) * x') * y, which requires an inefficient matrix–matrix product, rather than inv(x'*x) * (x' * y), which involves only a matrix–vector product. Parentheses matter (for performance) in matrix expressions!
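For concreteness, a minimal sketch of those two alternatives:

using LinearAlgebra

x = [ones(1000) rand(1000)]; y = rand(1000)  # as in the benchmarks above

β = (x'x) \ (x'y)                     # backslash on the normal equations, no inv
β = cholesky(Hermitian(x'x)) \ (x'y)  # explicitly request a Cholesky solve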

However, in large problems the cost of the normal-equations approach is usually dominated by the cost of computing A^T A itself: for an m \times n matrix with m \gg n, forming A^T A takes on the order of mn^2 operations, while factorizing and solving the small n \times n system takes only on the order of n^3.
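You can see this by timing the two pieces separately (my sketch, not from the original posts):

using LinearAlgebra, BenchmarkTools

A = rand(10^6, 10)
AᵀA = zeros(10, 10)

@btime mul!($AᵀA, $A', $A);       # O(mn²): the dominant cost
@btime cholesky(Hermitian($AᵀA)); # O(n³): negligible for a 10×10 system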

As you noticed, the first step is to pre-allocate matrices for the intermediate computations. Use mul! for pre-allocated multiplications and ldiv! for a pre-allocated backslash. Don’t use inv; as noted above, use a factorization object with pre-allocated storage.

For example, in the normal-equations approach, we can use cholesky! to do everything with zero allocations (after pre-allocation):

m, n = 1000, 10
A = rand(m, n); b = rand(m); # input data

# preallocate outputs and temporaries:
AᵀA = similar(A, n, n);
Aᵀb = similar(b, n);
x = similar(b, n);

which gives:

julia> @btime ldiv!($x, cholesky!(Hermitian(mul!($AᵀA, $A', $A))),
                        mul!($Aᵀb, $A', $b));
  14.026 μs (0 allocations: 0 bytes)

julia> x ≈ A \ b
true

As mentioned above, you should be cautious about using the normal equations if A might be badly conditioned; that’s why this approach is not the default for A \ b.
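The underlying issue is that forming A^T A squares the condition number, \kappa(A^T A) = \kappa(A)^2, so roughly half of the significant digits can be lost. A quick illustration (my sketch):

using LinearAlgebra

A = randn(1000, 2) * Diagonal([1.0, 1e-6])  # badly scaled columns

cond(A)       # on the order of 1e6
cond(A' * A)  # on the order of 1e12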

Doing the QR approach in-place is harder. There is a qr! function that can exploit pre-allocated output, but it’s not completely allocation-free:

julia> Atmp = similar(A);

julia> @btime qr!(copyto!($Atmp, $A), ColumnNorm()); # pivoted QR
  30.034 μs (4 allocations: 3.34 KiB)

The reason for this is that LAPACK’s QR algorithm requires additional working storage beyond the storage for the output factorization, and Julia’s high-level interface currently provides no way to pre-allocate that additional storage. The only way to do it completely pre-allocated would be to call the low-level LAPACK interface.
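For instance, one level down you can at least hand LAPACK pre-allocated pivot and Householder-coefficient arrays via LinearAlgebra.LAPACK.geqp3! (a sketch; note that even this wrapper still allocates LAPACK’s work array internally on each call, so a truly allocation-free version would need a direct ccall):

using LinearAlgebra
using LinearAlgebra: BlasInt

A    = rand(1000, 10)
Atmp = similar(A)
jpvt = zeros(BlasInt, size(A, 2))         # pivot indices; 0 marks a free column
tau  = zeros(eltype(A), min(size(A)...))  # Householder coefficients

fill!(jpvt, 0)
LAPACK.geqp3!(copyto!(Atmp, A), jpvt, tau)  # pivoted QR, overwriting Atmp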

However, unless you are solving rather tiny least-squares problems, allocations are usually not such a dominant factor in performance anyway. For example, if you go without pre-allocating the qr output in the example above, the time increases by less than 10%:

julia> @btime qr($A, ColumnNorm()); # pivoted QR, without pre-allocation
  32.279 μs (6 allocations: 81.52 KiB)

and the time for the normal-equations approach is virtually unchanged without pre-allocation in this example:

julia> @btime cholesky!(Hermitian($A'*$A)) \ ($A'*$b);
  14.104 μs (3 allocations: 1.16 KiB)

A lot more readable!


You should probably be using LinearSolve.jl here; it makes it easy to reuse the caches.
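Something like this, using LinearSolve.jl’s caching interface (a sketch; check the package docs for the current API):

using LinearSolve

A = rand(1000, 2); b = rand(1000)

prob  = LinearProblem(A, b)
cache = init(prob, QRFactorization())  # choose a least-squares-capable solver
sol   = solve!(cache)                  # factorizes A once

cache.b = rand(1000)                   # swap in a new right-hand side
sol2 = solve!(cache)                   # reuses the cached factorization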


Another option might be FastLapackInterface.jl.


Thank you very much for this great and detailed explanation! And thanks for pointing out the “wrong” parentheses. Below are the updated benchmarks for these cases:

# Data
x = [ones(1000) rand(1000)]
y = rand(1000)

# matrix-matrix product
@btime (inv($x'*$x)*$x'*$y)
  2.250 μs (8 allocations: 1.56 KiB)
2-element Vector{Float64}:
  0.49744667725473835
 -0.005938826891960122

# matrix-vector product
@btime inv($x'*$x)*($x'*$y)
  2.218 μs (7 allocations: 1.55 KiB)
2-element Vector{Float64}:
  0.49744667725473835
 -0.005938826891960122

# matrix-vector product (without inversion)
@btime ($x'*$x)\($x'*$y)
  2.087 μs (5 allocations: 432 bytes)
2-element Vector{Float64}:
  0.49744667725473823
 -0.005938826891960073