Is there a way to get the result without any allocations? I am aware of this discussion, but I could not find anything there about avoiding allocations.
Because it does pivoted QR, which is more expensive (but more stable) than the normal-equations approach (what you are calling the “OLS formula”). In particular, when minimizing \Vert Ax - b\Vert, the QR method needs to allocate storage comparable in size to the original matrix A, whereas the normal-equations approach needs allocations only proportional to the size of the much smaller matrix A^T A. See also:
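For reference, a minimal sketch of that equivalence (assuming Julia ≥ 1.7, where ColumnNorm() requests column pivoting):

using LinearAlgebra
A = rand(1000, 10); b = rand(1000);
A \ b ≈ qr(A, ColumnNorm()) \ b   # dense rectangular \ performs pivoted QR internally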
Note, by the way, that inv(x'*x) * x' * y is suboptimal in multiple respects, besides the potential inaccuracy compared to QR if x is badly conditioned. First, you don’t really need to invert the matrix: you can use (x'x) \ ... instead, or even cholesky(Hermitian(x'x)) \ ... to tell Julia explicitly to use a Cholesky factorization, since x'x is symmetric positive-definite (SPD). Second, beware that multiplication in Julia is left-associative, so inv(x'*x) * x' * y really computes (inv(x'*x) * x') * y, which requires an inefficient matrix–matrix product, rather than inv(x'*x) * (x' * y), which involves only matrix–vector products. Parentheses matter (for performance) in matrix expressions!
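A minimal sketch of the difference, with made-up data of the shapes implied above (the coefficient names here are mine, just for illustration):

using LinearAlgebra
x = rand(1000, 10); y = rand(1000);               # example data, assumed shapes
β_naive = inv(x' * x) * x' * y                    # explicit inverse + matrix–matrix product
β_chol  = cholesky(Hermitian(x' * x)) \ (x' * y)  # Cholesky solve + matrix–vector products
β_naive ≈ β_chol                                  # same answer, up to roundoff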
However, for large problems the cost of the normal-equations approach is usually dominated by the cost of computing A^T A: for an m \times n matrix with m \gg n, forming A^T A takes on the order of mn^2 operations, versus only about n^3/3 for the subsequent Cholesky factorization.
As you noticed, the first step is to pre-allocate matrices for intermediate computations: use mul! for pre-allocated multiplications, and ldiv! for a pre-allocated backslash. Don’t use inv; as noted above, use a factorization object with pre-allocated storage instead.
For example, in the normal-equations approach, we can use cholesky! to do everything with zero allocations (after pre-allocation):
using LinearAlgebra, BenchmarkTools

m, n = 1000, 10
A = rand(m, n); b = rand(m); # input data
# preallocate outputs and temporaries:
AᵀA = similar(A, n, n);
Aᵀb = similar(b, n);
x = similar(b, n);
which gives:
julia> @btime ldiv!($x, cholesky!(Hermitian(mul!($AᵀA, $A', $A))),
mul!($Aᵀb, $A', $b));
14.026 μs (0 allocations: 0 bytes)
julia> x ≈ A \ b
true
As mentioned in the link above, you should be cautious about using the normal equations if A might be badly conditioned, since forming A^T A squares the condition number; that’s why it’s not the default for A \ b.
Doing the QR approach in-place is harder. There is a qr! function that can exploit pre-allocated output, but it’s not completely allocation-free.
The reason for this is that LAPACK’s QR algorithm requires additional working storage beyond the storage for the output factorization, and Julia’s high-level interface currently provides no way to pre-allocate that additional storage. The only way to do it completely pre-allocated would be to call the low-level LAPACK interface.
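For completeness, here is a rough sketch of that lower-level route using the LinearAlgebra.LAPACK wrappers (the helper name lls_qr! is mine; note too that these wrappers still allocate their own small LAPACK work buffers internally, so a fully allocation-free version would need direct ccalls with pre-allocated workspace):

using LinearAlgebra

# Sketch: solve min ‖Ax - b‖ via Householder QR; overwrites A and b,
# so pass scratch copies if you need the originals afterwards.
function lls_qr!(x, A, b, τ)
    LAPACK.geqrf!(A, τ)                 # A ← packed QR factors, τ ← reflector scalars
    LAPACK.ormqr!('L', 'T', A, τ, b)    # b ← Qᵀb
    n = size(A, 2)
    copyto!(x, 1, b, 1, n)              # x ← first n entries of Qᵀb
    LAPACK.trtrs!('U', 'N', 'N', view(A, 1:n, 1:n), x)  # solve R x = (Qᵀb)[1:n]
    return x
end

τ = similar(b, min(m, n));  # preallocated reflector coefficients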
However, unless you are solving rather tiny least-squares problems, the allocations are usually not a dominant factor in performance anyway. For example, if you skip pre-allocating the qr output in the example above, the time increases by less than 10%.
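That comparison might look like the following sketch (timing output omitted here; Atmp is just a scratch copy so that qr! does not destroy A):

Atmp = similar(A);
@btime ldiv!($x, qr!(copyto!($Atmp, $A)), $b);  # still allocates the factorization internals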
Thank you very much for this great and detailed explanation! And thanks for pointing out the “wrong” parentheses. Below are the updated benchmarks for these cases: