Non-allocating A / B where A and B are matrices

Is there a non-allocating division for two matrices?
For example:

A = randn(10,10); B = randn(10,10);
test(A,B) = A / B
@benchmark test($A, $B)

Output:

BenchmarkTools.Trial: 
  memory estimate:  2.81 KiB
  allocs estimate:  7
  --------------
  minimum time:     9.443 μs (0.00% GC)
  median time:      11.608 μs (0.00% GC)
  mean time:        14.083 μs (12.47% GC)
  maximum time:     17.603 ms (99.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

Unfortunately, StaticArrays is not an option, since the matrices can be as large as 100×100. I need this division to compute the Kalman gain in the measurement-update step.
I know there is an rdiv! function, but I am struggling to reduce the amount of allocated memory. Any hint would be appreciated.

Please read the help, e.g. ?rdiv!, and if that is not enough, provide a self-contained example of how you are trying to use it and what goes wrong.

Sure, I tried it with LU decomposition:

function test!(A, B)
    B_lu = lu!(B)
    rdiv!(A, UpperTriangular(B_lu.U))
    rdiv!(A, LowerTriangular(B_lu.L))
    view(A, :, invperm(B_lu.p))  # undo the row pivots, applied to the columns
end
@benchmark test!($A,$B)

Output:

BenchmarkTools.Trial: 
  memory estimate:  2.89 KiB
  allocs estimate:  15
  --------------
  minimum time:     28.570 μs (0.00% GC)
  median time:      31.513 μs (0.00% GC)
  mean time:        34.639 μs (5.69% GC)
  maximum time:     19.760 ms (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     1

Both functions return the same result, but test! allocates even more memory.

One idea (which probably isn’t worth it) is

function test_ldiv(A, B)
    Ac = collect(adjoint(A))  # materialize A'
    Bc = collect(adjoint(B))  # materialize B'
    ldiv!(lu!(Bc), Ac)        # Ac = B' \ A'
    adjoint(Ac)               # (B' \ A')' == A / B
end

Benchmark:

julia> @benchmark test_ldiv($A, $B)
BenchmarkTools.Trial:
  memory estimate:  1.98 KiB
  allocs estimate:  7
  --------------
  minimum time:     12.400 μs (0.00% GC)
  median time:      15.200 μs (0.00% GC)
  mean time:        22.883 μs (0.00% GC)
  maximum time:     256.500 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark test($A, $B)
BenchmarkTools.Trial:
  memory estimate:  2.84 KiB
  allocs estimate:  7
  --------------
  minimum time:     13.399 μs (0.00% GC)
  median time:      18.200 μs (0.00% GC)
  mean time:        24.002 μs (4.30% GC)
  maximum time:     10.404 ms (99.21% GC)
  --------------
  samples:          10000
  evals/sample:     1

In this case, it does seem that the documentation for rdiv! is not particularly clear:

rdiv!(A, B)

Compute A / B in-place and overwriting A to store the result.

The argument B should not be a matrix. Rather, instead of matrices it should be a factorization object (e.g. produced by factorize or cholesky). The reason for this is that factorization itself is both expensive and typically allocates memory (although it can also be done in-place via, e.g., lu!), and performance-critical situations requiring rdiv! usually also require fine-grained control over the factorization of B.

For instance, the part

… instead of matrices it should be a factorization object (e.g. produced by factorize or cholesky) …

suggests that one could compute A / B by a call like rdiv!(A, factorize(B)) or rdiv!(A, lu!(B)); neither works, however, since there is no rdiv!(::Array, ::LU) method.
Additionally, the type signatures obtained from methods(rdiv!) are crazy-long, so it does actually seem to be a bit of a challenge to get a good overview of this method.
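For comparison, rdiv! does have in-place methods for triangular right-hand sides, which is what the LU-based workarounds in this thread build on; a minimal sketch (the 4I shift is only there to keep the example well-conditioned):

```julia
using LinearAlgebra

A = rand(4, 4)
U = UpperTriangular(rand(4, 4) + 4I)  # well-conditioned upper triangular matrix

C = copy(A)
rdiv!(C, U)   # in-place triangular right division: C now holds A / U
C ≈ A / U     # true
```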


I guess the question is whether it’s worth it. As the documentation says, the factorization is expensive compared to the cost of allocating memory, so if you have to factorize for every division, you won’t gain much performance by removing the allocations. Or do you have other reasons to reduce memory allocations, e.g. latency caused by GC pauses?

Building on @carstenbauer’s clever idea, we can create a version with almost no allocations by throwing in a temporary matrix:

"""
Compute A / B in-place and overwriting A to store the result.
Y is used as temporary storage.
"""
function rdiv_low_alloc!(A, B, Y)
    adjoint!(Y, A)
    adjoint!(A, B)
    ldiv!(lu!(A), Y)
    adjoint!(A, Y)
end

Sample usage:

julia> A,B = 1:2 .|>_-> rand(10, 10);

julia> Y = similar(A); # temporary buffer

julia> A / B == rdiv_low_alloc!(A, B, Y) # verify that it's correct
true

julia> @benchmark A / B evals = 1 setup = A,B = copy($A),copy($B)
BenchmarkTools.Trial:
  memory estimate:  2.84 KiB
  allocs estimate:  7
  --------------
  minimum time:     3.585 μs (0.00% GC)
  median time:      6.141 μs (0.00% GC)
  mean time:        11.326 μs (34.63% GC)
  maximum time:     39.242 ms (99.95% GC)

julia> @benchmark rdiv_low_alloc!(A, B, $Y) evals = 1 setup = A,B = copy($A),copy($B)
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     3.290 μs (0.00% GC)
  median time:      5.274 μs (0.00% GC)
  mean time:        6.469 μs (0.00% GC)
  maximum time:     40.037 μs (0.00% GC)

As you can see, the biggest difference is in the maximum time, where the low-allocating version is 1000x faster than the regular version, since there’s no GC occurring.

As for the remaining 192 bytes, they seem to come from creating the LU object that we’re passing to ldiv!. Not sure if there’s an easy way to avoid those allocations too.


I found a way without the need of a temporary storage:

function rdiv_lu!(A, B)
    B_lu = lu!(B)
    rdiv!(rdiv!(A, UpperTriangular(B_lu.factors)), UnitLowerTriangular(B_lu.factors))
    _apply_inverse_ipiv_cols!(B_lu, A)
end

The function _apply_inverse_ipiv_cols! is the equivalent of _apply_inverse_ipiv but applied to columns instead of rows.
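Since the PR is the only place that helper is defined, here is a hypothetical self-contained sketch of what it has to do (the name and argument order below are my own reconstruction, not the PR's exact code). With P*B = L*U, we have A / B = A*inv(U)*inv(L)*P, so after the two triangular rdiv! calls the row interchanges recorded in ipiv must be applied to the columns of A, in reverse order:

```julia
using LinearAlgebra

# Hypothetical reconstruction: apply the LU row interchanges in `ipiv`
# to the *columns* of A, in reverse order. This realizes A * P, where
# P is the permutation with P*B = L*U.
function apply_inverse_ipiv_cols!(A::AbstractMatrix, ipiv::AbstractVector{<:Integer})
    for i in length(ipiv):-1:1
        j = ipiv[i]
        if j != i
            for k in axes(A, 1)
                A[k, i], A[k, j] = A[k, j], A[k, i]
            end
        end
    end
    return A
end

# Same steps as rdiv_lu!, using the sketched helper:
A, B = rand(5, 5), rand(5, 5)
C, F = copy(A), lu!(copy(B))
rdiv!(rdiv!(C, UpperTriangular(F.factors)), UnitLowerTriangular(F.factors))
apply_inverse_ipiv_cols!(C, F.ipiv)
C ≈ A / B  # true
```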
Benchmarks and correctness:

julia> A,B = 1:2 .|>_-> rand(100, 100);

julia> A / B ≈ rdiv_lu!(A, B)
true

julia> @benchmark A / B setup = A,B = copy($A),copy($B)
BenchmarkTools.Trial: 
  memory estimate:  235.16 KiB
  allocs estimate:  10
  --------------
  minimum time:     279.562 μs (0.00% GC)
  median time:      284.536 μs (0.00% GC)
  mean time:        316.165 μs (1.47% GC)
  maximum time:     19.354 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark rdiv_lu!(A, B) setup = A,B = copy($A),copy($B)
BenchmarkTools.Trial: 
  memory estimate:  528 bytes
  allocs estimate:  2
  --------------
  minimum time:     243.370 μs (0.00% GC)
  median time:      262.753 μs (0.00% GC)
  mean time:        279.792 μs (0.00% GC)
  maximum time:     14.217 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I implemented a rdiv! function which accepts a LU factorization object and made a pull request: Let rdiv! accept a LU factorization object by zsoerenm · Pull Request #31285 · JuliaLang/julia · GitHub You will find the definition of _apply_inverse_ipiv_cols! there.


Well, yes, at the cost of:

  • Destroying both arguments A and B
  • Using the internal method _apply_inverse_ipiv_cols! (which currently doesn’t exist; you added it in a PR to LinearAlgebra).

Using internal methods this way is usually not a good idea. Think of it this way: in the PR you just created, you renamed several internal methods (e.g. _apply_ipiv! to _apply_ipiv_rows!). If anyone out there has code just like yours that uses these internal methods, it will suddenly break once your PR is merged. Similarly, your code is likely to break the next time someone refactors this logic in lu.jl. So I would strongly recommend not using this approach.

If you don’t want the temporary, why not define your own in-place adjoint method?

function adj!(X)
    for i = 1:size(X,1), j = i:size(X,2)
        @inbounds X[i,j], X[j,i] = X[j,i]', X[i,j]'
    end
    return X
end

function rdiv_low_alloc!(A, B)
    ldiv!(lu!(adj!(B)), adj!(A))
    adj!(A)
end

It even seems a bit more efficient:

julia> A,B = 1:2 .|>_-> rand(100, 100);

julia> A / B == rdiv_low_alloc!(A, B) # verify that it's correct
true

julia> # rdiv with temporary:

julia> @btime rdiv_low_alloc!(A, B, Y) evals = 1 setup = A,B,Y = copy($A),copy($B),similar($A);
  258.882 μs (2 allocations: 928 bytes)

julia> # rdiv with internal method:

julia> @btime rdiv_lu!(A, B) evals = 1 setup = A,B = copy($A),copy($B);
  257.909 μs (2 allocations: 928 bytes)

julia> # rdiv with in-place adjoint:

julia> @btime rdiv_low_alloc!(A, B) evals = 1 setup = A,B = copy($A),copy($B);
  235.869 μs (2 allocations: 928 bytes)

That’s actually quite a neat way to do an in-place adjoint.

I am aware of that. I just used it here for the sake of simplicity. Note that with the rdiv! method proposed in the PR, only the first argument is destroyed. The second argument must already be an LU factorization.
So when the PR gets accepted the following should work:

rdiv!(A, lu(B))
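Assuming a Julia version whose LinearAlgebra includes such a rdiv!(::AbstractMatrix, ::LU) method, a quick self-contained check of that call:

```julia
using LinearAlgebra

A = rand(10, 10); B = rand(10, 10)
C = copy(A)
rdiv!(C, lu(B))   # lu(B) copies, so B is untouched; only C is overwritten
C ≈ A / B         # true
```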