 # How to reduce large memory allocation by lu! with sparse matrices

In my code, I have to solve a large sparse linear system (14000 × 14000) many times to integrate a stiff system of ODEs.
Since the memory allocation by the backslash operator \ is quite large, I thought of manually performing the LU factorization with pre-allocated arrays.
However, a significant amount of memory is still allocated, and I would like to know whether there is a way to eliminate it.

A minimal example of this problem is the following:

julia> using LinearAlgebra;

julia> using SparseArrays;

julia> using SuiteSparse.UMFPACK;

julia> using BenchmarkTools;

julia> A = sparse(Float64[1.0 2.0; 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×2 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
0.333333  0.666667
⋅        1.0

julia> B = sparse(Float64[1.0 1.0; 0.0 1.0]);

julia> function lu_loop!(F::UmfpackLU, B::SparseMatrixCSC)
           for _ = 1:2000
               lu!(F, B)
           end
       end
lu_loop! (generic function with 1 method)

julia> @btime lu_loop!(F,B)
6.459 ms (72000 allocations: 4.23 MiB)


Now I think that pre-allocating F could eliminate the allocations:

julia> mutable struct Factor
           F::UmfpackLU
       end

julia> factor = Factor(F)
Factor(UmfpackLU{Float64, Int64}(Ptr{Nothing} @0x00007fd428caeea0, Ptr{Nothing} @0x00007fd3f8c1e9b0, 2, 2, [0, 1, 3], [0, 0, 1], [1.0, 1.0, 1.0], 0))

julia> @btime lu_loop!(factor.F,B)
6.359 ms (72000 allocations: 4.23 MiB)


This does not seem to help at all (maybe I’m doing this wrong).

In my actual code, the reported allocation goes up to ~30 GiB (I don't know how that is even possible when I have 16 GB of RAM), and I am wondering how I can avoid such a large amount of allocation.
If I could simply use the backslash operator \, that would be better for readability, but it allocates 2-3 times more.
Any idea why this is happening? Any feedback would be very much appreciated.


Is this the effect of fill-in? Meaning: zeros (not stored in the sparse matrix) become non-zeros during the factorization and those need to be stored.
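To see whether fill-in matters in a given case, one can compare the number of stored entries in the matrix with the number stored in its factors. A minimal sketch (the exact numbers depend on the random matrix and on UMFPACK's fill-reducing ordering):

```julia
using LinearAlgebra, SparseArrays

n = 1000
A = I + sprandn(n, n, 5e-3)   # random sparse matrix, a few nonzeros per row
F = lu(A)

# Fill-in: the L and U factors typically store more entries than A itself,
# and that extra storage must be allocated during the factorization.
@show nnz(A)
@show nnz(F.L) + nnz(F.U)
```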


Edit: I think I misunderstood your reply. It is possible that zeros become non-zeros during the factorization, but the problem is that I want the function to store them in the pre-allocated arrays rather than temporarily allocating new ones. I thought the mutable struct above could achieve this, but that does not seem to be the case.

Thanks for the reply. I tried changing the sparsity pattern:

julia> A = sparse(Float64[1.0 2.0 0.0; 0.0 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×3 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
0.333333   ⋅   0.666667
⋅        1.0   ⋅

julia> B = sparse(Float64[1.0 1.0 0.0; 0.0 0.0 1.0]);

julia> @btime lu_loop!(F,B)
6.242 ms (74000 allocations: 4.46 MiB)

julia> A = sparse(Float64[1.0 2.0 1.0; 0.0 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×3 SparseMatrixCSC{Float64, Int64} with 4 stored entries:
0.25  0.25  0.5
⋅    1.0    ⋅

julia> B = sparse(Float64[1.0 1.0 1.0; 0.0 0.0 1.0]);

julia> @btime lu_loop!(F,B)
6.374 ms (74000 allocations: 4.46 MiB)


But I don’t see any difference between the two cases, if this comparison actually tests what you suggested.

You could perhaps have a look at IncompleteLU.jl.
Also, you could perhaps use the (incomplete) LU decomposition as a preconditioner for an iterative solver.
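For concreteness, a sketch of that idea, assuming the IncompleteLU.jl and IterativeSolvers.jl packages are installed (package names and the drop tolerance `τ = 0.1` are illustrative choices, not from the original post):

```julia
# Sketch: incomplete LU as a left preconditioner for GMRES.
using LinearAlgebra, SparseArrays
using IncompleteLU, IterativeSolvers

n = 1000
A = I + sprandn(n, n, 5e-3)
b = rand(n)

P = ilu(A, τ = 0.1)       # drop tolerance trades sparsity against accuracy
x = gmres(A, b; Pl = P)   # left-preconditioned GMRES

@show norm(A * x - b)     # residual of the iterative solution
```

The incomplete factorization is much sparser (and cheaper to allocate) than the exact one, at the cost of solving only approximately and then iterating.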


I hadn’t thought of using an iterative solver, but it will be worth a try.

In general I’m happy with the performance (speed) of lu! followed by ldiv!, or just \, for solving the linear system, though.
The only problem is the memory allocations…
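One way to keep most of \'s readability while trimming per-solve allocations is to factorize once and reuse the factorization with ldiv! into a preallocated solution vector. A minimal sketch:

```julia
using LinearAlgebra, SparseArrays

A = sparse([2.0 1.0; 0.0 3.0])
b = [1.0, 1.0]
x = similar(b)          # preallocated solution vector

F = lu(A)               # factorize once
ldiv!(x, F, b)          # solve in place; avoids the temporary that x = F \ b creates

@show norm(A * x - b)   # should be ≈ 0
```

This removes the right-hand-side/solution temporaries, although the factorization step itself still allocates, as discussed below in this thread.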

The allocation figures count everything that was allocated and later garbage-collected over the whole run, so the reported total can exceed the available RAM. It does not actually allocate that much at once: 72000 allocations / 2000 iterations = 36 allocations per call.
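To see that the reported number is a running total rather than a peak, consider this sketch, where each iteration allocates one small temporary that is immediately collectible:

```julia
using BenchmarkTools

function alloc_loop(n)
    s = 0.0
    for _ in 1:n
        v = rand(8)      # one small heap allocation per iteration
        s += sum(v)
    end
    return s
end

# @btime reports the *total* allocated over the call (roughly n small
# allocations), even though the live memory at any instant is tiny.
@btime alloc_loop(1000)
```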


First Remark

lu_loop!(factor.F, B) is completely equivalent to lu_loop!(F, B) because factor.F === F is true: the struct field holds a reference to the very same UmfpackLU object.
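In other words, wrapping an object in a mutable struct stores a reference, not a copy, so the benchmark exercises exactly the same object either way. A minimal illustration:

```julia
mutable struct Wrapper
    x::Vector{Float64}
end

v = [1.0, 2.0]
w = Wrapper(v)

@show w.x === v    # true: the struct holds a reference, not a copy
w.x[1] = 99.0
@show v[1]         # 99.0: mutating through the wrapper mutates v itself
```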

Second Remark

lu!(F, X) does not eliminate the allocations, but it can reduce them.

using BenchmarkTools
using Random; Random.seed!(4649373)
using LinearAlgebra
using SparseArrays
using SparseArrays: getcolptr
using SuiteSparse: decrement
using SuiteSparse.UMFPACK: UmfpackLU, UMFVTypes, UMFITypes, umfpack_numeric!

"""A modification of https://github.com/JuliaLang/SuiteSparse.jl/blob/master/src/umfpack.jl#L220-L278"""
function lu_dot!(F::UmfpackLU, S::SparseMatrixCSC{<:UMFVTypes,<:UMFITypes}; check::Bool=true)
    zerobased = getcolptr(S)[1] == 0
    F.m = size(S, 1)
    F.n = size(S, 2)
    if zerobased
        F.colptr .= getcolptr(S)
        F.rowval .= rowvals(S)
    else
        F.colptr .= getcolptr(S) .- oneunit(eltype(S))
        F.rowval .= rowvals(S) .- oneunit(eltype(S))
    end
    F.nzval .= nonzeros(S)

    umfpack_numeric!(F, reuse_numeric = false)
    check && (issuccess(F) || throw(LinearAlgebra.SingularException(0)))
    return F
end

n = 2^10
X = I + sprandn(n, n, 1e-3)

F = lu(X)
@show lu(X).L == lu!(F, X).L
@show lu(X).U == lu!(F, X).U
@show lu!(F, X) == lu_dot!(F, X)
println()
print("lu(X):        "); @btime lu($X);
print("lu!(F, X):    "); @btime lu!($F, $X);
print("lu_dot!(F, X):"); @btime lu_dot!($F, $X);


Output:

(lu(X)).L == (lu!(F, X)).L = true
(lu(X)).U == (lu!(F, X)).U = true
lu!(F, X) == lu_dot!(F, X) = true

lu(X):          141.900 μs (61 allocations: 505.51 KiB)
lu!(F, X):      71.200 μs (36 allocations: 220.17 KiB)
lu_dot!(F, X):  73.000 μs (31 allocations: 179.58 KiB)


Here, lu_dot!(F, X) is a modification of lu!(F, X) to reduce allocations with dot-syntax.

As a result, we have reduced the allocations from 505.51 KiB to 179.58 KiB. The remaining 179.58 KiB is allocated by umfpack_numeric!(F; reuse_numeric = false), the essential part of lu!(F, X).

print("umfpack_numeric!(F, reuse_numeric = false):")
let F = F, S = X
    zerobased = getcolptr(S)[1] == 0
    F.m = size(S, 1)
    F.n = size(S, 2)
    if zerobased
        F.colptr .= getcolptr(S)
        F.rowval .= rowvals(S)
    else
        F.colptr .= getcolptr(S) .- oneunit(eltype(S))
        F.rowval .= rowvals(S) .- oneunit(eltype(S))
    end
    F.nzval .= nonzeros(S)

    @btime umfpack_numeric!(F, reuse_numeric = false)
end;


Output:

umfpack_numeric!(F, reuse_numeric = false):  68.300 μs (31 allocations: 179.58 KiB)


This is all I could do at the moment.


Thank you very much Dr. Kuroki, for your detailed explanation.

It’s good to know that the memory allocation was not because of my code, but mainly due to UMFPACK itself.
As @rveltz pointed out, the amount of allocation is not prohibitive per loop, so probably I can live with this.
I will definitely try your lu_dot!(F, X) function and see the actual difference in my code. There is also a lot to learn from how you implemented this.

Thank you very much, it makes sense!

Judging from the amount of allocation in my actual code, I can probably live with this.
But in my use case I do need to know the peak memory usage (since I will run this on a cluster), so I still have to figure out how to measure or limit it.
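As a starting point for measuring the peak rather than the cumulative allocation, Julia's built-in Sys.maxrss() reports the process's maximum resident set size so far, which is typically the quantity a cluster's batch scheduler enforces. A minimal sketch:

```julia
# Peak resident set size (high-water mark) of the current Julia
# process, reported in MiB. This is cumulative-peak, not current usage.
peak_mib = Sys.maxrss() / 2^20
@show peak_mib
```

On a cluster, the scheduler's own job accounting (e.g. a job's reported maximum RSS) is another place to read this figure after a run.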