 # How to reduce large memory allocation by lu! with sparse matrices

In my code, I have to solve a large sparse linear system (14000 × 14000) many times to integrate a stiff system of ODEs.
Since the memory allocation by the backslash operator \ is quite large, I thought of manually performing the LU factorization with pre-allocated arrays.
However, a significant amount of memory is still allocated, and I would like to know whether there is a way to eliminate it.

A minimal example of this problem is the following:

julia> using LinearAlgebra;

julia> using SparseArrays;

julia> using SuiteSparse.UMFPACK;

julia> using BenchmarkTools;

julia> A = sparse(Float64[1.0 2.0; 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×2 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
0.333333  0.666667
⋅        1.0

julia> B = sparse(Float64[1.0 1.0; 0.0 1.0]);

julia> function lu_loop!(F::UmfpackLU, B::SparseMatrixCSC)
           for _ = 1:2000
               lu!(F, B)
           end
       end
lu_loop! (generic function with 1 method)

julia> @btime lu_loop!(F,B)
6.459 ms (72000 allocations: 4.23 MiB)


Now I think that pre-allocating F could eliminate the allocations:

julia> mutable struct Factor
           F::UmfpackLU
       end

julia> factor = Factor(F)
Factor(UmfpackLU{Float64, Int64}(Ptr{Nothing} @0x00007fd428caeea0, Ptr{Nothing} @0x00007fd3f8c1e9b0, 2, 2, [0, 1, 3], [0, 0, 1], [1.0, 1.0, 1.0], 0))

julia> @btime lu_loop!(factor.F,B)
6.359 ms (72000 allocations: 4.23 MiB)


This does not seem to help at all (maybe I’m doing this wrong).

In my actual code, the reported allocation goes up to ~30 GiB (I don't know how that is even possible when I have 16 GB of RAM), and I am wondering how I can avoid such a large amount of allocation.
If I could simply use the backslash operator \, that would be better for readability, but it allocates 2-3 times more.
Any idea why this is happening? Any feedback would be very much appreciated.


Is this the effect of fill-in? Meaning: zeros (not stored in the sparse matrix) become non-zeros during the factorization and those need to be stored.
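To see whether fill-in matters in a given case, one can compare the number of stored entries in the matrix with the number stored in its factors. A minimal sketch (the exact numbers depend on the random matrix and on UMFPACK's fill-reducing ordering):

```julia
using LinearAlgebra, SparseArrays

n = 1000
A = I + sprandn(n, n, 5e-3)   # random sparse matrix, a few nonzeros per row
F = lu(A)

# Fill-in: the L and U factors typically store more entries than A itself,
# and that extra storage must be allocated during the factorization.
@show nnz(A)
@show nnz(F.L) + nnz(F.U)
```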


Edit: I think I misunderstood your reply. It is possible that zeros become non-zeros during the factorization, but the problem is that I want the function to store them in the pre-allocated arrays rather than temporarily allocating new ones. I thought the mutable struct above could achieve this, but that does not seem to be the case.

Thanks for the reply. I tried changing the sparsity pattern:

julia> A = sparse(Float64[1.0 2.0 0.0; 0.0 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×3 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
0.333333   ⋅   0.666667
⋅        1.0   ⋅

julia> B = sparse(Float64[1.0 1.0 0.0; 0.0 0.0 1.0]);

julia> @btime lu_loop!(F,B)
6.242 ms (74000 allocations: 4.46 MiB)

julia> A = sparse(Float64[1.0 2.0 1.0; 0.0 0.0 3.0]);

julia> F = lu(A)
UmfpackLU{Float64, Int64}
L factor:
2×2 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
1.0   ⋅
⋅   1.0
U factor:
2×3 SparseMatrixCSC{Float64, Int64} with 4 stored entries:
0.25  0.25  0.5
⋅    1.0    ⋅

julia> B = sparse(Float64[1.0 1.0 1.0; 0.0 0.0 1.0]);

julia> @btime lu_loop!(F,B)
6.374 ms (74000 allocations: 4.46 MiB)


But I don’t see any difference between the two cases, if this comparison actually tests what you suggested.

You could perhaps have a look at IncompleteLU.jl.
Also, you could perhaps use the (incomplete) LU decomposition as a preconditioner for an iterative solver.
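For concreteness, a sketch of that idea, assuming the IncompleteLU.jl and IterativeSolvers.jl packages are installed (package names and the drop tolerance `τ = 0.1` are illustrative choices, not from the original post):

```julia
# Sketch: incomplete LU as a left preconditioner for GMRES.
using LinearAlgebra, SparseArrays
using IncompleteLU, IterativeSolvers

n = 1000
A = I + sprandn(n, n, 5e-3)
b = rand(n)

P = ilu(A, τ = 0.1)       # drop tolerance trades sparsity against accuracy
x = gmres(A, b; Pl = P)   # left-preconditioned GMRES

@show norm(A * x - b)     # residual of the iterative solution
```

The incomplete factorization is much sparser (and cheaper to allocate) than the exact one, at the cost of solving only approximately and then iterating.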


I hadn’t thought of using an iterative solver, but it will be worth a try.

In general I’m happy with the performance (speed) of lu! followed by ldiv!, or just \, for solving the linear system, though.
The only problem is the memory allocations…
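One way to keep most of \'s readability while trimming per-solve allocations is to factorize once and reuse the factorization with ldiv! into a preallocated solution vector. A minimal sketch:

```julia
using LinearAlgebra, SparseArrays

A = sparse([2.0 1.0; 0.0 3.0])
b = [1.0, 1.0]
x = similar(b)          # preallocated solution vector

F = lu(A)               # factorize once
ldiv!(x, F, b)          # solve in place; avoids the temporary that x = F \ b creates

@show norm(A * x - b)   # should be ≈ 0
```

This removes the right-hand-side/solution temporaries, although the factorization step itself still allocates, as discussed below in this thread.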

The allocation figures count everything that was allocated and later garbage-collected over the whole run, so the reported total can exceed the available RAM. It does not actually allocate that much at once: 72000 allocations / 2000 iterations = 36 allocations per call.
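To see that the reported number is a running total rather than a peak, consider this sketch, where each iteration allocates one small temporary that is immediately collectible:

```julia
using BenchmarkTools

function alloc_loop(n)
    s = 0.0
    for _ in 1:n
        v = rand(8)      # one small heap allocation per iteration
        s += sum(v)
    end
    return s
end

# @btime reports the *total* allocated over the call (roughly n small
# allocations), even though the live memory at any instant is tiny.
@btime alloc_loop(1000)
```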


First Remark

lu_loop!(factor.F, B) is completely equivalent to lu_loop!(F, B) because factor.F === F is true: the struct field holds a reference to the very same UmfpackLU object.
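In other words, wrapping an object in a mutable struct stores a reference, not a copy, so the benchmark exercises exactly the same object either way. A minimal illustration:

```julia
mutable struct Wrapper
    x::Vector{Float64}
end

v = [1.0, 2.0]
w = Wrapper(v)

@show w.x === v    # true: the struct holds a reference, not a copy
w.x[1] = 99.0
@show v[1]         # 99.0: mutating through the wrapper mutates v itself
```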

Second Remark

lu!(F, X) does not eliminate the allocations, but it can reduce them.

using BenchmarkTools
using Random; Random.seed!(4649373)
using LinearAlgebra
using SparseArrays
using SparseArrays: getcolptr
using SuiteSparse: decrement
using SuiteSparse.UMFPACK: UmfpackLU, UMFVTypes, UMFITypes, umfpack_numeric!

"""A modification of https://github.com/JuliaLang/SuiteSparse.jl/blob/master/src/umfpack.jl#L220-L278"""
function lu_dot!(F::UmfpackLU, S::SparseMatrixCSC{<:UMFVTypes,<:UMFITypes}; check::Bool=true)
    zerobased = getcolptr(S)[1] == 0
    F.m = size(S, 1)
    F.n = size(S, 2)
    if zerobased
        F.colptr .= getcolptr(S)
        F.rowval .= rowvals(S)
    else
        F.colptr .= getcolptr(S) .- oneunit(eltype(S))
        F.rowval .= rowvals(S) .- oneunit(eltype(S))
    end
    F.nzval .= nonzeros(S)

    umfpack_numeric!(F, reuse_numeric = false)
    check && (issuccess(F) || throw(LinearAlgebra.SingularException(0)))
    return F
end

n = 2^10
X = I + sprandn(n, n, 1e-3)

F = lu(X)
@show lu(X).L == lu!(F, X).L
@show lu(X).U == lu!(F, X).U
@show lu!(F, X) == lu_dot!(F, X)
println()
print("lu(X):        "); @btime lu($X);
print("lu!(F, X):    "); @btime lu!($F, $X);
print("lu_dot!(F, X):"); @btime lu_dot!($F, $X);


Output:

(lu(X)).L == (lu!(F, X)).L = true
(lu(X)).U == (lu!(F, X)).U = true
lu!(F, X) == lu_dot!(F, X) = true

lu(X):          141.900 μs (61 allocations: 505.51 KiB)
lu!(F, X):      71.200 μs (36 allocations: 220.17 KiB)
lu_dot!(F, X):  73.000 μs (31 allocations: 179.58 KiB)


Here, lu_dot!(F, X) is a modification of lu!(F, X) to reduce allocations with dot-syntax.

As a result, we have reduced the allocations from 505.51 KiB to 179.58 KiB. The remaining 179.58 KiB is allocated by umfpack_numeric!(F; reuse_numeric = false), the essential part of lu!(F, X).

print("umfpack_numeric!(F, reuse_numeric = false):")
let F = F, S = X
    zerobased = getcolptr(S)[1] == 0
    F.m = size(S, 1)
    F.n = size(S, 2)
    if zerobased
        F.colptr .= getcolptr(S)
        F.rowval .= rowvals(S)
    else
        F.colptr .= getcolptr(S) .- oneunit(eltype(S))
        F.rowval .= rowvals(S) .- oneunit(eltype(S))
    end
    F.nzval .= nonzeros(S)

    @btime umfpack_numeric!(F, reuse_numeric = false)
end;


Output:

umfpack_numeric!(F, reuse_numeric = false):  68.300 μs (31 allocations: 179.58 KiB)


This is all I could do at the moment.


Thank you very much Dr. Kuroki, for your detailed explanation.

It’s good to know that the memory allocation was not because of my code, but mainly due to UMFPACK itself.
As @rveltz pointed out, the amount of allocation is not prohibitive per loop, so probably I can live with this.
I will definitely try your lu_dot!(F, X) function and see the actual difference in my code. There is also a lot to learn from how you implemented this.

Thank you very much, it makes sense!

Judging from the amount of allocation in my actual code, I can probably live with this.
But in my use case I do need to know the peak memory usage (since I will run this on a cluster), so I still have to figure out how to measure or limit it.
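As a starting point for measuring the peak rather than the cumulative allocation, Julia's built-in Sys.maxrss() reports the process's maximum resident set size so far, which is typically the quantity a cluster's batch scheduler enforces. A minimal sketch:

```julia
# Peak resident set size (high-water mark) of the current Julia
# process, reported in MiB. This is cumulative-peak, not current usage.
peak_mib = Sys.maxrss() / 2^20
@show peak_mib
```

On a cluster, the scheduler's own job accounting (e.g. a job's reported maximum RSS) is another place to read this figure after a run.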