OutOfMemoryError with sparse A'*A

I have to multiply large sparse matrices in the following orthogonal projection

orthProject(A::SparseMatrixCSC, v::SparseVector) = A * ((A' * A) \ Vector(A' * v))

but I get this error:

ERROR: OutOfMemoryError()
Stacktrace:
 [1] Array
   @ .\boot.jl:448 [inlined]
 [2] Array
   @ .\boot.jl:457 [inlined]
 [3] ones
   @ .\array.jl:503 [inlined]
 [4] ones
   @ .\array.jl:499 [inlined]
 [5] ftranspose(A::SparseMatrixCSC{Float64, Int64}, f::Function, eltype::Type{Float64})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1056
 [6] copy
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1064 [inlined]
 [7] *(A::SparseMatrixCSC{Float64, Int64}, B::Adjoint{Float64, SparseMatrixCSC{Float64, Int64}})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\linalg.jl:170
 [8] top-level scope
   @ none:1

Do you have a fix for this?

For one, don’t use the normal equations. This is just A\v, which does the same thing but is faster, uses less memory, and is more accurate.
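This is easy to sanity-check on a toy problem (the sizes, density, and seed here are made up for illustration, not the thread’s actual data): the normal-equations solution and `A \ v` agree, but the latter never forms `A' * A`:

```julia
using SparseArrays, LinearAlgebra, Random

Random.seed!(0)                  # reproducible toy data
A = sprandn(200, 20, 0.3)        # tall sparse matrix, (almost surely) full column rank
v = randn(200)

x_normal = (A' * A) \ (A' * v)   # normal equations: forms A'A and squares the condition number
x_direct = A \ v                 # sparse QR least squares: same answer, no A'A
```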

6 Likes

It would also help if you shared your matrix sizes. Note that A'A is in general much less sparse than A unless you expect some serious cancellation:

using SparseArrays
A = sprandn(100,100,0.1)
@show nnz(A)/prod(size(A))    # about 0.1
@show nnz(A'A)/prod(size(A)) # ...not about 0.1. Happened to be 0.65 on my sample.

so that might literally be the source of your error.

2 Likes

@Oscar_Smith I am not familiar with the term “normal equations”. Do you mean something like this: Normal Equation - ML Wiki?

@cgeoga Sorry I should have given more details. This is my system:

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.18363.1679]
  CPU: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz:
                 speed         user         nice          sys         idle          irq
       #1-20  3696 MHz  7148222195            0    362645975    33488946257      4022803  ticks

  Memory: 255.6930160522461 GB (243026.65625 MB free)
  Uptime: 2.04999e6 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)

With the same definition as in the OP, I get

orthProject(A',c) # ERROR: OutOfMemoryError()
size(A') # (1301264932900, 172696)
size(c) # (1301264932900,)
nnz(A')/prod(size(A')) # 1.6764228976056928e-11
nnz(c)/prod(size(c)) # 1.536966031615713e-12

Then I tried running individual steps in the function orthProject, with A replaced by A':

v = A*c;
nnz(v)/prod(size(v)) # 9.26483531755223e-5
typeof(v) # SparseVector{Float64, Int64}
Ap = A';
typeof(Ap) # SparseMatrixCSC{Float64, Int64}
AAp = A*Ap; # ERROR: OutOfMemoryError()
typeof(A) # Adjoint{Float64, SparseMatrixCSC{Float64, Int64}}
@which A*Ap

Maybe the problem is with dispatch?

*(A::Adjoint{var"#s832", var"#s831"} where {var"#s832", var"#s831"<:SparseArrays.AbstractSparseMatrixCSC},
B::Union{SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, SubArray{Tv, 2, var"#s832", Tuple{Base.Slice{Base.OneTo{Int64}}, I}, L} 
where {I<:AbstractUnitRange, var"#s832"<:SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, L}, LowerTriangular{Tv, var"#s831"} 
where var"#s831"<:Union{SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, SubArray{Tv, 2, var"#s832", Tuple{Base.Slice{Base.OneTo{Int64}}, I}, L} 
where {I<:AbstractUnitRange, var"#s832"<:SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, L}}, UpperTriangular{Tv, var"#s832"} 
where var"#s832"<:Union{SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, SubArray{Tv, 2, var"#s832", Tuple{Base.Slice{Base.OneTo{Int64}}, I}, L} 
where {I<:AbstractUnitRange, var"#s832"<:SparseArrays.AbstractSparseMatrixCSC{Tv, Ti}, L}}} 
where {Tv, Ti}) in SparseArrays at C:\Users\Thomas\AppData\Local\Programs\Julia-1.6.1\share\julia\stdlib\v1.6\SparseArrays\src\linalg.jl:173

My original matrix A is given as an Adjoint type. How can I explicitly compute (materialize?) it so that I have a SparseMatrixCSC type? copy(A) runs into OutOfMemoryError.

I think if I can hit the dispatch that calls the sparse multiplication directly, I can bypass the memory problem.

Thanks for your help!

Do you have any other large objects in scope? It seems like the actual number of floats in that matrix isn’t particularly demanding:

Base.format_bytes(prod(size(A))*sizeof(Float64)*1.6765e-11) # ≈ 29 MiB

So if just copy(A) is giving you an error then I have to imagine that something else is going on here. Especially when you have 255 GB (!) of RAM.

Like I was saying earlier, it is conceivable that A*A' does just have too many nonzeros. Maybe I’m mistaken in following when A is transposed and when it is not, but the output of that will be a 1301264932900×1301264932900 matrix, which is…substantial. But if just copy(A) fails, then I’m not sure what’s going on. Adjoint operations do “just work” for sparse arrays in my experience, so I don’t think it’s an issue of instantiation. As another sanity check, did you run all of this in a fresh REPL?

And as another sanity check, did you try @Oscar_Smith’s suggestion? To answer your question to him: yes, that is what he is referring to. A\x gives you a least squares solution, which coincides with inv(A)*x when A is invertible, but gives you inv(A'*A)*A'*x when A is rectangular (although obviously not computed like that). I bet that would “just work” for you with no memory management thoughts. My above comment was really just for culture and to point out what was literally causing the error. Regardless of whether the error can be mitigated, @Oscar_Smith’s advice is definitely the best solution.

EDIT: I really need to proofread my comments better.

EDIT: A factual inaccuracy about the behavior of \ in an earlier version, sorry.
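A small dense sketch of that equivalence (toy sizes and randn data are my own, not the matrices from this thread):

```julia
using LinearAlgebra

A = randn(50, 8)    # rectangular: more rows than columns
x = randn(50)

ls  = A \ x         # least-squares solution via pivoted QR
pls = pinv(A) * x   # agrees with ls, but pinv is far more expensive
```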

1 Like

Thanks for your help @cgeoga. This is a varinfo() of my current session (almost a fresh REPL):

 name                    size summary                                                                  
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  A                 58.802 MiB 172696Γ—1301264932900 adjoint(::SparseMatrixCSC{Float64, Int64}) with eltype Float64
  Ap                58.802 MiB 1301264932900Γ—172696 SparseMatrixCSC{Float64, Int64}                     
  Base                         Module                                                                   
  Core                         Module                                                                   
  InteractiveUtils 261.437 KiB Module                                                                   
  L1DIR               63 bytes 55-codeunit String                                                       
  L2DIR               63 bytes 55-codeunit String                                                       
  L2SYM               66 bytes 58-codeunit String                                                       
  Main                         Module                                                                   
  ans                1.419 KiB Method                                                                   
  b                  3.242 KiB 172696-element reshape(::SparseMatrixCSC{Int64, Int64}, 172696) with eltype Int64
  c                  136 bytes 1301264932900-element SparseVector{Float64, Int64}                       
  mmread2primal        0 bytes typeof(mmread2primal)                                                    
  mmread_threaded      0 bytes typeof(mmread_threaded)                                                  
  symmetrize           0 bytes typeof(symmetrize)                                                       
  v                  360 bytes 172696-element SparseVector{Float64, Int64}            

In the left division M\v, v is then a sparse vector of length 172696 and M should be a sparse matrix of size 172696 by 172696 (I don’t actually know the density, but I think it should be sparse, and even if not, a dense matrix of that size still fits into my memory).
I think you are probably right about the density part. Earlier I did a sloppy test multiplying two sparse matrices created by sprand, and it worked fine for square matrices of size 1e9. Now I have tested more carefully, with the same density as my sparse input matrix:

fakeA = sprand(1301264932900, 172696, density); #ok no problem

fakeA = sprand(172696, 1301264932900, density); # SWITCH SIZE
ERROR: OutOfMemoryError()
Stacktrace:
 [1] Array
   @ .\boot.jl:448 [inlined]
 [2] sparse_sortedlinearindices!(I::Vector{Int64}, V::Vector{Float64}, m::Int64, n::Int64)
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1562
 [3] sprand(r::Random.MersenneTwister, m::Int64, n::Int64, density::Float64, rfn::typeof(rand), ::Type{Float64})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1602
 [4] sprand
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1612 [inlined]
 [5] sprand(m::Int64, n::Int64, density::Float64)
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1610
 [6] top-level scope
   @ REPL[48]:1

Is this too big for SparseMatrixCSC? This seems consistent with varinfo(), where I do not have copy(A) stored directly (only the Adjoint object):

  A                 58.802 MiB 172696Γ—1301264932900 adjoint(::SparseMatrixCSC{Float64, Int64}) with eltype Float64
  Ap                58.802 MiB 1301264932900Γ—172696 SparseMatrixCSC{Float64, Int64}       

Regarding @Oscar_Smith’s suggestion, I will check Julia’s dispatch again to see if it is the method that I need. I basically used the advice in Computing sparse orthogonal projections for orthProject. Maybe I need to revisit that thread now. But to reiterate, I only need to compute this (as in (A' * A) \ Vector(A' * v) in orthProject) once, so I don’t really need the pseudoinverse. Anyway, a quick test (both dispatch to @ SuiteSparse.SPQR):

Ap\c;
ERROR: Sparse QR factorization failed
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] _qr!(ordering::Int32, tol::Float64, econ::Int64, getCTX::Int64, A::SuiteSparse.CHOLMOD.Sparse{Float64}, Bsparse::Ptr{Nothing}, Bdense::Ptr{Nothing}, Zsparse::Ptr{Nothing}, Zdense::Ptr{Nothing}, R::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Sparse{Float64}}}, E::Base.RefValue{Ptr{Int64}}, H::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Sparse{Float64}}}, HPinv::Base.RefValue{Ptr{Int64}}, HTau::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Dense{Float64}}})
   @ SuiteSparse.SPQR C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:74
 [3] qr(A::SparseMatrixCSC{Float64, Int64}; tol::Float64, ordering::Int32)
   @ SuiteSparse.SPQR C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:205
 [4] qr
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:198 [inlined]
 [5] \(A::SparseMatrixCSC{Float64, Int64}, B::SparseVector{Float64, Int64})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\linalg.jl:1569
 [6] top-level scope
   @ REPL[27]:1


A\c;
ERROR: Sparse QR factorization failed
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] _qr!(ordering::Int32, tol::Float64, econ::Int64, getCTX::Int64, A::SuiteSparse.CHOLMOD.Sparse{Float64}, Bsparse::Ptr{Nothing}, Bdense::Ptr{Nothing}, Zsparse::Ptr{Nothing}, Zdense::Ptr{Nothing}, R::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Sparse{Float64}}}, E::Base.RefValue{Ptr{Int64}}, H::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Sparse{Float64}}}, HPinv::Base.RefValue{Ptr{Int64}}, HTau::Base.RefValue{Ptr{SuiteSparse.CHOLMOD.C_Dense{Float64}}})
   @ SuiteSparse.SPQR C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:74
 [3] qr(A::SparseMatrixCSC{Float64, Int64}; tol::Float64, ordering::Int32)
   @ SuiteSparse.SPQR C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:205
 [4] qr
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SuiteSparse\src\spqr.jl:198 [inlined]
 [5] \(xformA::Adjoint{Float64, SparseMatrixCSC{Float64, Int64}}, B::SparseVector{Float64, Int64})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\linalg.jl:1593
 [6] top-level scope
   @ REPL[30]:1

Edit: full error message for left division.

All super helpful, thanks for the information. Honestly I think this is getting a bit out of my depth. If the sparse constructor is failing, then there is presumably some implementation detail that I am not familiar with that is getting in the way. I doubt that the constructor is naive enough to allocate a buffer of 1301264932900 Int64s, but maybe there is some kind of intermediate allocation that really speeds things up but at these sizes can itself cause problems.

I recognize it’s probably annoying that we keep pushing you to \, but to be clear, it definitely isn’t computing a pseudoinverse in the background. It gives you back something that agrees with what you’d get if you applied the pseudoinverse, but it’s much smarter than a call to pinv. For sparse matrices it goes through SuiteSparse’s SPQR routines (as your stacktraces show), which are seriously optimized.

That error code for the QR factorization is not super informative, though. I’m sorry about that. If your matrix is low rank and you know it, have you considered something like a randomized range finder? I have an implementation in just a few lines that I’d be happy to share.

EDIT: Okay, I didn’t realize that you had switched the size of each dimension in your instantiation tests. Presumably this is an issue with CSC storage, although I don’t think I know enough to comment beyond that.

1 Like

Following the Stacktrace in my OP:

 [5] ftranspose(A::SparseMatrixCSC{Float64, Int64}, f::Function, eltype::Type{Float64})
   @ SparseArrays C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\SparseArrays\src\sparsematrix.jl:1056

does indeed cause an OutOfMemoryError(), because it calls ones with a length that is too big in my case:

julia>  ones(Int64, size(A,1)+1); # length-172697 vector of ones: OK

julia>  ones(Int64, size(Ap,1)+1); # length-1301264932901 vector of ones
ERROR: OutOfMemoryError()
Stacktrace:
 [1] Array
   @ .\boot.jl:448 [inlined]
 [2] Array
   @ .\boot.jl:457 [inlined]
 [3] ones
   @ .\array.jl:503 [inlined]
 [4] ones(#unused#::Type{Int64}, dims::Int64)
   @ Base .\array.jl:499
 [5] top-level scope
   @ REPL[66]:1

So A'*A or A*A' will always dispatch through copy(A'), which then calls ones, and I run into the memory issue when one dimension is too large.
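The arithmetic behind that failure is easy to check (the column count below is taken from the sizes quoted earlier in the thread): CSC stores one Int64 column pointer per column, plus one, regardless of how few nonzeros the matrix holds.

```julia
# Column-pointer array of a CSC matrix with ~1.3e12 columns:
ncols = 1_301_264_932_900
colptr_bytes = (ncols + 1) * sizeof(Int64)
Base.format_bytes(colptr_bytes)   # ≈ 9.5 TiB, hopeless even with 255 GB of RAM
```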

I will be following up on the other direction that uses \ more efficiently.

1 Like

I should have followed up a bit myself in my last comment, sorry. Here is the line in 1.6.2 that appears in my stacktrace after running your test:

fakeA = sprand(1301264932900, 172696, density); # ok no problem
fakeA = sprand(172696, 1301264932900, density); # error

I’m not sure if this warrants a PR to add a more helpful error message at least, but presumably this is an understood tradeoff of the CSC storage format (?).

It always feels rude to summon busy people, but @andreasnoack or some other LinearAlgebra decider: do you think a more verbose error message about this would make sense? I’m not sure if there’s a way to check whether an allocation would fail without actually attempting it, but something like Error("Please build the transpose of this matrix") might be helpful.

I think the real issue here is that A'*A ends up materializing the transpose via copy(A') instead of using a specialized symmetric multiplication. Unfortunately, we don’t yet have such a routine, either natively or via a wrapped library. However, it’s available from MKL, so it should probably not be too much work to wrap mkl_sparse_syrk in MKLSparse.jl.

Regarding the QR approach, I think the factorization simply creates too much fill-in, which then takes up too much memory. The first dimension of A is huge. Hence, the original implementation is probably the more efficient approach.
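For completeness, here is a hedged sketch of computing AᵀA from the stored columns of a CSC matrix without ever materializing Aᵀ (gram_by_columns is my own illustrative helper, and its O(n²) sparse dot products are viable only for a modest column count, nothing like the sizes in this thread):

```julia
using SparseArrays, LinearAlgebra

# Illustrative only: form the dense Gram matrix G = AᵀA by taking sparse
# dot products of the stored columns of A, never materializing Aᵀ.
function gram_by_columns(A::SparseMatrixCSC)
    n = size(A, 2)
    G = zeros(eltype(A), n, n)
    for j in 1:n, i in 1:j
        G[i, j] = G[j, i] = dot(A[:, i], A[:, j])  # A[:, i] is a SparseVector
    end
    return G
end

A = sprandn(1000, 8, 0.2)  # toy sizes
```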

3 Likes