Speed up matrix multiplication with permuted vector

I have a very large sparse matrix of size (1768, 39816100) that goes into the following (part of Computing sparse orthogonal projections - #2 by stevengj):

QR = qr(A::SparseMatrixCSC) # A = sparse matrix of size (1768, 39816100)
y = QR.Q' * x[QR.prow] # slow for x::SparseVector
y[size(A,2)+1:end] .= 0 # zero the components orthogonal to the column space of A (full-rank case)
p = QR.Q * y
p[QR.prow] = p # undo the row permutation

The multiplication of the (dense) QR.Q' with the permuted sparse vector x[QR.prow] is slow. Julia currently computes the vector x[QR.prow] first and then does an optimized matrix-vector product. Note that QR is a SuiteSparse factorization object returned by the sparse QR decomposition and, AFAIK, Q is typically dense but stored in an optimized way for multiplying with a vector. Any ideas on how to improve this further?
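In other words, as far as I understand it, the expression above is evaluated in two steps, roughly like this (just a sketch of my reading of it, not of any internal API):

xp = x[QR.prow] # getindex builds a new SparseVector first
y = QR.Q' * xp  # then Q' is applied to that vector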

What are x and A?

Edit: In terms of how they are constructed, what their types are, etc. It makes a difference for checking which code gets called :slight_smile:


That’s not a documented field of the object returned by qr(A); where does this come from?


Sorry, I am editing the original code as you asked :grinning:

Edit: I might have found the actual bottleneck: even y = QR.Q' * x is slow, so the problem might not be due to getindex. I think I am computing a dense matrix-vector multiplication.
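One way to check this hypothesis (a sketch; it assumes BenchmarkTools is installed and that x fits in memory as a dense Vector):

using BenchmarkTools
Qt = QR.Q'
xd = Vector(x)      # densified copy of x
@btime $Qt * $x;    # current call with the SparseVector
@btime $Qt * $xd;   # same product with a plain dense Vector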

Sorry, I still have trouble constructing this x :sweat_smile:

Does this look correct for an MWE? (Please provide a copy-pastable example next time :slight_smile:)

julia> A = sprand(10,20,0.5)                                  
10×20 SparseMatrixCSC{Float64, Int64} with 93 stored entries: 
⡙⠂⢍⣘⢪⣎⠴⢁⣹⡷                                                    
⠟⣅⢆⢀⣗⣠⡡⠨⡈⢬                                                    
⠈⠚⠘⠛⠂⠉⠉⠈⠊⠚                                                    
                                                              
julia> x = sprand(10, 0.5)                                    
10-element SparseVector{Float64, Int64} with 5 stored entries:
  [1 ]  =  0.766693                                           
  [2 ]  =  0.66149                                            
  [4 ]  =  0.385874                                           
  [8 ]  =  0.90622                                            
  [10]  =  0.931918                                           

The x[QR.prow] allocates a bunch of new memory, which may be what slows you down here, as I believe your x is quite large. Sadly, using a view here doesn’t seem to help much (see the sketch after the timings):

julia> @time y = QR.Q' * x[QR.prow];                
  0.000117 seconds (2.18 k allocations: 126.141 KiB)
                                                    
julia> @time y = QR.Q' * @view x[QR.prow];          
  0.000104 seconds (2.17 k allocations: 125.047 KiB)
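If the allocation itself is part of the problem, one could preallocate a dense buffer and scatter the nonzeros of x into it with the permutation already applied, instead of building x[QR.prow] on every call. A rough sketch (the names ip and xp are made up here; it assumes x is a SparseVector and QR.prow is a permutation of 1:length(x)):

using SparseArrays

ip = invperm(QR.prow)             # compute once
xp = zeros(eltype(x), length(x))  # reusable dense buffer

fill!(xp, 0)                      # reset between calls
nzind, nzval = findnz(x)
for (i, v) in zip(nzind, nzval)
    xp[ip[i]] = v                 # fills xp so that xp == x[QR.prow]
end
y = QR.Q' * xp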

I think at this point an expert on sparse matrices etc. is needed :sweat_smile:

Nevertheless, doesn’t the suggestion at the end of the post you linked do what you want? This one:


I checked: just computing x[QR.prow] is also slow.

Thanks for your suggestion. The suggested method works for the full-rank case, but I am developing an algorithm for the rank-deficient case, which requires the former method. I think A\x dispatches to qr(A)\x, which computes a basic least-squares solution (not suitable for the rank-deficient case).
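Concretely, the rank-deficient variant I have in mind keeps only the first r components of y, where r is the numerical rank, instead of the first size(A,2). Something like this (a sketch; the rank estimate below is crude and would need a proper tolerance):

using LinearAlgebra, SparseArrays

r = count(!iszero, diag(QR.R))  # crude numerical-rank estimate from the R factor
y = QR.Q' * xp                  # xp = permuted right-hand side, as above
y[r+1:end] .= 0                 # keep only the first r components
p = QR.Q * y
p[QR.prow] = p                  # undo the row permutation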

Also, in some cases A::Adjoint{<:Any, <:AbstractSparseMatrix}, and due to memory issues I cannot materialize the adjoint (I get an OutOfMemoryError with sparse A'*A) and have to work with the orthogonal projection via A.parent.
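For what it’s worth, when only the action of A'*A on a vector is needed (e.g. inside an iterative solver), it does not have to be materialized; with B = A.parent it can be applied as two sparse matrix-vector products (a sketch):

B = A.parent               # the underlying SparseMatrixCSC
applyAtA(v) = B * (B' * v) # action of A' * A == B * B' without ever forming it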