QR = qr(A::SparseMatrixCSC)     # A = sparse matrix of size (1768, 39816100)
y = QR.Q' * x[QR.prow]          # slow for x::SparseVector
y[size(A,2)+1:end] .= 0         # keep only the components in range(A)
p = QR.Q * y
invpermute!(p, QR.prow)         # undo the row permutation (p[QR.prow] = p aliases p)

The multiplication of the (dense) QR.Q' with the permuted sparse vector x[QR.prow] is slow. Julia currently computes the vector x[QR.prow] first and then does an optimized matrix-vector multiplication. Note that QR is a SuiteSparse factorization object returned by the sparse QR decomposition and, as far as I know, its Q is typically dense and stored in a form optimized for multiplication with a vector. Any ideas on how to improve this further?
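For concreteness, here is a self-contained version of the projection above on small, made-up sizes, in the tall orientation (where the truncation `y[size(B,2)+1:end] .= 0` actually removes components); `B` is a stand-in name for the sparse matrix:

```julia
using LinearAlgebra, SparseArrays

# Small stand-in for the real problem; B plays the role of the tall sparse matrix.
B = sprand(1_000, 50, 0.1)
x = randn(1_000)

QR = qr(B)                      # SPQR factorization: B[QR.prow, QR.pcol] = Q * R
y  = QR.Q' * x[QR.prow]         # apply Q' to the row-permuted vector
y[size(B, 2)+1:end] .= 0        # keep only the components spanning range(B)
p  = QR.Q * y                   # map back
invpermute!(p, QR.prow)         # undo the row permutation

# p is the orthogonal projection of x onto range(B), i.e. B' * (x - p) ≈ 0
```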

Edit: I might have found the actual bottleneck: even y = QR.Q' * x is slow, so the problem might not be due to getindex. I think I am effectively computing a dense matrix-vector multiplication.
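One thing that may be worth trying (an assumption about the slow path, not verified on the original sizes): applying SPQR's Q is a sequence of Householder reflections, which is fast against a dense vector, while a SparseVector argument may fall back to a generic method. Since the vector being multiplied has length size(A, 1) = 1768, densifying it once is cheap:

```julia
using LinearAlgebra, SparseArrays

B  = sprand(2_000, 100, 0.05)   # stand-in sizes
x  = sprandn(2_000, 0.01)       # sparse right-hand side, as in the question
QR = qr(B)

xd = Vector(x)                  # densify once; the vector is short, so this is cheap
y  = QR.Q' * xd[QR.prow]        # dense gather + optimized Householder applies
```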

The x[QR.prow] allocates a bunch of new memory, which may be what slows you down here, as I believe x is somewhat large. Sadly, using a view here doesn't seem to help much:

julia> @time y = QR.Q' * x[QR.prow];
0.000117 seconds (2.18 k allocations: 126.141 KiB)
julia> @time y = QR.Q' * @view x[QR.prow];
0.000104 seconds (2.17 k allocations: 125.047 KiB)

I think at this point an expert on sparse matrices etc. is needed.

Nevertheless, doesn’t the suggestion at the end of the post you linked do what you want? This one:

Thanks for your suggestion. The suggested method works for the full-rank case, but I am developing an algorithm for the rank-deficient case, which requires the former method. I think A\x dispatches to qr(A)\x, which computes a basic least-squares solution (not suitable for the rank-deficient case).
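For what it's worth, in the rank-deficient case the truncation index can come from the factorization itself: SPQR computes a numerical rank, exposed as `rank(QR)` (treat the exact accessor as an assumption, depending on your Julia version). A sketch with a deliberately rank-deficient matrix:

```julia
using LinearAlgebra, SparseArrays

B = sprand(500, 40, 0.1)
B = [B B[:, end]]               # duplicate the last column: 41 columns, rank <= 40
x = randn(500)

QR = qr(B)
r  = rank(QR)                   # numerical rank detected by SPQR

y = QR.Q' * x[QR.prow]
y[r+1:end] .= 0                 # truncate at the numerical rank, not at size(B, 2)
p = QR.Q * y
invpermute!(p, QR.prow)         # p is the orthogonal projection of x onto range(B)
```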

Also, in some cases A::Adjoint{<:Any, <:AbstractSparseMatrix}, and due to memory issues I cannot materialize the adjoint (I get an OutOfMemoryError with sparse A'*A), so I have to work with the orthogonal projection via A.parent.
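On the OutOfMemoryError: if what you ultimately need are products with the Gram matrix, they never require forming A'*A explicitly; with A::Adjoint you can chain two sparse matvecs through A.parent. A sketch with made-up sizes (`P` and `gram_mul` are illustrative names):

```julia
using LinearAlgebra, SparseArrays

P = sprand(10_000, 50, 1e-3)    # stand-in for A.parent (tall)
A = P'                          # A::Adjoint — a short, wide view of P

# (A' * A) * v == P * (P' * v): two sparse matvecs,
# without ever materializing the huge 10_000 x 10_000 product
gram_mul(v) = P * (P' * v)
```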