Sparse arrays: allocation versus speed

Hi! I have a question about allocations and speed trade-off when using Sparse Arrays.

TLDR question: Sparse matrix multiplication for my case is fast but allocates a lot of memory. With dense matrices my code is slower but allocations are 0, indicating a trade-off between the two. In the grand scheme of things, my whole program uses a lot of memory and spends about 10% of its time in garbage collection. Should I care about this large memory footprint? Can garbage collection “eat up” the savings from the sparsity?

Detailed question

I am building a model with a lot of simple linear algebra operations on large matrices that are iterated on thousands of times. I am providing an MWE below with one operation that exemplifies it: the matrix multiplication X' \Sigma^{-1}.

Here \Sigma^{-1} is block-diagonal with n \times n blocks and X is T \times n with a special structure containing a lot of zeroes. For a typical n the matrices have sparsities of 99% and 95%, respectively. Moreover, \Sigma^{-1} is a submatrix of a bigger matrix \Sigma_{full}^{-1}, where only a handful of the first rows and columns are “deleted”. I am providing examples with typical values for T and n, as well as with n = 17 (spoiler: @view and mul!() with sparse matrices do not work with that n).

My main code uses SparseArrays and I am comparing it to a dense multiplication. I am running the following lines from the MWE below:

@btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];  # sparse array implementation
@btime mul!($XtΣ_invD,$XsurD',$Σ_invD);                                     # dense array implementation with mul!()
@btime mul!($XtΣ_invD,$XsurD',$Σ_inv_viewD);                                # dense array implementation with @view
@btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_view;                                  # sparse array implementation with @view
@btime mul!($XtΣ_inv,$Xsur_sp',$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end]); # sparse array implementation with mul!()

Let’s look first at the timings with n = 2

n = 2, sparse array implementation:  59.300 μs (42 allocations: 478.45 KiB)
n = 2, dense matrix implementation:  83.000 μs (0 allocations: 0 bytes)
n = 2, dense matrix with view:  82.900 μs (0 allocations: 0 bytes)
n = 2, timing with sparse view:  129.793 ms (27 allocations: 395.69 KiB)
n = 2, timing with sparse mul!:  108.618 ms (16 allocations: 19.11 KiB)

On my main work computer the gap is even wider.

n = 2, sparse array implementation: 45.700 μs (42 allocations: 478.45 KiB)
n = 2, dense matrix implementation: 128.400 μs (0 allocations: 0 bytes)

I will call this line between 10,000 and 30,000 times on each run. For 10k calls the sparse footprint is ~5 GB. My full program reports a total of 40.920611 seconds (25.06 M allocations: 174.057 GiB, 9.01% gc time), so ~5 GB come just from this line.

Dense arrays give no allocations but, depending on the computer, take between roughly 1.4x and 3x the time.

If I go for n = 17, the gain from sparse is much greater, but the allocations are worse: ~171 MiB for one line!

n = 17, sparse array implementation:  37.380 ms (42 allocations: 171.29 MiB)
n = 17, dense matrix implementation:  239.931 ms (0 allocations: 0 bytes)
n = 17, dense matrix with view:  223.648 ms (0 allocations: 0 bytes)
n = 17, sparse view doesn't complete on my machine
n = 17, sparse mul!() doesn't complete on my machine

So I am wondering, should I care about these allocations? I do a lot of such manipulations and most of the code uses sparse matrices, which probably explains the current large footprint. If I bother to rewrite the code with normal matrices I expect to reduce the allocations drastically, and probably the gc time, but would that be worth it in terms of speed? In the end I only care about the total runtime of the program.
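As a rough way to answer this myself, I can time whole loops of the hot line and compare total wall time against the gctime that @timed reports. A minimal sketch, using the variables from the MWE further down (the 10_000 iterations just stand in for one run of my model):

using LinearAlgebra, SparseArrays

# rough check: does GC eat the sparse speed advantage over many iterations?
stats_sparse = @timed for _ in 1:10_000
    Xsur_sp' * Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end]
end

stats_dense = @timed for _ in 1:10_000
    mul!(XtΣ_invD, XsurD', Σ_invD)
end

# total wall time vs. time spent in garbage collection
@show stats_sparse.time stats_sparse.gctime
@show stats_dense.time stats_dense.gctime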

Last point: I don’t know if it is a bug (see this thread and the github issue there), but using @view or mul!() with sparse arrays is 1000 times slower for n = 2. For n = 17 it doesn’t finish computing on my machine, not just under @btime but even when running the line on its own.

MWE:

using LinearAlgebra
using SparseArrays
using Revise
using BenchmarkTools


# -------------------------------------------
# this code simply generates the necessary matrices
Tf = 242; p = 5; T=Tf-p; n = 2;
X = rand(T,n*p+1);
repX = kron(X,ones(n,1));
r,c = size(X);
idi = repeat(1:r*n,inner=c);
idj=repeat(1:c*n,r);
Σhatt_inv_sp = randn(n,n);
Σ_invsp_full = kron(sparse(Matrix(1.0I, Tf, Tf)),Σhatt_inv_sp);
# -------------------------------------------



# main matrix No. 1
Xsur_sp = sparse(idi,idj,vec(repX'));

# main matrix No. 2
Σ_invsp = Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];

# alternative as a view
Σ_invsp_view = @view Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];

# the line in question using sparse matrices
XtΣ_inv = Xsur_sp'*Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];  

# timing 
print("n = 2, sparse array impementation:")
@btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];


# -------------------------------------------
# alternative with dense matrices
# -------------------------------------------
XsurD = Matrix(Xsur_sp);
Σ_inv_full = Matrix(Σ_invsp_full);
XtΣ_invD = Matrix(XtΣ_inv);
Σ_invD   = Matrix(Σ_invsp);
Σ_inv_viewD = @view Σ_inv_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];


# the line in question with dense matrices
mul!(XtΣ_invD,XsurD',Σ_invD);

# timing 
print("n = 2, dense matrix impementation:")
@btime mul!($XtΣ_invD,$XsurD',$Σ_invD);
# timing with the view
print("n = 2, dense matrix with view:")
@btime mul!($XtΣ_invD,$XsurD',$Σ_inv_viewD);


# -------------------------------------------
# Side note: mul! with sparse matrices just hangs forever for n=17

print("n = 2, timing with sparse view:")
@btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_view;

print("n = 2, timing with sparse mul!:")
@btime mul!($XtΣ_inv,$Xsur_sp',$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end]);





# n = 17



# -------------------------------------------
# this code simply generates the necessary matrices
Tf = 242; p = 5; T=Tf-p; n = 17;
X = rand(T,n*p+1);
repX = kron(X,ones(n,1));
r,c = size(X);
idi = repeat(1:r*n,inner=c);
idj=repeat(1:c*n,r);
Σhatt_inv_sp = randn(n,n);
Σ_invsp_full = kron(sparse(Matrix(1.0I, Tf, Tf)),Σhatt_inv_sp);
# -------------------------------------------



# main matrix No. 1
Xsur_sp = sparse(idi,idj,vec(repX'));

# main matrix No. 2
Σ_invsp = Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];

# alternative as a view
Σ_invsp_view = @view Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];

# the line in question using sparse matrices
XtΣ_inv = Xsur_sp'*Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];  

# timing 
print("n = 17, sparse array impementation:")
@btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];


# -------------------------------------------
# alternative with dense matrices
# -------------------------------------------
XsurD = Matrix(Xsur_sp);
Σ_inv_full = Matrix(Σ_invsp_full);
XtΣ_invD = Matrix(XtΣ_inv);
Σ_invD   = Matrix(Σ_invsp);
Σ_inv_viewD = @view Σ_inv_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];


# the line in question with dense matrices
mul!(XtΣ_invD,XsurD',Σ_invD);

# timing 
println("n = 17, dense matrix impementation:")
@btime mul!($XtΣ_invD,$XsurD',$Σ_invD);

# timing with the view
print("n = 17, dense matrix with view:")
@btime mul!($XtΣ_invD,$XsurD',$Σ_inv_viewD);


# Side note: mul! with sparse matrices just hangs forever for n=17

print("n = 17, timing with sparse view never completes on my machine\n")
# @btime $XtΣ_inv = $Xsur_sp'*$Σ_invsp_view;

print("n = 17, timing with sparse mul!() never completes on my machine")
# @btime mul!($XtΣ_inv,$Xsur_sp',$Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end]);

I think the main issue is coming from the view operation on your right-hand side matrix. I see at least two options to get rid of the inefficiency:

  • code this view manually knowing the contents of a SparseMatrixCSC
  • replace SparseMatrixCSC with a type from BlockDiagonals.jl or BlockBandedMatrices.jl, for which view might be more efficient

I started with BlockBandedMatrices, but there were other places in the code where they were slower than the sparse arrays, so I wrote the code without them.

To be honest, I am not insistent on using the view with sparse arrays; in the small-array case it doesn’t cut allocations that much. I would prefer it if mul!() worked. Or would you expect the view to help a lot? I could try the block-banded approach again.

One of the issues here is that the result of your multiplication is a huge dense matrix, so encoding it as sparse doesn’t bring any benefits. Are you sure this is a realistic case?

julia> XtΣ_inv = Xsur_sp'*Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end]
1462×4029 SparseMatrixCSC{Float64, Int64} with 5890398 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎢⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎥
⋮   (display truncated; every row of the pattern is completely filled)
⎣⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⎦

julia> nnz(XtΣ_inv) / prod(size(XtΣ_inv))
1.0

Huh? Weird, I have to check this again. I looked at it at some point and got values of about 300k out of 5 million.

Thanks, I will go through the MWE and the function to find what I messed up.

From what I understand, the sparsity pattern of your right-hand side doesn’t change across the 10000 or so iterations, and only the values inside are updated? If so, I think your best bet might be to manually code the extraction of the bottom right corner of the sparse matrix, like so:

using SparseArrays, Random, BenchmarkTools

"""
    copyto_corner!(B, A, skip_rows, skip_cols)

Perform the same operation as `B .= A[(skip_rows+1):end, (skip_cols+1):end]`
but without allocations.

Assumes that the sparsity pattern of `B` is consistent with
that of the bottom-right corner of `A`, and only transfers the values.
"""
function copyto_corner!(
    B::SparseMatrixCSC, A::SparseMatrixCSC,
    skip_rows::Integer, skip_cols::Integer
)
    rvA, nzA = rowvals(A), nonzeros(A)
    rvB, nzB = rowvals(B), nonzeros(B)
    for j in axes(A, 2)
        if j <= skip_cols
            continue
        else
            kB = first(nzrange(B, j - skip_cols))
            for kA in nzrange(A, j)
                if rvA[kA] <= skip_rows
                    continue
                else
                    @assert rvA[kA] == rvB[kB] + skip_rows
                    nzB[kB] = nzA[kA]
                    kB += 1
                end
            end
        end
    end
    return B
end

Example:

julia> m, n = 10, 20
(10, 20)

julia> skip_rows, skip_cols = 2, 4
(2, 4)

julia> A = sprand(MersenneTwister(0), m, n, 0.3)
10×20 SparseMatrixCSC{Float64, Int64} with 64 stored entries:
⎡⣆⡐⣘⠄⠩⢁⠨⠘⡜⣬⎤
⎢⠂⡊⣡⡈⡤⠈⢨⡆⡞⡖⎥
⎣⠀⠀⠀⠀⠈⠃⠈⠐⠐⠀⎦

julia> B = A[(skip_rows + 1):end, (skip_cols + 1):end]
8×16 SparseMatrixCSC{Float64, Int64} with 43 stored entries:
⎡⠖⠡⠈⠰⠨⡀⣣⣛⎤
⎣⠚⠂⠫⡄⠸⢃⢃⠃⎦

julia> rand!(MersenneTwister(1), nonzeros(A));  # put new values in A

julia> B == A[(skip_rows + 1):end, (skip_cols + 1):end]
false

julia> copyto_corner!(B, A, skip_rows, skip_cols);  # update B

julia> B == A[(skip_rows + 1):end, (skip_cols + 1):end]
true

julia> @btime copyto_corner!($B, $A, $skip_rows, $skip_cols);
  69.744 ns (0 allocations: 0 bytes)

That’s right, the sparsity pattern doesn’t change, I will only be updating the values and doing the calculation.

Thank you for your time, that seems clever! Your code gives me the idea that I can instantiate a smaller \Sigma once and just update its elements, like I do for the big one. I can test that along with your function.
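A minimal sketch of what I mean, reusing your copyto_corner! and the MWE variables (the loop and the in-place update of nonzeros(Σ_invsp_full) are just placeholders for my actual model code):

skip = (Tf - T) * n

# allocate the smaller Σ⁻¹ once; its sparsity pattern never changes
Σ_invsp = Σ_invsp_full[skip+1:end, skip+1:end]

for iter in 1:10_000
    # ... update the values in nonzeros(Σ_invsp_full) in place here ...
    copyto_corner!(Σ_invsp, Σ_invsp_full, skip, skip)   # refresh values, 0 allocations
    XtΣ_inv = Xsur_sp' * Σ_invsp                        # the line in question, no slicing
end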

I will check the density of the output when I get the chance, but I am now starting to wonder: does the density of the output matter? I thought sparse matrices help when the inputs are sparse, since we don’t need to multiply many elements by zero.

Sparse matrices do help in this case, but in some scenarios sparse times sparse becomes dense, and that seems to be the case here. Then you might be better off running mul! into a dense matrix, but maybe that calls a method that is not optimized for sparse arrays. You can find out what is being called with the macro @which mul!(C, A, B)
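For example, a sketch with the matrices from your MWE, preallocating the dense output once:

# preallocate a dense output, then multiply the sparse inputs into it
C = Matrix{Float64}(undef, size(Xsur_sp, 2), size(Σ_invsp, 2))
@which mul!(C, Xsur_sp', Σ_invsp)   # shows which mul! method gets dispatched to
mul!(C, Xsur_sp', Σ_invsp)          # sparse inputs, dense output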

Good to know, I will check that, because mul!() was my first choice and it was 1000 times slower as shown above.

Meanwhile I figured out the sparsity.

I have two use cases: one where the blocks of the block-diagonal \Sigma^{-1} are dense, as in the MWE,

Σhatt_inv_sp = randn(n,n);
Σ_invsp_full = kron(sparse(Matrix(1.0I, Tf, Tf)),Σhatt_inv_sp);
XtΣ_inv = Xsur_sp'*Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];
julia> nnz(XtΣ_inv) / prod(size(XtΣ_inv))
1.0

and one where it is a diagonal matrix.

Σhatt_inv_sp = 1.0I(n);
Σ_invsp_full = kron(sparse(Matrix(1.0I, Tf, Tf)),Σhatt_inv_sp);
XtΣ_inv = Xsur_sp'*Σ_invsp_full[(Tf-T)*n+1:end, (Tf-T)*n+1:end];
julia> nnz(XtΣ_inv) / prod(size(XtΣ_inv))
0.058823529411764705

I checked with the second case and didn’t check the first.

Geesh, I wish there were a language where I could dispatch to the fastest operation depending on different inputs :smiley:

I would like to give some closure to the topic.

I rewrote the code to use regular matrices instead of sparse ones and could leverage mul!(), ldiv!(), and @view very efficiently. I got a tremendous speedup and far fewer allocations.

Here are the timings for the function itself with 4 implementations, where the first two use sparse matrices and the last two do not, measured only with @time:

  0.721347 seconds (944 allocations: 266.088 MiB, 41.04% gc time)
  0.727664 seconds (914 allocations: 265.672 MiB, 40.43% gc time)
  0.158879 seconds (744 allocations: 33.713 MiB, 1.27% gc time)
  0.170166 seconds (722 allocations: 33.677 MiB, 2.59% gc time)

And here is the function as part of a larger model:

 16.876194 seconds (391.23 k allocations: 8.699 GiB, 23.92% gc time, 1.13% compilation time)
 14.811057 seconds (263.82 k allocations: 8.685 GiB, 16.73% gc time, 0.33% compilation time)
  6.620201 seconds (260.26 k allocations: 4.154 GiB, 16.99% gc time, 0.61% compilation time)
  6.404180 seconds (260.03 k allocations: 4.154 GiB, 12.55% gc time, 0.63% compilation time)

At least for my application, where I am doing a lot of basic linear algebra operations on the same matrices, even with very high (context-dependent) sparsity, sparse matrices appear not to be the optimal choice. For the context of readers of this thread, the calculations I am doing involve computing

\mathbf{K}_\beta=\mathbf{V}_{\mathrm{Minn}}^{-1}+\mathbf{X}^{\prime}\left(\mathbf{I}_T \otimes \widehat{\Sigma}^{-1}\right) \mathbf{X}, and
\mathbf{C}_{\mathbf{K}_\beta}^{\prime} \backslash\left(\mathbf{C}_{\mathbf{K}_\beta} \backslash\left(\mathbf{V}_{\mathrm{Minn}}^{-1} \beta_{\mathrm{Minn}}+\mathbf{X}^{\prime}\left(\mathbf{I}_T \otimes \widehat{\Sigma}^{-1}\right) \mathbf{y}\right)\right),
where \mathbf{C}_{\mathbf{K}_\beta} is the lower Cholesky factor and \mathbf{C}_{\mathbf{K}_\beta}^{\prime} is the upper Cholesky factor. The matrices are of size, for example, 3555 \times 1140 for \mathbf{X}, while \mathbf{K}_\beta is 1140 \times 1140. For some applications, as shown above, the density is around 5%, while for others some matrices are not sparse at all.
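For completeness, here is a minimal sketch of a dense, in-place version of these two formulas. It is not my actual function: the sizes are the example ones above, and \widehat{\Sigma}^{-1} is set to the identity just so that K_β is positive definite.

using LinearAlgebra

# placeholder inputs (Σ̂⁻¹ = I here, so X'(I_T ⊗ Σ̂⁻¹) is just X')
X          = randn(3555, 1140)
y          = randn(3555)
V_Minn_inv = Matrix(1.0I, 1140, 1140)
β_Minn     = zeros(1140)
XtΣinv     = Matrix(X')                  # stands in for X'(I_T ⊗ Σ̂⁻¹)

# K_β = V_Minn⁻¹ + X'(I_T ⊗ Σ̂⁻¹)X, built into a preallocated buffer
K_β = Matrix{Float64}(undef, 1140, 1140)
mul!(K_β, XtΣinv, X)
K_β .+= V_Minn_inv

# C_{K_β}' \ (C_{K_β} \ (V_Minn⁻¹ β_Minn + X'(I_T ⊗ Σ̂⁻¹) y))
rhs = V_Minn_inv * β_Minn .+ XtΣinv * y
C   = cholesky!(Symmetric(K_β))          # in-place Cholesky, overwrites K_β's storage
β   = C.U \ (C.L \ rhs)                  # or ldiv!(C, rhs) to solve fully in place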
