Scaling a sparse matrix row-wise and column-wise too slow

pitsianis · June 21, 2024, 12:51am

I need to scale a sparse matrix row-wise and column-wise. That is

S[i,j] = r[i] * A[i,j] * c[j]

If I write it the way you’d see it in a math book,

S = spdiagm(r) * A * spdiagm(c)

the expression takes 8x times more than the MATLAB

S = r .* A .* c.'

And if I attempt the above in Julia, it either runs forever, or runs out of memory …

Obviously, S has the same non-zero pattern as A and one can restrict the indices i and j accordingly, so somewhere between the Broadcast and the SparseArrays the ball is lost… Where do I go to get this scaling to run efficiently and competitively?

Here is a MWE to play with

using LinearAlgebra, SparseArrays, BenchmarkTools

n = 10_000_000
A = sprand(n,n,2/n)
r = rand(n,1); c = rand(n,1)
S = r .* A .* c'    # ERROR: OutOfMemoryError()

@benchmark $S = spdiagm($(vec(r))) * $A * spdiagm($(vec(c)))

While in MATLAB

>> A = sprand(10e6,10e6,2/10e6);
>> r = rand(10e6,1);
>> c = rand(10e6,1);
>> timeit (@() r .* A .* c.')

ans =

    0.2720

which is 8.5 times faster than the Julia benchmark code above

gdalle · June 21, 2024, 6:28am

I encountered this problem while working on HiddenMarkovModels.jl. It was a while ago but I think the issue is that at least one of these multiplications results in a dense matrix.
Here’s the code I wrote to get around it:

github.com

gdalle/HiddenMarkovModels.jl/blob/a8b048abdd0305528b62e1efa8efd6b74621c4b5/src/utils/linalg.jl#L14-L44


      
          """
              mul_rows_cols!(B, l, A, r)
          
          Perform the in-place operation `B .= l .* A .* transpose(r)`.
          """
          function mul_rows_cols!(
              B::AbstractMatrix, l::AbstractVector, A::AbstractMatrix, r::AbstractVector
          )
              B .= l .* A .* transpose(r)
              return B
          end
          
          function mul_rows_cols!(
              B::SparseMatrixCSC, l::AbstractVector, A::SparseMatrixCSC, r::AbstractVector
          )
              @argcheck axes(A, 1) == eachindex(r)
              @argcheck axes(A, 2) == eachindex(l)
              @argcheck size(A) == size(B)
              @argcheck nnz(B) == nnz(A)
              Brv = rowvals(B)

This file has been truncated. show original

jishnub · June 21, 2024, 7:27am

Please report this issue to SparseArrays.jl, as it looks like some 3-term multiplication methods need to be added there.

DNF · June 21, 2024, 7:39am

Why are you using sparse r and c matrices in Julia, but regular vectors in Matlab? Have you tried

r = rand(n); c = rand(n);  # don't use rand(n, 1)! 
@benchmark $r .* $A .* $c'

?

foobar_lv2 · June 21, 2024, 9:09am

Just do it by hand:

function scaleTheThing!(A,R,C)
   for col = 1:A.n
      c = C[col]
      for ptr = A.colptr[col]:(A.colptr[col+1]-1)
         A.nzval[ptr] = (R[A.rowval[ptr]] * A.nzval[ptr]) * c
      end
   end
   A
end

You can verify correctness by

julia> compare(A,R,C) = scaleTheThing!(copy(A),R,C)
julia> reference(A,R,C)=spdiagm(R)*A*spdiagm(C)
julia> N=1000; M = 500 ;p=0.05; A=sprand(N,M,p); R=rand(N);C=rand(M);
julia> reference(A,R,C) == compare(A,R,C)
true

You can benchmark:

julia> N=10_000_000; M = N; p=2/N; A=sprand(N,M,p); R=rand(N);C=rand(M);
julia> @benchmark compare($A,$R,$C)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  312.226 ms … 412.754 ms  ┊ GC (min … max): 0.32% … 19.48%
 Time  (median):     314.742 ms               ┊ GC (median):    0.34%
 Time  (mean ± σ):   331.734 ms ±  32.353 ms  ┊ GC (mean ± σ):  4.49% ±  7.60%

  █▅                                                             
  ██▅▁▁▁▁▅▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  312 ms           Histogram: frequency by time          413 ms <

 Memory estimate: 381.46 MiB, allocs estimate: 9.

julia> @benchmark scaleTheThing!($A,$R,$C)
BenchmarkTools.Trial: 20 samples with 1 evaluation.
 Range (min … max):  251.287 ms … 279.071 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     262.039 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   262.480 ms ±   7.348 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▁    ▁     ▁▁ █ ▁   ▁ █ █▁ ▁      ▁ ▁         ▁    ▁       ▁  
  ██▁▁▁▁█▁▁▁▁▁██▁█▁█▁▁▁█▁█▁██▁█▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁█ ▁
  251 ms           Histogram: frequency by time          279 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

You can sprinkle explicit bounds-checks and @inbounds if you want. That brings it down to 175ms for me.

stevengj · June 21, 2024, 11:26am

In principle, the right thing to do should be S = Diagonal(r) * A * Diagonal(c), since Diagonal preserves the special structure of a diagonal matrix but spdiagm does not (it stores the diagonal matrix in a generic sparse-matrix data structure).

Confusingly, however, this seems to be slower on my machine despite the fact that SparseArrays has specialized Diagonal methods.

julia> @btime spdiagm($r) * $A * spdiagm($c);
  2.594 ms (60 allocations: 5.35 MiB)

julia> @btime Diagonal($r) * $A * Diagonal($c);
  334.934 ms (22 allocations: 1.49 GiB)

(Notice also the huge allocations, even though the result is a SparseMatrixCSC.) Not sure what is going on here, but it seems fixable?

julia> @btime scaleTheThing!(copy($A),$r,$c);
  294.107 μs (6 allocations: 1.61 MiB)

Yes, this is definitely faster, and it’s nice that this is possible to implement in Julia, but it seems like using Diagonal should be within a factor of two or so of this, even if it is implemented as two separate multiplications (rather than a specialized 3-term multiplication).

gdalle · June 21, 2024, 12:05pm

I seem to recall I had tried the Diagonal way in my use case but found the same slowness, hence the manual hack. In my view this does indeed belong in the stdlib

DNF · June 21, 2024, 3:57pm

There seems to be a weird problem here:

julia> n = 10_000; A = sprand(n,n,2/n); r = rand(n);

julia> @btime $A .*= $r';
  651.600 μs (26 allocations: 967.91 KiB)

julia> @btime $A .*= $r;
  150.513 ms (14 allocations: 633.83 KiB)

This should really work with just, r .* A .* c', and apparently, it’s the column multiplication that causes out-of-memory.

pitsianis · June 21, 2024, 11:20pm

Thank you all for your help and ideas. I think we all agree that .* should work with sparse matrices as it works with dense matrices but without running out of memory. I have opened this issue.

Dan · June 22, 2024, 4:24pm

Wanted to add a couple of methods to scale rows or columns on a sparse Matrix, which I came up with while experimenting.

scaleTheThing! is considered a good baseline for optimization. The separate scaling of rows and columns below add up to abouth the same as the complete call to scaleTheThing!, so these can be helpful in the eventual implementation in SparseArrays:

First a bit of prep:

julia> import .Iterators as It

julia> using SparseArrays, BenchmarkTools

julia> n = 10_000; A = sprand(n,n,2/n); r = rand(n);

julia> function scaleTheThing!(A,R,C)
          for col = 1:A.n
             c = C[col]
             for ptr = A.colptr[col]:(A.colptr[col+1]-1)
                A.nzval[ptr] = (R[A.rowval[ptr]] * A.nzval[ptr]) * c
             end
          end
          A
       end
scaleTheThing! (generic function with 1 method)

Now to the benchmark. These values are, of course, specific to machine and Julia 1.10.2 which I’m currently using.

julia> @btime for i in eachindex(M.nzval) M.nzval[i] *= ($r)[M.rowval[i]] ; end (setup = M = copy(A));
  9.714 μs (0 allocations: 0 bytes)

julia> @btime for (i,j) in enumerate(It.flatten(It.repeated(i, M.colptr[i+1] - M.colptr[i]) for i in axes(M,2)))
       M.nzval[i] *= $r[j]
       end (setup =  M = copy(A))
  43.120 μs (0 allocations: 0 bytes)

julia> @btime scaleTheThing!(M,$r,$r) (setup = M = copy(A));
  55.861 μs (1 allocation: 48 bytes)

separate scaling = 9.7μs + 43.1μs ~ 53μs ~< 55.9μs = scaleTheThing!

DNF · June 22, 2024, 5:45pm

Dan:

function scaleTheThing!(A,R,C)
          for col = 1:A.n
             c = C[col]
             for ptr = A.colptr[col]:(A.colptr[col+1]-1)
                A.nzval[ptr] = (R[A.rowval[ptr]] * A.nzval[ptr]) * c
             end
          end
          A
       end

Isn’t this thing heavily using internal parts of sparse arrays? Is this really part of the API?

Normally, one would strongly discourage accessing internal fields, as it makes the code more fragile and less composable, and if it’s shared with others, it contributes to degrading the entire ecosystem.

pitsianis · June 22, 2024, 6:08pm

Then the only alternative will be to use findnz(A) . Is there a way to get an iterator out of findnz(A) so that we won’t need to allocate for the COO triplets of A?

Dan · June 22, 2024, 6:42pm

Yes, it does. The intention is for code like this to be added inside SparseArrays.

kevbonham · June 22, 2024, 8:09pm

I don’t think it’s the same, since this is all multiplication, but could it be related to Broadcast operations across multiple dimensions materialize zeros in sparse matrices · Issue #25 · JuliaSparse/SparseArrays.jl · GitHub?

foobar_lv2 · June 23, 2024, 1:27pm

I wouldn’t worry too much about that kind of thing.

You need to consider what kind of code you’re writing.

If your code is intended to be used for a project by few people and then be abandoned – who cares?

If your code is intended to be used by many people/projects for a long time, and you plan to actively continue to develop and maintain it – then this is a kind of tech debt you can gladly take on, and adapt to upstream changes in the future.

If your code is intended to be used by many people/projects for a long time without active maintenance, then you need to get it right. Then you need to think about how stable the internal API you’re using is. Look at the git history for that.

This specific thing – internal fields of sparseMatrixCSC – is quite stable. I just looked at it in git; last layout change was from v0.1 0 → 0.2, more than 10 years ago. It is imo unlikely that this will ever change again. (I wouldn’t be too surprised if the internal fields changed from Array->Memory in the future, but that would be compatible with scaleThatThing!)

DNF · June 23, 2024, 2:01pm

I just have a really visceral reaction to this kind of thing.

But anyway, it seemed to me like the correct solution is to fix broadcasted multiplication with column vector.

stevengj · June 23, 2024, 6:06pm

I would speed up multiplication with Diagonal first — that should be a lot easier than mucking with the broadcast machinery, especially since it already has optimized mul! methods (that seem to be mysteriously slow here). It’s also more natural than broadcast from a linear-algebra perspective.

ericphanson · June 23, 2024, 6:33pm

It’s clearer to just not use internals though:

function scaleTheThing2!(A, R, C)
   rows = rowvals(A)
   vals = nonzeros(A)
   n = size(A, 2)
   for j in 1:n
      c = C[j]
      for i in nzrange(A, j)
         row = rows[i]
         vals[i] *= R[row] * c
      end
   end
   A
end

It seems slightly faster also, though that might just be a benchmarking artifact:

julia> @btime scaleTheThing!(M,$r,$r) (setup = M = copy(A));
  46.708 μs (1 allocation: 48 bytes)

julia> @btime scaleTheThing2!(M,$r,$r) (setup = M = copy(A));
  42.375 μs (1 allocation: 48 bytes)

julia> scaleTheThing2!(copy(A), r, r) ≈ scaleTheThing!(copy(A), r, r)
true

BTW whenever writing sparse code, I do ?nzrange and it writes out the basic loop you need, pretty handy.

edit: adding some boundschecks and @inbounds speeds it up further for me:

scaleTheThing3!

function scaleTheThing3!(A, R, C)
   m, n = size(A)
   @boundscheck begin
      checkbounds(R, 1:m)
      checkbounds(C, 1:n)
      checkbounds(A, 1:m, 1:n)
   end
   rows = rowvals(A)
   vals = nonzeros(A)
   @inbounds for j in 1:n
      c = C[j]
      for i in nzrange(A, j)
         row = rows[i]
         vals[i] *= R[row] * c
      end
   end
   A
end

julia> @btime scaleTheThing3!(M,$r,$r) (setup = M = copy(A));
  27.167 μs (1 allocation: 48 bytes)

foobar_lv2 · June 23, 2024, 7:40pm

This is indeed much clearer than my proposal! I was not aware of these quite nice helper-functions.

Dan · June 23, 2024, 8:24pm

What is this 1 allocation and can we get rid of it?

Topic		Replies	Views
Potential performance regression wrt 0.6: Products involving diagonal matrices MUCH slower than with general sparse matrices Performance	7	856	September 21, 2019
Sparse matrix-vector product: much more slow than Matlab Performance matlab , optimization	24	4639	December 20, 2017
Asymmetric speed of in-place `sparse*dense` matrix product General Usage	7	1559	November 8, 2018
Speed comparison: Sparse matrix multiplication vs usual matrix multiplication General Usage question	7	3943	February 13, 2017
How to efficiently construct a large SparseArray? Packages for this? Performance package , performance , parallel , sparse	20	1840	May 15, 2022

Scaling a sparse matrix row-wise and column-wise too slow

Related topics