Scaling a sparse matrix row-wise and column-wise too slow

I need to scale a sparse matrix row-wise and column-wise. That is

S[i,j] = r[i] * A[i,j] * c[j]

If I write it the way you’d see it in a math book,

S = spdiagm(r) * A * spdiagm(c)

the expression takes about 8x as long as the MATLAB

S = r .* A .* c.'

And if I attempt the same broadcast in Julia, it either runs forever or runs out of memory…

Obviously, S has the same nonzero pattern as A, and one can restrict the indices i and j accordingly, so somewhere between Broadcast and SparseArrays the ball is dropped… Where do I go to get this scaling to run efficiently and competitively?

Here is a MWE to play with

using LinearAlgebra, SparseArrays, BenchmarkTools

n = 10_000_000
A = sprand(n,n,2/n)
r = rand(n,1); c = rand(n,1)
S = r .* A .* c'    # ERROR: OutOfMemoryError()

@benchmark spdiagm($(vec(r))) * $A * spdiagm($(vec(c)))

While in MATLAB

>> A = sprand(10e6,10e6,2/10e6);
>> r = rand(10e6,1);
>> c = rand(10e6,1);
>> timeit(@() r .* A .* c.')

ans =


which is 8.5 times faster than the Julia benchmark code above.

I encountered this problem while working on HiddenMarkovModels.jl. It was a while ago but I think the issue is that at least one of these multiplications results in a dense matrix.
Here’s the code I wrote to get around it:


Please report this issue to SparseArrays.jl, as it looks like some 3-term multiplication methods need to be added there.

Why are you building sparse diagonal matrices from r and c in Julia, but broadcasting with plain arrays in MATLAB? Have you tried

r = rand(n); c = rand(n);  # don't use rand(n, 1)! 
@benchmark $r .* $A .* $c'



Just do it by hand:

function scaleTheThing!(A,R,C)
   for col = 1:A.n
      c = C[col]
      for ptr = A.colptr[col]:(A.colptr[col+1]-1)
         A.nzval[ptr] = (R[A.rowval[ptr]] * A.nzval[ptr]) * c
      end
   end
   return A
end

You can verify correctness by

julia> compare(A,R,C) = scaleTheThing!(copy(A),R,C)
julia> reference(A,R,C)=spdiagm(R)*A*spdiagm(C)
julia> N=1000; M = 500 ;p=0.05; A=sprand(N,M,p); R=rand(N);C=rand(M);
julia> reference(A,R,C) == compare(A,R,C)

You can benchmark:

julia> N=10_000_000; M = N; p=2/N; A=sprand(N,M,p); R=rand(N);C=rand(M);
julia> @benchmark compare($A,$R,$C)
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  312.226 ms … 412.754 ms  ┊ GC (min … max): 0.32% … 19.48%
 Time  (median):     314.742 ms               ┊ GC (median):    0.34%
 Time  (mean ± σ):   331.734 ms ±  32.353 ms  ┊ GC (mean ± σ):  4.49% ±  7.60%

  ██▅▁▁▁▁▅▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  312 ms           Histogram: frequency by time          413 ms <

 Memory estimate: 381.46 MiB, allocs estimate: 9.

julia> @benchmark scaleTheThing!($A,$R,$C)
BenchmarkTools.Trial: 20 samples with 1 evaluation.
 Range (min … max):  251.287 ms … 279.071 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     262.039 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   262.480 ms ±   7.348 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▁    ▁     ▁▁ █ ▁   ▁ █ █▁ ▁      ▁ ▁         ▁    ▁       ▁  
  ██▁▁▁▁█▁▁▁▁▁██▁█▁█▁▁▁█▁█▁██▁█▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁█ ▁
  251 ms           Histogram: frequency by time          279 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

You can sprinkle explicit bounds-checks and @inbounds if you want. That brings it down to 175ms for me.


In principle, the right thing to do should be S = Diagonal(r) * A * Diagonal(c), since Diagonal preserves the special structure of a diagonal matrix but spdiagm does not (it stores the diagonal matrix in a generic sparse-matrix data structure).
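In case it helps, here is a quick sanity check of that claim (a minimal sketch; the sizes and density are arbitrary) showing that Diagonal keeps the result sparse and agrees with the spdiagm version:

```julia
using LinearAlgebra, SparseArrays

n = 1_000
A = sprand(n, n, 5/n)
r = rand(n); c = rand(n)

S = Diagonal(r) * A * Diagonal(c)  # specialized sparse methods should apply
@assert S isa SparseMatrixCSC      # the sparse structure is preserved
@assert S ≈ spdiagm(r) * A * spdiagm(c)
```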

Confusingly, however, this seems to be slower on my machine despite the fact that SparseArrays has specialized Diagonal methods. :confused:

julia> @btime spdiagm($r) * $A * spdiagm($c);
  2.594 ms (60 allocations: 5.35 MiB)

julia> @btime Diagonal($r) * $A * Diagonal($c);
  334.934 ms (22 allocations: 1.49 GiB)

(Notice also the huge allocations, even though the result is a SparseMatrixCSC.) Not sure what is going on here, but it seems fixable?

julia> @btime scaleTheThing!(copy($A),$r,$c);
  294.107 μs (6 allocations: 1.61 MiB)

Yes, this is definitely faster, and it’s nice that this is possible to implement in Julia, but it seems like using Diagonal should be within a factor of two or so of this, even if it is implemented as two separate multiplications (rather than a specialized 3-term multiplication).
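If the intermediate allocation of the two-step route matters, it can also be done in place; a sketch assuming the specialized lmul!/rmul! methods for Diagonal acting on a SparseMatrixCSC:

```julia
using LinearAlgebra, SparseArrays

n = 1_000
A = sprand(n, n, 5/n)
r = rand(n); c = rand(n)

ref = spdiagm(r) * A * spdiagm(c)
lmul!(Diagonal(r), A)  # scale row i by r[i], reusing A's storage
rmul!(A, Diagonal(c))  # scale column j by c[j], reusing A's storage
@assert A ≈ ref
```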


I seem to recall I had tried the Diagonal way in my use case but found the same slowness, hence the manual hack. In my view this does indeed belong in the stdlib


There seems to be a weird problem here:

julia> n = 10_000; A = sprand(n,n,2/n); r = rand(n);

julia> @btime $A .*= $r';
  651.600 μs (26 allocations: 967.91 KiB)

julia> @btime $A .*= $r;
  150.513 ms (14 allocations: 633.83 KiB)

This should really work with just r .* A .* c', and apparently it's the column multiplication that causes the out-of-memory error.


Thank you all for your help and ideas. I think we all agree that .* should work with sparse matrices as it works with dense matrices but without running out of memory. I have opened this issue.


Wanted to add a couple of methods to scale rows or columns on a sparse Matrix, which I came up with while experimenting.

scaleTheThing! is considered a good baseline for optimization. The separate row and column scalings below add up to about the same time as the complete call to scaleTheThing!, so they may be helpful for an eventual implementation in SparseArrays:

First a bit of prep:

julia> import .Iterators as It

julia> using SparseArrays, BenchmarkTools

julia> n = 10_000; A = sprand(n,n,2/n); r = rand(n);

julia> function scaleTheThing!(A,R,C)
          for col = 1:A.n
             c = C[col]
             for ptr = A.colptr[col]:(A.colptr[col+1]-1)
                A.nzval[ptr] = (R[A.rowval[ptr]] * A.nzval[ptr]) * c
             end
          end
          return A
       end
scaleTheThing! (generic function with 1 method)

Now to the benchmark. These values are, of course, specific to machine and Julia 1.10.2 which I’m currently using.

julia> @btime for i in eachindex(M.nzval) M.nzval[i] *= ($r)[M.rowval[i]] ; end (setup = M = copy(A));
  9.714 μs (0 allocations: 0 bytes)

julia> @btime for (i,j) in enumerate(It.flatten(It.repeated(i, M.colptr[i+1] - M.colptr[i]) for i in axes(M,2)))
       M.nzval[i] *= $r[j]
       end (setup =  M = copy(A))
  43.120 μs (0 allocations: 0 bytes)

julia> @btime scaleTheThing!(M,$r,$r) (setup = M = copy(A));
  55.861 μs (1 allocation: 48 bytes)

separate scaling: 9.7 μs + 43.1 μs ≈ 53 μs ≲ 55.9 μs = scaleTheThing!

Isn’t this thing heavily using internal parts of sparse arrays? Is this really part of the API?

Normally, one would strongly discourage accessing internal fields, as it makes the code more fragile and less composable, and if it’s shared with others, it contributes to degrading the entire ecosystem.


Then the only alternative will be to use findnz(A) . Is there a way to get an iterator out of findnz(A) so that we won’t need to allocate for the COO triplets of A?
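Not that I know of a built-in lazy findnz, but a generator over the public accessors gets close; a sketch (stored_triplets is a made-up name, not SparseArrays API):

```julia
using SparseArrays

# Lazily yield (i, j, v) for each stored entry, in the same column-major
# order as findnz, without materializing the three COO vectors.
stored_triplets(A::SparseMatrixCSC) =
    ((rowvals(A)[k], j, nonzeros(A)[k]) for j in axes(A, 2) for k in nzrange(A, j))

A = sprand(10, 10, 0.3)
I, J, V = findnz(A)
@assert collect(stored_triplets(A)) == collect(zip(I, J, V))
```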

Yes, it does. The intention is for code like this to be added inside SparseArrays.


I don’t think it’s the same, since this is all multiplication, but could it be related to Broadcast operations across multiple dimensions materialize zeros in sparse matrices · Issue #25 · JuliaSparse/SparseArrays.jl · GitHub?

I wouldn’t worry too much about that kind of thing.

You need to consider what kind of code you’re writing.

If your code is intended to be used by a few people for one project and then abandoned – who cares?

If your code is intended to be used by many people/projects for a long time, and you plan to actively continue to develop and maintain it – then this is a kind of tech debt you can gladly take on, and adapt to upstream changes in the future.

If your code is intended to be used by many people/projects for a long time without active maintenance, then you need to get it right. Then you need to think about how stable the internal API you’re using is. Look at the git history for that.

This specific thing – the internal fields of SparseMatrixCSC – is quite stable. I just looked at it in git; the last layout change was from v0.1 → v0.2, more than 10 years ago. It is IMO unlikely that this will ever change again. (I wouldn't be too surprised if the internal fields changed from Array to Memory in the future, but that would be compatible with scaleTheThing!)

I just have a really visceral reaction to this kind of thing.

But anyway, it seemed to me like the correct solution is to fix broadcasted multiplication with a column vector.


I would speed up multiplication with Diagonal first — that should be a lot easier than mucking with the broadcast machinery, especially since it already has optimized mul! methods (that seem to be mysteriously slow here). It’s also more natural than broadcast from a linear-algebra perspective.


It’s clearer to just not use internals though:

function scaleTheThing2!(A, R, C)
   rows = rowvals(A)
   vals = nonzeros(A)
   n = size(A, 2)
   for j in 1:n
      c = C[j]
      for i in nzrange(A, j)
         row = rows[i]
         vals[i] *= R[row] * c
      end
   end
   return A
end

It seems slightly faster also, though that might just be a benchmarking artifact:

julia> @btime scaleTheThing!(M,$r,$r) (setup = M = copy(A));
  46.708 μs (1 allocation: 48 bytes)

julia> @btime scaleTheThing2!(M,$r,$r) (setup = M = copy(A));
  42.375 μs (1 allocation: 48 bytes)

julia> scaleTheThing2!(copy(A), r, r) ≈ scaleTheThing!(copy(A), r, r)

BTW whenever writing sparse code, I do ?nzrange and it writes out the basic loop you need, pretty handy.

edit: adding some boundschecks and @inbounds speeds it up further for me:

function scaleTheThing3!(A, R, C)
   m, n = size(A)
   @boundscheck begin
      checkbounds(R, 1:m)
      checkbounds(C, 1:n)
      checkbounds(A, 1:m, 1:n)
   end
   rows = rowvals(A)
   vals = nonzeros(A)
   @inbounds for j in 1:n
      c = C[j]
      for i in nzrange(A, j)
         row = rows[i]
         vals[i] *= R[row] * c
      end
   end
   return A
end

julia> @btime scaleTheThing3!(M,$r,$r) (setup = M = copy(A));
  27.167 μs (1 allocation: 48 bytes)

This is indeed much clearer than my proposal! I was not aware of these quite nice helper functions.


What is this 1 allocation and can we get rid of it?