Speeding up elementwise Vector-SparseMatrixCSC multiplication broadcasting

I noticed today a bottleneck in my code which was causing significant performance loss.

Given a vector b and a sparse matrix A, let b \odot A be the elementwise multiplication of b on each column of A. An example in Julia could look like this:

using SparseArrays
n = 3
b = collect(1:n)
A = sparse(collect(1:n), collect(1:n), rand(1:n))
b .* A

It seems however, that this is not natively handled very well as seen by cranking n up.

@time begin
   n = 3
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
0.000028 seconds (28 allocations: 2.375 KiB)

@time begin
   n = 10000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
0.247870 seconds (51 allocations: 1.491 GiB, 25.08% gc time)

@time begin
   n = 100000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
ERROR: OutOfMemoryError()

However, if I define the following function, which computes the same thing, I can do much better. It’s significantly faster and uses much less memory.

function fun(b::Vector, A::SparseMatrixCSC)
   B = copy(A)
   B.nzval .= B.nzval .* b[B.rowval]
   return B
end

@time begin
   n = 100000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   fun(b, A)
end
0.005198 seconds (43 allocations: 11.446 MiB)

@time begin
   n = Int(1e6)
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   fun(b, A)
end
0.068245 seconds (43 allocations: 114.443 MiB, 7.60% gc time)

Is there a reason .* does not behave internally in this way? Can I overload it myself to make it behave in this way? I looked a bit at Base.broadcast and I have made custom broadcasting for my own callable struct types before, in places where I could gain significant speed-ups by computing the elementwise operations differently than a naive elementwise function call. But I couldn’t seem to figure out how overloading broadcasting for * is supposed to be expressed.

It looks like this is one of the more sane use cases for sparse matrix broadcasts, so shouldn’t it be supported by SparseArrays?

Thanks, I’m always happy to be affirmed in my sanity! :sweat_smile:

That aside, yes, SparseArrays is of course where I’d put it too idealy.

Sorry, I didn’t mean you: I saw some issues on github discussing broadcast between sparse and dense matrices. So it is surprising to me that the basic operation you checked isn’t working appropriately. Ah, wait a second: the proposed optimization doesn’t work for addition? Then you probably should really special case it in your code?

Haha, all good.

Yes, you are right, this optimization is exclusively useful for multiplication. Should probably be in the post title, I’ll edit it in.

Broadcasting a vector elementwise addition to each column would result in a mostly dense matrix, unless the vector itself was very sparse.

You could also express this as Diagonal(b) * A (using the Diagonal type from LinearAlgebra), since SparseArrays has a specialized method for this product.

2 Likes

The difference lies in b .* A vs. B.nzval .= B.nzval .* b[B.rowval] - the former manifests the full, dense matrix (with all zero elements as well), while the latter only saves what’s required - the nonzeros:

julia> b .* A |> typeof
Matrix{Int64} (alias for Array{Int64, 2})

julia> fun(b, A) |> typeof
SparseMatrixCSC{Int64, Int64}

You can probably avoid that full allocation if you pass in a sparse array as a receiving buffer and do B .= b .* A, since that may specialize on the setindex! during broadcasting.

By the way, there’s no need to collect your ranges - they are already <: AbstractVector and behave like their “full” counterparts, while taking up a miniscule fraction of their memory.

I’m not sure that is correct. On my machine the output type remains sparse.

julia> using SparseArrays
julia> n = 3
julia> b = collect(1:3)
3-element Array{Int64,1}:
 1
 2
 3
julia> a = sparse(collect(1:n),collect(1:n),rand(1:n))
3×3 SparseMatrixCSC{Int64,Int64} with 3 stored entries:
  [1, 1]  =  2
  [2, 2]  =  2
  [3, 3]  =  2
julia> b .* a
3×3 SparseMatrixCSC{Int64,Int64} with 3 stored entries:
  [1, 1]  =  2
  [2, 2]  =  4
  [3, 3]  =  6
julia> b .* a |> typeof
SparseMatrixCSC{Int64,Int64}

That does seem like a good work around solution. Also has marginally better performance compared to my little function.

That’s very interesting! I’m on a recent master, so either this is a regression or intentional :thinking:

Very curious indeed. The code I ran in response to you comment was on my private laptop which runs an older 1.5.2 (2020-09-23). But I’m 99% confident my work laptop, which runs a later version, though not the very latest, also remained sparse in output.