Speeding up elementwise Vector-SparseMatrixCSC multiplication broadcasting

bcsj · April 13, 2022, 2:24pm

I noticed today a bottleneck in my code which was causing significant performance loss.

Given a vector b and a sparse matrix A, let b \odot A be the elementwise multiplication of b on each column of A. An example in Julia could look like this:

using SparseArrays
n = 3
b = collect(1:n)
A = sparse(collect(1:n), collect(1:n), rand(1:n))
b .* A

It seems however, that this is not natively handled very well as seen by cranking n up.

@time begin
   n = 3
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
0.000028 seconds (28 allocations: 2.375 KiB)

@time begin
   n = 10000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
0.247870 seconds (51 allocations: 1.491 GiB, 25.08% gc time)

@time begin
   n = 100000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   b .* A
end
ERROR: OutOfMemoryError()

However, if I define the following function, which computes the same thing, I can do much better. It’s significantly faster and uses much less memory.

function fun(b::Vector, A::SparseMatrixCSC)
   B = copy(A)
   B.nzval .= B.nzval .* b[B.rowval]
   return B
end

@time begin
   n = 100000
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   fun(b, A)
end
0.005198 seconds (43 allocations: 11.446 MiB)

@time begin
   n = Int(1e6)
   b = collect(1:n)
   A = sparse(collect(1:n), collect(1:n), rand(1:n))
   fun(b, A)
end
0.068245 seconds (43 allocations: 114.443 MiB, 7.60% gc time)

Is there a reason .* does not behave internally in this way? Can I overload it myself to make it behave in this way? I looked a bit at Base.broadcast and I have made custom broadcasting for my own callable struct types before, in places where I could gain significant speed-ups by computing the elementwise operations differently than a naive elementwise function call. But I couldn’t seem to figure out how overloading broadcasting for * is supposed to be expressed.

goerch · April 13, 2022, 3:51pm

It looks like this is one of the more sane use cases for sparse matrix broadcasts, so shouldn’t it be supported by SparseArrays?

bcsj · April 13, 2022, 4:19pm

Thanks, I’m always happy to be affirmed in my sanity!

That aside, yes, SparseArrays is of course where I’d put it too idealy.

goerch · April 13, 2022, 4:29pm

Sorry, I didn’t mean you: I saw some issues on github discussing broadcast between sparse and dense matrices. So it is surprising to me that the basic operation you checked isn’t working appropriately. Ah, wait a second: the proposed optimization doesn’t work for addition? Then you probably should really special case it in your code?

bcsj · April 13, 2022, 4:36pm

Haha, all good.

Yes, you are right, this optimization is exclusively useful for multiplication. Should probably be in the post title, I’ll edit it in.

Broadcasting a vector elementwise addition to each column would result in a mostly dense matrix, unless the vector itself was very sparse.

stevengj · April 13, 2022, 5:04pm

You could also express this as Diagonal(b) * A (using the Diagonal type from LinearAlgebra), since SparseArrays has a specialized method for this product.

Sukera · April 13, 2022, 5:33pm

The difference lies in b .* A vs. B.nzval .= B.nzval .* b[B.rowval] - the former manifests the full, dense matrix (with all zero elements as well), while the latter only saves what’s required - the nonzeros:

julia> b .* A |> typeof
Matrix{Int64} (alias for Array{Int64, 2})

julia> fun(b, A) |> typeof
SparseMatrixCSC{Int64, Int64}

You can probably avoid that full allocation if you pass in a sparse array as a receiving buffer and do B .= b .* A, since that may specialize on the setindex! during broadcasting.

By the way, there’s no need to collect your ranges - they are already <: AbstractVector and behave like their “full” counterparts, while taking up a miniscule fraction of their memory.

bcsj · April 13, 2022, 5:40pm

I’m not sure that is correct. On my machine the output type remains sparse.

julia> using SparseArrays
julia> n = 3
julia> b = collect(1:3)
3-element Array{Int64,1}:
 1
 2
 3
julia> a = sparse(collect(1:n),collect(1:n),rand(1:n))
3×3 SparseMatrixCSC{Int64,Int64} with 3 stored entries:
  [1, 1]  =  2
  [2, 2]  =  2
  [3, 3]  =  2
julia> b .* a
3×3 SparseMatrixCSC{Int64,Int64} with 3 stored entries:
  [1, 1]  =  2
  [2, 2]  =  4
  [3, 3]  =  6
julia> b .* a |> typeof
SparseMatrixCSC{Int64,Int64}

bcsj · April 13, 2022, 5:50pm

That does seem like a good work around solution. Also has marginally better performance compared to my little function.

Sukera · April 13, 2022, 6:34pm

That’s very interesting! I’m on a recent master, so either this is a regression or intentional

bcsj · April 13, 2022, 7:08pm

Very curious indeed. The code I ran in response to you comment was on my private laptop which runs an older 1.5.2 (2020-09-23). But I’m 99% confident my work laptop, which runs a later version, though not the very latest, also remained sparse in output.

axsk · December 12, 2024, 9:45am

This still is not working out of the box, right?

Topic		Replies	Views
`sparsevec` operations General Usage	2	539	April 1, 2017
SparseMatrix is very slow General Usage question	9	2010	March 31, 2020
Very sparse matrix but still out-of-memory? General Usage	6	1279	June 19, 2018
Asymmetric speed of in-place `sparse*dense` matrix product General Usage	7	1559	November 8, 2018
Scaling a sparse matrix row-wise and column-wise too slow Performance broadcast , sparse	20	520	June 23, 2024

Speeding up elementwise Vector-SparseMatrixCSC multiplication broadcasting

Related topics