Today I noticed a bottleneck in my code that was causing a significant performance loss.
Given a vector b and a sparse matrix A, let b ⊙ A denote the elementwise multiplication of b with each column of A, i.e. (b ⊙ A)[i, j] = b[i] * A[i, j]. An example in Julia could look like this:
using SparseArrays
n = 3
b = collect(1:n)
A = sparse(collect(1:n), collect(1:n), rand(1:n))
b .* A
It seems, however, that this is not handled very well natively, as can be seen by cranking n up.
@time begin
    n = 3
    b = collect(1:n)
    A = sparse(collect(1:n), collect(1:n), rand(1:n))
    b .* A
end
0.000028 seconds (28 allocations: 2.375 KiB)
@time begin
    n = 10000
    b = collect(1:n)
    A = sparse(collect(1:n), collect(1:n), rand(1:n))
    b .* A
end
0.247870 seconds (51 allocations: 1.491 GiB, 25.08% gc time)
@time begin
    n = 100000
    b = collect(1:n)
    A = sparse(collect(1:n), collect(1:n), rand(1:n))
    b .* A
end
ERROR: OutOfMemoryError()
Judging by the allocations, the broadcast appears to fall back to dense intermediates (a single dense 10000×10000 matrix of 8-byte values is already ~0.75 GiB on its own). However, if I define the following function, which computes the same thing, I can do much better: it is significantly faster and uses much less memory.
function fun(b::Vector, A::SparseMatrixCSC)
    B = copy(A)
    # B.rowval[k] holds the row index of the stored entry B.nzval[k],
    # so this scales every stored entry by the b value of its row,
    # touching only the nonzeros.
    B.nzval .= B.nzval .* b[B.rowval]
    return B
end
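As an aside, if overwriting A were acceptable, I believe the copy could be dropped entirely. A minimal in-place sketch (the name fun! is mine, and it uses the public accessors nonzeros and rowvals instead of the raw fields):
function fun!(b::Vector, A::SparseMatrixCSC)
    # Scale each stored entry in place by the b value of its row;
    # rowvals(A)[k] is the row index of nonzeros(A)[k].
    nonzeros(A) .*= b[rowvals(A)]
    return A
end
In any case, timing the copying version: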
@time begin
    n = 100000
    b = collect(1:n)
    A = sparse(collect(1:n), collect(1:n), rand(1:n))
    fun(b, A)
end
0.005198 seconds (43 allocations: 11.446 MiB)
@time begin
    n = Int(1e6)
    b = collect(1:n)
    A = sparse(collect(1:n), collect(1:n), rand(1:n))
    fun(b, A)
end
0.068245 seconds (43 allocations: 114.443 MiB, 7.60% gc time)
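As another aside, since scaling row i of A by b[i] is the same as left-multiplying by a diagonal matrix, the product Diagonal(b) * A from LinearAlgebra seems to achieve much the same thing; as far as I can tell it dispatches to a sparse-aware method and the result stays sparse:
using LinearAlgebra
C = Diagonal(b) * A  # same values as b .* A, without densifying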
Is there a reason .* does not behave internally in this way? Can I overload it myself to get this behavior? I looked a bit at Base.broadcast, and I have made custom broadcasting for my own callable struct types before, in places where I could gain significant speed-ups by computing the elementwise operations differently than a naive elementwise function call. But I couldn't figure out how overloading broadcasting for * is supposed to be expressed.
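For concreteness, the kind of overload I have in mind is an eager method along these lines (a sketch only, and admittedly type piracy, since *, Vector, and SparseMatrixCSC all belong to Base and the stdlib):
import Base.Broadcast: broadcasted
# Intercept b .* A before a lazy Broadcasted object is built and route
# it to the sparse-aware kernel above; materialize then passes the
# concrete result through untouched. Note this opts this argument pair
# out of broadcast fusion.
broadcasted(::typeof(*), b::Vector, A::SparseMatrixCSC) = fun(b, A)
But I don't know whether this is the intended mechanism, which is really what I'm asking.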