Possible speedup in matrix-diagonal products?

Right multiplication by a Diagonal matrix effectively scales the columns of a matrix. Currently the default Matrix-Diagonal product appears to use copy_similar:

(*)(A::AbstractMatrix, D::Diagonal) =
    rmul!(copy_similar(A, promote_op(*, eltype(A), eltype(D.diag))), D)

which seems a bit wasteful. Wouldn’t an uninitialized matrix be a better destination here? Below I compare a naive implementation of this product, with and without bounds checking disabled, and the present implementation turns out to be slower than even the version that keeps bounds checking enabled.
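As a quick sanity check of the column-scaling claim (a toy example, separate from the benchmark):

julia> using LinearAlgebra

julia> B = [1.0 2.0; 3.0 4.0]; d = [10.0, 100.0];

julia> B * Diagonal(d) == B .* permutedims(d)  # column j is scaled by d[j]
true

Now for the comparison: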

julia> A = ones(3000, 2000); D = Diagonal(ones(2000));

julia> @btime $A * $D;
  42.105 ms (4 allocations: 45.78 MiB)

julia> Base.@propagate_inbounds function muldiag(A::Matrix{T}, D::Diagonal{R}) where {R,T}
           @assert size(A,2) == size(D,1) "sizes are incompatible"
           C = similar(A, promote_type(T, R)) # some promotion operation
           d = diag(D);
           for (j, dj) in enumerate(d), i in axes(C, 1) 
               C[i, j] = dj * A[i, j]
           end
           return C
       end
muldiag (generic function with 1 method)

julia> @btime muldiag($A, $D);
  36.357 ms (3 allocations: 45.79 MiB)

julia> @btime @inbounds muldiag($A, $D);
  28.553 ms (3 allocations: 45.79 MiB)

julia> A * D == muldiag(A, D)
true

Perhaps there are instances where the copy is essential, but it seems to be an expensive operation:

julia> @btime LinearAlgebra.copy_similar($A, Float64);
  34.992 ms (2 allocations: 45.78 MiB)

and it appears that we may be able to do without it after all?
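A plausible explanation for the gap is memory traffic: the current implementation makes two full sweeps over the output, while the fused loop in muldiag touches each element only once. Schematically (an illustration of the access pattern, not the actual stdlib code):

B = LinearAlgebra.copy_similar(A, Float64)  # sweep 1: read A, write the copy B
rmul!(B, D)                                 # sweep 2: read B, scale column j by D.diag[j], write B

whereas muldiag reads A once and writes each output entry exactly once.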

Using mul! instead of rmul! seems to resolve this:

julia> muldiag(A, D) = mul!(similar(A, promote_type(eltype(A), eltype(D))), A, D);

julia> @btime muldiag($A, $D);
  28.734 ms (4 allocations: 45.78 MiB)
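If the destination can be preallocated and reused across calls, even that allocation disappears (a sketch, assuming the element types already match):

julia> C = similar(A, promote_type(eltype(A), eltype(D)));  # allocate once

julia> mul!(C, A, D);  # in-place A*D; no allocations on repeated calls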

I’ve created a PR to add this for matrices of numbers.


I wonder what the status quo is, since I do not see a specialized dispatch, and I can write a simple function that outperforms the default in LinearAlgebra.jl. @jishnub

julia> using LinearAlgebra

julia> function m1(a, D) return a * D end;

julia> function m2(a, d) return a .* transpose(d) end;

julia> N = 17000;

julia> a = rand(-9:9, N, N); d = rand(1:9, N); D = Diagonal(d);

julia> @time m1(a, D);
  0.569151 seconds (125.66 k allocations: 2.160 GiB, 9.07% gc time, 10.87% compilation time)

julia> @time m2(a, d);
  0.602701 seconds (403.15 k allocations: 2.173 GiB, 0.57% gc time, 24.05% compilation time)

julia> a = rand(-9:9, N, N); d = rand(1:9, N); D = Diagonal(d);

julia> @time m1(a, D);
  0.702514 seconds (3 allocations: 2.153 GiB, 30.04% gc time)

julia> @time m2(a, d);
  0.488113 seconds (3 allocations: 2.153 GiB, 0.36% gc time)

julia> a = rand(-9:9, N, N); d = rand(1:9, N); D = Diagonal(d);

julia> @time m1(a, D);
  0.699434 seconds (3 allocations: 2.153 GiB, 30.72% gc time)

julia> @time m2(a, d);
  0.454129 seconds (3 allocations: 2.153 GiB, 0.17% gc time)

julia> a = rand(-9:9, N, N); d = rand(1:9, N); D = Diagonal(d);

julia> @time m1(a, D);
  0.689136 seconds (3 allocations: 2.153 GiB, 31.86% gc time)

julia> @time m2(a, d);
  0.480376 seconds (3 allocations: 2.153 GiB, 0.17% gc time)

julia> @which a * D
*(A::AbstractMatrix, B::AbstractMatrix)
     @ LinearAlgebra K:\julia-1.11.5\share\julia\stdlib\v1.11\LinearAlgebra\src\matmul.jl:112

Use @btime from BenchmarkTools.jl instead of @time, and make sure to interpolate the input values:

julia> using LinearAlgebra, BenchmarkTools

julia> N = 17000; a = rand(-9:9, N, N); d = rand(1:9, N); D = Diagonal(d);

julia> m1(a, D) = a * D
m1 (generic function with 1 method)

julia> m2(a, d) = a .* transpose(d)
m2 (generic function with 1 method)

julia> @btime m1($a, $D);
  649.763 ms (3 allocations: 2.15 GiB)

julia> @btime m2($a, $d);
  960.526 ms (3 allocations: 2.15 GiB)
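For what it’s worth, the two functions do compute the same thing here (integer arithmetic, so the comparison is exact):

julia> m1(a, D) == m2(a, d)
true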

You’d have to dig deeper into the call chain to find the specialized methods. I believe they are here: LinearAlgebra.jl/src/diagonal.jl at 7f354f40a26bacbbe9ff747c9a29d1ae08eb5193 · JuliaLang/LinearAlgebra.jl · GitHub


Thanks, I’ve now learned about that package.

“dig deeper into the call chain”

I really want to, but I’m having some trouble with the Debugger package, and @which only reveals the top layer. Are you aware of any alternative ways to explore the call chain?

Cthulhu.jl might work.
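For example (a minimal sketch, assuming Cthulhu.jl is installed; @descend interactively steps into the typed code of each call):

julia> using Cthulhu

julia> @descend m1(a, D)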
