Diagonal elements of matrix product

e3c6 · November 23, 2019, 12:29pm

I have two matrices A and B, and I need to compute the diagonal elements of the product A*B as fast as possible, and store them in a pre-allocated vector.

What’s the fastest way to do this? I mean faster than writing my own loop (i.e., maybe hitting an appropriate BLAS routine, restructuring the input if needed).

simeonschaub · November 23, 2019, 12:44pm

You can use the eachrow and eachcol iterators to compute the dot products, which form the result’s diagonal.

using LinearAlgebra
A = rand(4, 4)
B = rand(4, 4)
v = similar(A, 4)

v .= dot.(eachrow(A), eachcol(B))

dot should use the corresponding strided BLAS routine.

e3c6 · November 23, 2019, 1:17pm

Another possibility is:

tmp .= A .* B'
sum!(result, tmp)

where tmp and result are pre-allocated and of appropriate sizes

e3c6 · November 23, 2019, 1:21pm

Some benchmarks.

julia> A = randn(10,10); B = randn(10, 10);

julia> tmp = randn(10, 10);

julia> function f1!(tmp, result, A, B)
       tmp .= A .* B'
       sum!(result, tmp)
       end

julia> function f2!(result, A, B)
       result .= dot.(eachrow(A), eachcol(B))
       end

julia> @btime f1!($tmp, $result, $A, $B);
  337.409 ns (0 allocations: 0 bytes)

julia> @btime f2!($result, $A, $B);
  784.857 ns (26 allocations: 1.34 KiB)

simeonschaub · November 23, 2019, 1:30pm

Are your matrices in your use case always only 10x10? For small sizes like this, your way is faster, but as you increase the size of your matrices, f2! should be a lot more efficient than f1!, as it will scale by O(n^2) instead of O(n^3).

ericphanson · November 23, 2019, 1:46pm

Why O(n^3)? It looks like the exact same operations additions and multiplications to me, just in a different order (Note also the multiplication is broadcasted).

KalelR · November 1, 2022, 10:05am

Hey

I ran now into this same issue, with a new spin: I need a fast way to obtain the diagonal elements only of a matrix multiplication for sparse matrices. Would really appreciate some help!

My problem currently is that the solutions proposed here seem to allocate: function f2! allocates even for normal (dense) matrices; and both f1 and f2 allocate when the matrices are sparse. My test is:

using BenchmarkTools, LinearAlgebra, Test, SparseArrays

function f1!(tmp, result, A, B)
    tmp .= A .* B'
    sum!(result, tmp)
    nothing
end

function f2!(result, A, B)
    result .= dot.(eachrow(A), eachcol(B))
    nothing
end

@testset "Diagonal of matrix multiplication" begin
    N = 1000
    @testset "Dense case" begin
        A = randn(N,N); B = randn(N,N);
        tmp = randn(N, N); result = randn(N);

        @btime f1!($tmp, $result, $A, $B);
        result_1 = deepcopy(result);

        @btime f2!($result, $A, $B);
        result_2 = deepcopy(result);

        @test result_1 ≈ result_2
    end

    @testset "Sparse case" begin
        d = 1e-4
        sA = sprand(N,N, d); sB = sprand(N,N, d);
        stmp = sA * sB'; sresult = randn(N);

        @btime f1!($stmp, $sresult, $sA, $sB);
        result_1s = deepcopy(sresult);

        @btime f2!($sresult, $sA, $sB);
        result_2s = deepcopy(sresult);

        @test result_1s ≈ result_2s
    end
end

Output for me is:

  3.139 ms (0 allocations: 0 bytes)
  2.137 ms (4 allocations: 78.22 KiB)
  12.051 μs (9 allocations: 17.78 KiB)
  22.253 μs (4 allocations: 140.72 KiB)

This multiplication is the main bottleneck in my code, as it is needed in a loop with several iterations. I have lots of memory available, and want speed. Would these allocations then be concerning? Are there improvements I can make?

Thanks a lot!

jishnub · November 1, 2022, 10:56am

These days, you may use Tullio, which seems faster on my laptop, and doesn’t allocate either:

julia> using Tullio

julia> function f3!(result, A, B)
           @tullio result[i] = A[i,j]*B[j,i]
           nothing
       end
f3! (generic function with 1 method)

julia> @btime f1!($tmp, $result, $A, $B);
  2.642 ms (0 allocations: 0 bytes)

julia> @btime f2!($result, $A, $B);
  1.775 ms (4 allocations: 78.22 KiB)

julia> @btime f3!($result, $A, $B);
  1.575 ms (0 allocations: 0 bytes)

I have not checked this for sparse matrices though

It seems removing the broadcasting gets rid of the allocations in f2:

julia> function f2!(result, A, B)
           for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
               result[ind] = dot(r, c)
           end
           nothing
       end
f2! (generic function with 1 method)

julia> @btime f2!($result, $A, $B);
  1.728 ms (0 allocations: 0 bytes)

KalelR · November 1, 2022, 1:16pm

Oh that’s really cool! Thanks for sharing! Tullio is indeed faster than the others for the whole range of N’s I tried (10 to 10000) for dense matrices.

But for sparse matrices it is much slower, I’m unsure why : (
For N = 1000, for instance, Tullio takes 3ms while f2! takes 47 μs.

Dan · November 1, 2022, 1:47pm

The reason f3! isn’t as fast as f2! is that Tullio doesn’t take advantage of sparsness. dot(...) in f2! is specialized for sparse matrices.
But sparse tricks are not symmetric with respect to columns vs. rows, therefore if you can generate the B matrix in a transposed form, you can achieve another 2x-ish speedup using:

function f2b!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachcol(A), eachcol(B))
        result[ind] = dot(r, c)
    end
    nothing
end

In a benchmark with sparse matrices with 1% of elements nonzero, I get the following:

julia> @btime f2!($result, $M1, $M2)
  253.127 μs (0 allocations: 0 bytes)

julia> @btime f2b!($result, $M1, $M2)
  123.347 μs (0 allocations: 0 bytes)

Of course, the more the matrices are sparse, the more it helps.

KalelR · November 2, 2022, 10:53am

Ah, thanks for the explanation! Could you please share the test you made? For me here the speedup only occurs for dense matrices; for sparse, the transpose method f2b! is considerably slower than f2! but I may be messing something up. I used the function f2b! as


function f2_nobroadcasting_usetranspose!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
        result[ind] = dot(r, c)
    end
    nothing
end

(the code that you shared for yours is using B and eachcol(A), which I don’t think is correct).

Tests

using BenchmarkTools, LinearAlgebra, Test, SparseArrays
using Tullio, LoopVectorization

function f1!(tmp, result, A, B)
    tmp .= A .* B'
    sum!(result, tmp)
    nothing
end

function f2!(result, A, B)
    result .= dot.(eachrow(A), eachcol(B))
    nothing
end

function f2_nobroadcasting!(result, A, B)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
        result[ind] = dot(r, c)
    end
    nothing
end

function f2_nobroadcasting_usetranspose!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
        result[ind] = dot(r, c)
    end
    nothing
end

function f3!(result, A, B)
    @tullio result[i] = A[i,j]*B[j,i]
    nothing
end

@testset verbose = true "Diagonal of matrix multiplication" begin
    N = 10000
    @testset "Dense case" begin
        A = randn(N,N); B = randn(N,N);
        tmp = randn(N, N); result = randn(N);

        @btime f1!($tmp, $result, $A, $B);
        result_1 = deepcopy(result);

        @btime f2!($result, $A, $B);
        result_2 = deepcopy(result);

        @btime f2_nobroadcasting!($result, $A, $B);
        result_2ndb = deepcopy(result);

        Bt = transpose(B)
        @btime f2_nobroadcasting_usetranspose!($result, $A, $Bt);
        result_2ndb_tran = deepcopy(result);

        @btime f3!($result, $A, $B);
        result_3 = deepcopy(result);

        @test result_1 ≈ result_2
        @test result_1 ≈ result_2ndb
        @test result_1 ≈ result_2ndb_tran
        @test result_1 ≈ result_3
    end

    @testset "Sparse case" begin
        d = 1e-4
        sA = sprand(N,N, d); sB = sprand(N,N, d);
        stmp = sA * sB'; sresult = randn(N);

        @btime f1!($stmp, $sresult, $sA, $sB);
        result_1s = deepcopy(sresult);

        @btime f2!($sresult, $sA, $sB);
        result_2s = deepcopy(sresult);

        @btime f2_nobroadcasting!($sresult, $sA, $sB);
        result_2ndbs = deepcopy(sresult);

        sBt = transpose(sB)
        @btime f2_nobroadcasting_usetranspose!($sresult, $sA, $sBt);
        result_2ndb_tran_s = deepcopy(sresult);

        @btime f3!($sresult, $sA, $sB);
        result_3s = deepcopy(sresult);

        @test result_1s ≈ result_2s
        @test result_1s == result_2ndb_tran_s
        @test result_1s == result_2ndbs
        @test result_1s == result_3s
    end
end

Gives

Results

#Dense
##N = 10
# 197.969 ns (0 allocations: 0 bytes)
# 348.294 ns (2 allocations: 992 bytes)
# 221.057 ns (0 allocations: 0 bytes)
# 156.080 ns (0 allocations: 0 bytes) **almost 2x speedup over non-transpose
# 71.441 ns (0 allocations: 0 bytes) ** Tullio
##N=1000
# 3.363 ms (0 allocations: 0 bytes)
# 2.207 ms (4 allocations: 78.22 KiB)
# 2.154 ms (0 allocations: 0 bytes)
# 2.316 ms (0 allocations: 0 bytes)
# 1.625 ms (0 allocations: 0 bytes) ** Tullio
##N = 10000
# 1.219 s (0 allocations: 0 bytes)
# 1.230 s (4 allocations: 781.34 KiB)
# 1.020 s (0 allocations: 0 bytes)
# 1.138 s (0 allocations: 0 bytes)
# 218.114 ms (0 allocations: 0 bytes) ***

#Sparse
##N=10
# 383.941 ns (7 allocations: 544 bytes)
# 282.670 ns (2 allocations: 1.59 KiB)
# 109.996 ns (0 allocations: 0 bytes) **
# 329.524 ns (0 allocations: 0 bytes) (transpose is slower!)
# 300.054 ns (0 allocations: 0 bytes)
##N = 1000
# 11.805 μs (9 allocations: 17.78 KiB)
# 22.150 μs (4 allocations: 140.72 KiB)
# 11.354 μs (0 allocations: 0 bytes) **
# 2.840 ms (0 allocations: 0 bytes)
# 3.092 ms (0 allocations: 0 bytes)
##N = 10000
# 395.792 μs (11 allocations: 311.41 KiB)
# 481.284 μs (4 allocations: 1.37 MiB)
# 346.121 μs (0 allocations: 0 bytes) **
# 1.405 s (0 allocations: 0 bytes)
# 649.908 ms (0 allocations: 0 bytes)

Dan · November 2, 2022, 10:56am

In the f2b! case you are using eachrow twice instead of eachcol twice. That’s the big difference (switch argument positions to make it same result if necessary).

KalelR · November 2, 2022, 11:26am

Ah great! Indeed then there is a speedup! Thanks a lot!

Topic		Replies	Views
Possible speedup in matrix-diagonal products? Performance linearalgebra , matrices	6	644	May 7, 2025
Dot product of multi-dimensional arrays New to Julia linearalgebra	5	2607	February 11, 2022
Fast `diag(A' * B * A)` General Usage performance , linearalgebra	22	871	May 4, 2023
Performance of computing the diagonal of V' * A *V Performance blas , linearalgebra	10	763	March 31, 2021
Why for-loop of vector multiplication is slower that dot product? General Usage question	11	455	August 11, 2022

Diagonal elements of matrix product

Related topics