Diagonal elements of matrix product

I have two matrices A and B, and I need to compute the diagonal elements of the product A*B as fast as possible, and store them in a pre-allocated vector.

What’s the fastest way to do this? I mean faster than writing my own loop (i.e., maybe hitting an appropriate BLAS routine, restructuring the input if needed).

You can use the eachrow and eachcol iterators to compute the dot products, which form the result’s diagonal.

using LinearAlgebra
A = rand(4, 4)
B = rand(4, 4)
v = similar(A, 4)

v .= dot.(eachrow(A), eachcol(B))

dot should use the corresponding strided BLAS routine.

4 Likes

Another possibility is:

tmp .= A .* B'
sum!(result, tmp)

where tmp and result are pre-allocated and of appropriate sizes

Some benchmarks.

julia> A = randn(10,10); B = randn(10, 10);

julia> tmp = randn(10, 10);

julia> function f1!(tmp, result, A, B)
       tmp .= A .* B'
       sum!(result, tmp)
       end

julia> function f2!(result, A, B)
       result .= dot.(eachrow(A), eachcol(B))
       end

julia> @btime f1!($tmp, $result, $A, $B);
  337.409 ns (0 allocations: 0 bytes)

julia> @btime f2!($result, $A, $B);
  784.857 ns (26 allocations: 1.34 KiB)
1 Like

Are your matrices in your use case always only 10x10? For small sizes like this, your way is faster, but as you increase the size of your matrices, f2! should be a lot more efficient than f1!, as it will scale by O(n^2) instead of O(n^3).

1 Like

Why O(n^3)? It looks like the exact same operations additions and multiplications to me, just in a different order (Note also the multiplication is broadcasted).

1 Like

Hey

I ran now into this same issue, with a new spin: I need a fast way to obtain the diagonal elements only of a matrix multiplication for sparse matrices. Would really appreciate some help!

My problem currently is that the solutions proposed here seem to allocate: function f2! allocates even for normal (dense) matrices; and both f1 and f2 allocate when the matrices are sparse. My test is:

using BenchmarkTools, LinearAlgebra, Test, SparseArrays

function f1!(tmp, result, A, B)
    tmp .= A .* B'
    sum!(result, tmp)
    nothing
end

function f2!(result, A, B)
    result .= dot.(eachrow(A), eachcol(B))
    nothing
end

@testset "Diagonal of matrix multiplication" begin
    N = 1000
    @testset "Dense case" begin
        A = randn(N,N); B = randn(N,N);
        tmp = randn(N, N); result = randn(N);

        @btime f1!($tmp, $result, $A, $B);
        result_1 = deepcopy(result);

        @btime f2!($result, $A, $B);
        result_2 = deepcopy(result);

        @test result_1 ≈ result_2
    end

    @testset "Sparse case" begin
        d = 1e-4
        sA = sprand(N,N, d); sB = sprand(N,N, d);
        stmp = sA * sB'; sresult = randn(N);

        @btime f1!($stmp, $sresult, $sA, $sB);
        result_1s = deepcopy(sresult);

        @btime f2!($sresult, $sA, $sB);
        result_2s = deepcopy(sresult);

        @test result_1s ≈ result_2s
    end
end

Output for me is:

  3.139 ms (0 allocations: 0 bytes)
  2.137 ms (4 allocations: 78.22 KiB)
  12.051 μs (9 allocations: 17.78 KiB)
  22.253 μs (4 allocations: 140.72 KiB)

This multiplication is the main bottleneck in my code, as it is needed in a loop with several iterations. I have lots of memory available, and want speed. Would these allocations then be concerning? Are there improvements I can make?

Thanks a lot!

These days, you may use Tullio, which seems faster on my laptop, and doesn’t allocate either:

julia> using Tullio

julia> function f3!(result, A, B)
           @tullio result[i] = A[i,j]*B[j,i]
           nothing
       end
f3! (generic function with 1 method)

julia> @btime f1!($tmp, $result, $A, $B);
  2.642 ms (0 allocations: 0 bytes)

julia> @btime f2!($result, $A, $B);
  1.775 ms (4 allocations: 78.22 KiB)

julia> @btime f3!($result, $A, $B);
  1.575 ms (0 allocations: 0 bytes)

I have not checked this for sparse matrices though

It seems removing the broadcasting gets rid of the allocations in f2:

julia> function f2!(result, A, B)
           for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
               result[ind] = dot(r, c)
           end
           nothing
       end
f2! (generic function with 1 method)

julia> @btime f2!($result, $A, $B);
  1.728 ms (0 allocations: 0 bytes)
2 Likes

Oh that’s really cool! Thanks for sharing! Tullio is indeed faster than the others for the whole range of N’s I tried (10 to 10000) for dense matrices.

But for sparse matrices it is much slower, I’m unsure why : (
For N = 1000, for instance, Tullio takes 3ms while f2! takes 47 μs.

1 Like

The reason f3! isn’t as fast as f2! is that Tullio doesn’t take advantage of sparsness. dot(...) in f2! is specialized for sparse matrices.
But sparse tricks are not symmetric with respect to columns vs. rows, therefore if you can generate the B matrix in a transposed form, you can achieve another 2x-ish speedup using:

function f2b!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachcol(A), eachcol(B))
        result[ind] = dot(r, c)
    end
    nothing
end

In a benchmark with sparse matrices with 1% of elements nonzero, I get the following:

julia> @btime f2!($result, $M1, $M2)
  253.127 μs (0 allocations: 0 bytes)

julia> @btime f2b!($result, $M1, $M2)
  123.347 μs (0 allocations: 0 bytes)

Of course, the more the matrices are sparse, the more it helps.

1 Like

Ah, thanks for the explanation! Could you please share the test you made? For me here the speedup only occurs for dense matrices; for sparse, the transpose method f2b! is considerably slower than f2! but I may be messing something up. I used the function f2b! as


function f2_nobroadcasting_usetranspose!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
        result[ind] = dot(r, c)
    end
    nothing
end

(the code that you shared for yours is using B and eachcol(A), which I don’t think is correct).

Tests
using BenchmarkTools, LinearAlgebra, Test, SparseArrays
using Tullio, LoopVectorization

function f1!(tmp, result, A, B)
    tmp .= A .* B'
    sum!(result, tmp)
    nothing
end

function f2!(result, A, B)
    result .= dot.(eachrow(A), eachcol(B))
    nothing
end

function f2_nobroadcasting!(result, A, B)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
        result[ind] = dot(r, c)
    end
    nothing
end

function f2_nobroadcasting_usetranspose!(result, A, Bt)
    for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
        result[ind] = dot(r, c)
    end
    nothing
end

function f3!(result, A, B)
    @tullio result[i] = A[i,j]*B[j,i]
    nothing
end

@testset verbose = true "Diagonal of matrix multiplication" begin
    N = 10000
    @testset "Dense case" begin
        A = randn(N,N); B = randn(N,N);
        tmp = randn(N, N); result = randn(N);

        @btime f1!($tmp, $result, $A, $B);
        result_1 = deepcopy(result);

        @btime f2!($result, $A, $B);
        result_2 = deepcopy(result);

        @btime f2_nobroadcasting!($result, $A, $B);
        result_2ndb = deepcopy(result);

        Bt = transpose(B)
        @btime f2_nobroadcasting_usetranspose!($result, $A, $Bt);
        result_2ndb_tran = deepcopy(result);

        @btime f3!($result, $A, $B);
        result_3 = deepcopy(result);

        @test result_1 ≈ result_2
        @test result_1 ≈ result_2ndb
        @test result_1 ≈ result_2ndb_tran
        @test result_1 ≈ result_3
    end

    @testset "Sparse case" begin
        d = 1e-4
        sA = sprand(N,N, d); sB = sprand(N,N, d);
        stmp = sA * sB'; sresult = randn(N);

        @btime f1!($stmp, $sresult, $sA, $sB);
        result_1s = deepcopy(sresult);

        @btime f2!($sresult, $sA, $sB);
        result_2s = deepcopy(sresult);

        @btime f2_nobroadcasting!($sresult, $sA, $sB);
        result_2ndbs = deepcopy(sresult);

        sBt = transpose(sB)
        @btime f2_nobroadcasting_usetranspose!($sresult, $sA, $sBt);
        result_2ndb_tran_s = deepcopy(sresult);

        @btime f3!($sresult, $sA, $sB);
        result_3s = deepcopy(sresult);

        @test result_1s ≈ result_2s
        @test result_1s == result_2ndb_tran_s
        @test result_1s == result_2ndbs
        @test result_1s == result_3s
    end
end

Gives

Results
#Dense
##N = 10
# 197.969 ns (0 allocations: 0 bytes)
# 348.294 ns (2 allocations: 992 bytes)
# 221.057 ns (0 allocations: 0 bytes)
# 156.080 ns (0 allocations: 0 bytes) **almost 2x speedup over non-transpose
# 71.441 ns (0 allocations: 0 bytes) ** Tullio
##N=1000
# 3.363 ms (0 allocations: 0 bytes)
# 2.207 ms (4 allocations: 78.22 KiB)
# 2.154 ms (0 allocations: 0 bytes)
# 2.316 ms (0 allocations: 0 bytes)
# 1.625 ms (0 allocations: 0 bytes) ** Tullio
##N = 10000
# 1.219 s (0 allocations: 0 bytes)
# 1.230 s (4 allocations: 781.34 KiB)
# 1.020 s (0 allocations: 0 bytes)
# 1.138 s (0 allocations: 0 bytes)
# 218.114 ms (0 allocations: 0 bytes) ***

#Sparse
##N=10
# 383.941 ns (7 allocations: 544 bytes)
# 282.670 ns (2 allocations: 1.59 KiB)
# 109.996 ns (0 allocations: 0 bytes) **
# 329.524 ns (0 allocations: 0 bytes) (transpose is slower!)
# 300.054 ns (0 allocations: 0 bytes)
##N = 1000
# 11.805 μs (9 allocations: 17.78 KiB)
# 22.150 μs (4 allocations: 140.72 KiB)
# 11.354 μs (0 allocations: 0 bytes) **
# 2.840 ms (0 allocations: 0 bytes)
# 3.092 ms (0 allocations: 0 bytes)
##N = 10000
# 395.792 μs (11 allocations: 311.41 KiB)
# 481.284 μs (4 allocations: 1.37 MiB)
# 346.121 μs (0 allocations: 0 bytes) **
# 1.405 s (0 allocations: 0 bytes)
# 649.908 ms (0 allocations: 0 bytes)

In the f2b! case you are using eachrow twice instead of eachcol twice. That’s the big difference (switch argument positions to make it same result if necessary).

2 Likes

Ah great! Indeed then there is a speedup! Thanks a lot!