# Diagonal elements of matrix product

I have two matrices `A` and `B`, and I need to compute the diagonal elements of the product `A*B` as fast as possible, and store them in a pre-allocated vector.

What’s the fastest way to do this? I mean faster than writing my own loop (i.e., maybe hitting an appropriate BLAS routine, restructuring the input if needed).

You can use the `eachrow` and `eachcol` iterators to compute the dot products, which form the result’s diagonal.

``````using LinearAlgebra
A = rand(4, 4)
B = rand(4, 4)
v = similar(A, 4)

v .= dot.(eachrow(A), eachcol(B))
``````

`dot` should use the corresponding strided BLAS routine.

3 Likes

Another possibility is:

``````tmp .= A .* B'
sum!(result, tmp)
``````

where `tmp` and `result` are pre-allocated and of appropriate sizes

Some benchmarks.

``````julia> A = randn(10,10); B = randn(10, 10);

julia> tmp = randn(10, 10);

julia> function f1!(tmp, result, A, B)
tmp .= A .* B'
sum!(result, tmp)
end

julia> function f2!(result, A, B)
result .= dot.(eachrow(A), eachcol(B))
end

julia> @btime f1!(\$tmp, \$result, \$A, \$B);
337.409 ns (0 allocations: 0 bytes)

julia> @btime f2!(\$result, \$A, \$B);
784.857 ns (26 allocations: 1.34 KiB)
``````
1 Like

Are your matrices in your use case always only 10x10? For small sizes like this, your way is faster, but as you increase the size of your matrices, `f2!` should be a lot more efficient than `f1!`, as it will scale by O(n^2) instead of O(n^3).

1 Like

Why O(n^3)? It looks like the exact same operations additions and multiplications to me, just in a different order (Note also the multiplication is broadcasted).

1 Like

Hey

I ran now into this same issue, with a new spin: I need a fast way to obtain the diagonal elements only of a matrix multiplication for sparse matrices. Would really appreciate some help!

My problem currently is that the solutions proposed here seem to allocate: function `f2!` allocates even for normal (dense) matrices; and both `f1` and `f2` allocate when the matrices are sparse. My test is:

``````using BenchmarkTools, LinearAlgebra, Test, SparseArrays

function f1!(tmp, result, A, B)
tmp .= A .* B'
sum!(result, tmp)
nothing
end

function f2!(result, A, B)
result .= dot.(eachrow(A), eachcol(B))
nothing
end

@testset "Diagonal of matrix multiplication" begin
N = 1000
@testset "Dense case" begin
A = randn(N,N); B = randn(N,N);
tmp = randn(N, N); result = randn(N);

@btime f1!(\$tmp, \$result, \$A, \$B);
result_1 = deepcopy(result);

@btime f2!(\$result, \$A, \$B);
result_2 = deepcopy(result);

@test result_1 ≈ result_2
end

@testset "Sparse case" begin
d = 1e-4
sA = sprand(N,N, d); sB = sprand(N,N, d);
stmp = sA * sB'; sresult = randn(N);

@btime f1!(\$stmp, \$sresult, \$sA, \$sB);
result_1s = deepcopy(sresult);

@btime f2!(\$sresult, \$sA, \$sB);
result_2s = deepcopy(sresult);

@test result_1s ≈ result_2s
end
end

``````

Output for me is:

``````  3.139 ms (0 allocations: 0 bytes)
2.137 ms (4 allocations: 78.22 KiB)
12.051 μs (9 allocations: 17.78 KiB)
22.253 μs (4 allocations: 140.72 KiB)
``````

This multiplication is the main bottleneck in my code, as it is needed in a loop with several iterations. I have lots of memory available, and want speed. Would these allocations then be concerning? Are there improvements I can make?

Thanks a lot!

These days, you may use `Tullio`, which seems faster on my laptop, and doesn’t allocate either:

``````julia> using Tullio

julia> function f3!(result, A, B)
@tullio result[i] = A[i,j]*B[j,i]
nothing
end
f3! (generic function with 1 method)

julia> @btime f1!(\$tmp, \$result, \$A, \$B);
2.642 ms (0 allocations: 0 bytes)

julia> @btime f2!(\$result, \$A, \$B);
1.775 ms (4 allocations: 78.22 KiB)

julia> @btime f3!(\$result, \$A, \$B);
1.575 ms (0 allocations: 0 bytes)
``````

I have not checked this for sparse matrices though

It seems removing the broadcasting gets rid of the allocations in `f2`:

``````julia> function f2!(result, A, B)
for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
result[ind] = dot(r, c)
end
nothing
end
f2! (generic function with 1 method)

julia> @btime f2!(\$result, \$A, \$B);
1.728 ms (0 allocations: 0 bytes)
``````
2 Likes

Oh that’s really cool! Thanks for sharing! Tullio is indeed faster than the others for the whole range of N’s I tried (10 to 10000) for dense matrices.

But for sparse matrices it is much slower, I’m unsure why : (
For N = 1000, for instance, Tullio takes 3ms while `f2!` takes 47 μs.

1 Like

The reason `f3!` isn’t as fast as `f2!` is that `Tullio` doesn’t take advantage of sparsness. `dot(...)` in `f2!` is specialized for sparse matrices.
But sparse tricks are not symmetric with respect to columns vs. rows, therefore if you can generate the `B` matrix in a transposed form, you can achieve another 2x-ish speedup using:

``````function f2b!(result, A, Bt)
for (ind, r, c) in zip(eachindex(result), eachcol(A), eachcol(B))
result[ind] = dot(r, c)
end
nothing
end
``````

In a benchmark with sparse matrices with 1% of elements nonzero, I get the following:

``````julia> @btime f2!(\$result, \$M1, \$M2)
253.127 μs (0 allocations: 0 bytes)

julia> @btime f2b!(\$result, \$M1, \$M2)
123.347 μs (0 allocations: 0 bytes)
``````

Of course, the more the matrices are sparse, the more it helps.

1 Like

Ah, thanks for the explanation! Could you please share the test you made? For me here the speedup only occurs for dense matrices; for sparse, the transpose method `f2b!` is considerably slower than `f2!` but I may be messing something up. I used the function `f2b!` as

``````
for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
result[ind] = dot(r, c)
end
nothing
end
``````

(the code that you shared for yours is using B and eachcol(A), which I don’t think is correct).

Tests
``````using BenchmarkTools, LinearAlgebra, Test, SparseArrays
using Tullio, LoopVectorization

function f1!(tmp, result, A, B)
tmp .= A .* B'
sum!(result, tmp)
nothing
end

function f2!(result, A, B)
result .= dot.(eachrow(A), eachcol(B))
nothing
end

for (ind, r, c) in zip(eachindex(result), eachrow(A), eachcol(B))
result[ind] = dot(r, c)
end
nothing
end

for (ind, r, c) in zip(eachindex(result), eachrow(A), eachrow(Bt))
result[ind] = dot(r, c)
end
nothing
end

function f3!(result, A, B)
@tullio result[i] = A[i,j]*B[j,i]
nothing
end

@testset verbose = true "Diagonal of matrix multiplication" begin
N = 10000
@testset "Dense case" begin
A = randn(N,N); B = randn(N,N);
tmp = randn(N, N); result = randn(N);

@btime f1!(\$tmp, \$result, \$A, \$B);
result_1 = deepcopy(result);

@btime f2!(\$result, \$A, \$B);
result_2 = deepcopy(result);

result_2ndb = deepcopy(result);

Bt = transpose(B)
result_2ndb_tran = deepcopy(result);

@btime f3!(\$result, \$A, \$B);
result_3 = deepcopy(result);

@test result_1 ≈ result_2
@test result_1 ≈ result_2ndb
@test result_1 ≈ result_2ndb_tran
@test result_1 ≈ result_3
end

@testset "Sparse case" begin
d = 1e-4
sA = sprand(N,N, d); sB = sprand(N,N, d);
stmp = sA * sB'; sresult = randn(N);

@btime f1!(\$stmp, \$sresult, \$sA, \$sB);
result_1s = deepcopy(sresult);

@btime f2!(\$sresult, \$sA, \$sB);
result_2s = deepcopy(sresult);

result_2ndbs = deepcopy(sresult);

sBt = transpose(sB)
result_2ndb_tran_s = deepcopy(sresult);

@btime f3!(\$sresult, \$sA, \$sB);
result_3s = deepcopy(sresult);

@test result_1s ≈ result_2s
@test result_1s == result_2ndb_tran_s
@test result_1s == result_2ndbs
@test result_1s == result_3s
end
end
``````

Gives

Results
``````#Dense
##N = 10
# 197.969 ns (0 allocations: 0 bytes)
# 348.294 ns (2 allocations: 992 bytes)
# 221.057 ns (0 allocations: 0 bytes)
# 156.080 ns (0 allocations: 0 bytes) **almost 2x speedup over non-transpose
# 71.441 ns (0 allocations: 0 bytes) ** Tullio
##N=1000
# 3.363 ms (0 allocations: 0 bytes)
# 2.207 ms (4 allocations: 78.22 KiB)
# 2.154 ms (0 allocations: 0 bytes)
# 2.316 ms (0 allocations: 0 bytes)
# 1.625 ms (0 allocations: 0 bytes) ** Tullio
##N = 10000
# 1.219 s (0 allocations: 0 bytes)
# 1.230 s (4 allocations: 781.34 KiB)
# 1.020 s (0 allocations: 0 bytes)
# 1.138 s (0 allocations: 0 bytes)
# 218.114 ms (0 allocations: 0 bytes) ***

#Sparse
##N=10
# 383.941 ns (7 allocations: 544 bytes)
# 282.670 ns (2 allocations: 1.59 KiB)
# 109.996 ns (0 allocations: 0 bytes) **
# 329.524 ns (0 allocations: 0 bytes) (transpose is slower!)
# 300.054 ns (0 allocations: 0 bytes)
##N = 1000
# 11.805 μs (9 allocations: 17.78 KiB)
# 22.150 μs (4 allocations: 140.72 KiB)
# 11.354 μs (0 allocations: 0 bytes) **
# 2.840 ms (0 allocations: 0 bytes)
# 3.092 ms (0 allocations: 0 bytes)
##N = 10000
# 395.792 μs (11 allocations: 311.41 KiB)
# 481.284 μs (4 allocations: 1.37 MiB)
# 346.121 μs (0 allocations: 0 bytes) **
# 1.405 s (0 allocations: 0 bytes)
# 649.908 ms (0 allocations: 0 bytes)
``````

In the `f2b!` case you are using `eachrow` twice instead of `eachcol` twice. That’s the big difference (switch argument positions to make it same result if necessary).

2 Likes

Ah great! Indeed then there is a speedup! Thanks a lot!