Fast diag(A'*B)

Is there a fast way to compute dot products of the columns of two similar matrices? MWE:

using LinearAlgebra, BenchmarkTools

f1(A, B) = diag(A'*B)           # suboptimal, but surprisingly fast
f2(A, B) = map(dot, eachcol(A), eachcol(B)) # map

A = randn(150, 25)
B = randn(size(A)...)

@belapsed f1($A, $B)            # 9.2 μs
@belapsed f2($A, $B)            # 1.5 μs

I can write a loop (sketched below), but I was hoping there would be something in BLAS for this already; I could not find anything.
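For reference, a minimal sketch of the kind of loop I have in mind (coldot is just a placeholder name; the conj only matters for complex inputs):

using LinearAlgebra

function coldot(A, B)
    @assert size(A) == size(B)
    out = zeros(promote_type(eltype(A), eltype(B)), size(A, 2))
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        out[j] += conj(A[i, j]) * B[i, j]   # accumulate the column-wise dot product
    end
    return out
end

coldot(A, B) ≈ diag(A' * B)     # true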

This isn’t quite what you want, but since it’s faster than f1, Distances.jl must be doing something right…

using Distances
@belapsed colwise(CosineDist(), $A, $B)  # 3.2 μs

You can do sum(A .* B; dims = 1), assuming the ' is a plain transpose (otherwise you need an extra complex conjugate on A). This is just the fact that (A'B)_{ii} = \sum_j (A')_{ij} B_{ji} = \sum_j A_{ji} B_{ji} in the real case.

Edit: I’m on my phone, otherwise I’d check it numerically and benchmark. But maybe it’s fast?

Edit2: fixed index in sums
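For the record, the identity checks out numerically (using the A and B from the MWE above; the complex variant illustrates the extra conjugate):

vec(sum(A .* B; dims = 1)) ≈ diag(A' * B)            # true for real matrices

Ac = randn(ComplexF64, 150, 25); Bc = randn(ComplexF64, 150, 25)
vec(sum(conj(Ac) .* Bc; dims = 1)) ≈ diag(Ac' * Bc)  # conj needed for complex A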


That forms an intermediate matrix, and is much worse than map(dot, ...):

f4(A, B) = vec(sum(A .* B; dims = 1))
@belapsed f4($A, $B)            # 3.5 μs

If there is no BLAS for this, then I guess I will stick with map(dot, ...).

I don’t think there’s much optimization that can be done here, other than what’s already done in dot. You could make the loop over columns multi-threaded, but this is going to be memory-bound anyway, so…
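For completeness, a sketch of such a threaded column loop (the function name is mine; with matrices this small, the threading overhead may well dominate):

using LinearAlgebra, Base.Threads

function coldot_threaded(A, B)
    out = Vector{promote_type(eltype(A), eltype(B))}(undef, size(A, 2))
    @threads for j in axes(A, 2)
        out[j] = @views dot(A[:, j], B[:, j])   # columns split across threads
    end
    return out
end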


(Thanks for fixing the index in your reply.) If you need to do this many times, you could preallocate that intermediate matrix, though I’m not sure it would beat the map option. I’m not a BLAS expert by any means, but the same trick has been useful for me in the full-trace case (a full sum instead of per-column sums).
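For what it’s worth, a minimal sketch of that preallocated variant (the names coldot!, out, and tmp are mine, and I have not benchmarked it against map(dot, ...)):

function coldot!(out, tmp, A, B)
    tmp .= A .* B                   # elementwise product into the preallocated buffer
    sum!(reshape(out, 1, :), tmp)   # sum over rows, writing the result into out
    return out
end

out = similar(A, size(A, 2))
tmp = similar(A)
coldot!(out, tmp, A, B)             # ≈ diag(A' * B) for real A and B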

I can’t find a solution that outperforms map(dot, ...) here on runtime, but if you’re worried about allocations, one option would be this:

using TensorCast
f3(A, B) = @reduce C[i] := sum(j) A'[i, j] * B[j, i] lazy

A = randn(150, 25)
B = randn(size(A)...)

@btime f1($A, $B) # 9.777 μs (2 allocations: 5.34 KiB)
@btime f2($A, $B) # 1.475 μs (58 allocations: 2.78 KiB)
@btime f3($A, $B) # 8.523 μs (16 allocations: 688 bytes)

so you end up trading roughly 6x the runtime for about 1/4 of the memory usage, which is often not worth it. The runtime overhead of TensorCast here drops to ~4 µs if you delete the lazy option, but the memory usage also goes up quite a bit.

Depending on your actual use case, if you’re spending a lot of time in the GC, then maybe TensorCast.jl is useful here. The other thing to consider is whether preallocating would help you.
