Is there a fast way to compute the dot products of the columns of two equally sized matrices? MWE:
```julia
using LinearAlgebra, BenchmarkTools

f1(A, B) = diag(A' * B)                      # suboptimal: computes the full 25×25 product, but surprisingly fast
f2(A, B) = map(dot, eachcol(A), eachcol(B))  # one dot product per column pair

A = randn(150, 25)
B = randn(size(A)...)

@belapsed f1($A, $B)  # 9.2 μs
@belapsed f2($A, $B)  # 1.5 μs
```
I can write a loop, but I was hoping there would be something in BLAS for this already; I could not find anything.
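(For reference, the loop alternative mentioned above could be sketched as follows; `coldots` is a hypothetical name, not from the thread:)

```julia
# a minimal loop sketch; `coldots` is a hypothetical name
function coldots(A, B)
    out = Vector{promote_type(eltype(A), eltype(B))}(undef, size(A, 2))
    for j in axes(A, 2)
        out[j] = @views dot(A[:, j], B[:, j])  # dot conjugates its first argument
    end
    return out
end
```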
You can do `sum(A .* B; dims = 1)`, assuming the `'` is a transpose (otherwise you need an extra complex conjugate on `A`). This is just $(A'B)_{ii} = \sum_j (A')_{ij} B_{ji} = \sum_j A_{ji} B_{ji}$ in the real case.
Edit: I’m on my phone, otherwise I’d check it numerically and benchmark. But maybe it’s fast?
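(Checked in code, that suggestion could be sketched like this, with `vec` to flatten the 1×n result and `conj` to match `dot` in the complex case; `f3` is a hypothetical name:)

```julia
# sketch of the suggestion above; for real inputs conj is a no-op,
# and broadcast fusion keeps conj.(A) .* B a single pass
f3(A, B) = vec(sum(conj.(A) .* B; dims = 1))

f3(A, B) ≈ f2(A, B)  # true for the MWE above
```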
I don’t think there’s much optimization to be done here other than what’s already done in `dot`. You could make the loop over columns multi-threaded (see the sketch below), but this is going to be memory-bound anyway, so…
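(A multi-threaded version of the column loop might look like the following sketch, assuming Julia was started with multiple threads; `coldots_threaded` is a hypothetical name:)

```julia
using LinearAlgebra

function coldots_threaded(A, B)
    out = Vector{promote_type(eltype(A), eltype(B))}(undef, size(A, 2))
    Threads.@threads for j in axes(A, 2)       # columns are split across tasks
        out[j] = @views dot(A[:, j], B[:, j])
    end
    return out
end
```

For the 150×25 MWE the task overhead will likely dominate; this only pays off for much larger inputs, and even then memory bandwidth caps the gain.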
(Thanks for fixing the index in your reply.) If you need to do this many times, you could preallocate that intermediate `A .* B` array, along the lines of the sketch below. I’m not sure that would beat the `map` option, though. I’m not a BLAS expert by any means, but that trick has been useful for me in the full-trace case (a full `sum` instead of column-wise sums).
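(A preallocated sketch of that idea; hypothetical names, and it assumes real eltypes so no conjugation is needed:)

```julia
# reuse buffers across calls; sum! overwrites `out` with the column sums
function coldots!(out, tmp, A, B)
    tmp .= A .* B                  # elementwise product into a preallocated buffer
    sum!(reshape(out, 1, :), tmp)  # column sums, written through the reshape into out
    return out
end

out = Vector{Float64}(undef, size(A, 2))
tmp = similar(A)
coldots!(out, tmp, A, B)
```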
So you end up trading roughly 6x the runtime for roughly 1/4 the memory usage, which is often not worth it. The runtime overhead of TensorCast here drops to ~4 μs if you delete the `lazy` option, but the memory usage also goes up quite a bit.
Depending on your actual use case, if you’re spending a lot of time in the GC, then TensorCast.jl may be useful here. The other thing to consider is whether preallocating will help you.
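(For context, the TensorCast.jl variant being benchmarked above was not shown in the thread; my best guess at its shape, based on TensorCast’s `@reduce` syntax and the `lazy` option mentioned, is:)

```julia
using TensorCast

# a reconstruction, not the original code; `lazy` (via LazyArrays) avoids
# materializing the elementwise product before the reduction
f4(A, B) = @reduce out[j] := sum(i) A[i, j] * B[i, j] lazy
```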