Fast diag(A'*B)

Is there a fast way to compute the dot products of corresponding columns of two same-sized matrices? MWE:

using LinearAlgebra, BenchmarkTools

f1(A, B) = diag(A'*B)                       # naive: forms the full A'*B
f2(A, B) = map(dot, eachcol(A), eachcol(B)) # dot over column pairs

A = randn(150, 25)
B = randn(size(A)...)

@belapsed f1($A, $B)            # 9.2 μs
@belapsed f2($A, $B)            # 1.5 μs

I could write a loop, but I was hoping there is already something in BLAS for this; I could not find anything.
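For reference, here is what such a loop might look like (a sketch; `coldots` is a made-up name, and the `conj` makes it also work for complex inputs):

```julia
using LinearAlgebra

# Column-wise dot products via an explicit loop (illustrative sketch).
function coldots(A::AbstractMatrix, B::AbstractMatrix)
    size(A) == size(B) || throw(DimensionMismatch("A and B must have the same size"))
    out = Vector{promote_type(eltype(A), eltype(B))}(undef, size(A, 2))
    @inbounds for j in axes(A, 2)
        s = zero(eltype(out))
        @simd for i in axes(A, 1)
            s += conj(A[i, j]) * B[i, j]  # conj is a no-op for real eltypes
        end
        out[j] = s
    end
    return out
end
```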

This isn’t quite what you want, but since it’s faster than f1, Distances.jl must be doing something right…

using Distances
@belapsed colwise(CosineDist(), A, B)  # 3.2 μs

You can do sum(A .* B; dims = 1), assuming the ' is a transpose (otherwise you need an extra complex conjugate on A). This is just that (A'B)_{ii} = \sum_j (A')_{ij} B_{ji} = \sum_j A_{ji} B_{ji} in the real case.

Edit: I’m on my phone, otherwise I’d check it numerically and benchmark. But maybe it’s fast?

Edit2: fixed index in sums
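A quick numeric check of that identity, including the complex case mentioned above (a sketch):

```julia
using LinearAlgebra

A = randn(150, 25)
B = randn(size(A)...)
# Real case: diag(A'*B)[i] == sum_j A[j,i] * B[j,i]
@assert vec(sum(A .* B; dims = 1)) ≈ diag(A' * B)

# Complex case: A' conjugates, so conjugate A explicitly.
Ac = randn(ComplexF64, 10, 4)
Bc = randn(ComplexF64, 10, 4)
@assert vec(sum(conj.(Ac) .* Bc; dims = 1)) ≈ diag(Ac' * Bc)
```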


That forms an intermediate matrix, and is much worse than map(dot, ...):

f4(A, B) = vec(sum(A .* B; dims = 1))
@belapsed f4(A, B)              # 3.5 μs

If there is no BLAS for this, then I guess I will stick with map(dot, ...).

I don’t think there’s much optimization to be done here beyond what’s already done in dot. You could multi-thread the loop over columns, but this is going to be memory-bound anyway, so…
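For completeness, a threaded version of the column loop could look like this (a sketch; `tdots` is a made-up name, and it only pays off if Julia was started with multiple threads and the matrices are large):

```julia
using LinearAlgebra

# Multi-threaded column-wise dot products (sketch).
function tdots(A::AbstractMatrix, B::AbstractMatrix)
    out = Vector{promote_type(eltype(A), eltype(B))}(undef, size(A, 2))
    Threads.@threads for j in axes(A, 2)
        out[j] = @views dot(A[:, j], B[:, j])  # each column pair is independent
    end
    return out
end
```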


(Thanks for fixing the index in your reply.) If you need to do this many times, you could preallocate that intermediate matrix. Not sure if that would beat the map option, though. I’m not a BLAS expert by any means, but that trick has been useful in the full-trace case for me (full sum instead of column sums).
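A sketch of that preallocation idea (`coldots!`, `C`, and `out` are made-up names; `C` holds the elementwise product, `out` receives the column sums):

```julia
using LinearAlgebra

# Column-wise dot products with preallocated buffers (sketch, real case).
function coldots!(out::AbstractVector, C::AbstractMatrix,
                  A::AbstractMatrix, B::AbstractMatrix)
    C .= A .* B                    # elementwise product into preallocated buffer
    sum!(reshape(out, 1, :), C)    # column sums, written into `out`
    return out
end

A = randn(150, 25); B = randn(size(A)...)
C = similar(A)
out = Vector{Float64}(undef, size(A, 2))
coldots!(out, C, A, B)
```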

I can’t find a solution that outperforms map(dot, ...) here on runtime, but if you’re worried about allocations, one option would be this:

using TensorCast
f3(A, B) = @reduce C[i] := sum(j) A'[i, j] * B[j, i] lazy

A = randn(150, 25)
B = randn(size(A)...)

@btime f1($A, $B) # 9.777 μs (2 allocations: 5.34 KiB)
@btime f2($A, $B) # 1.475 μs (58 allocations: 2.78 KiB)
@btime f3($A, $B) # 8.523 μs (16 allocations: 688 bytes)

so you end up trading ~6x runtime speed for ~1/4 the memory usage. Often not worth it. The runtime overhead of TensorCast here goes down to ~4µs if you delete the lazy option, but the memory usage also goes up quite a bit.

Depending on your actual use case, if you’re spending a lot of time in the GC, then maybe TensorCast.jl is useful here. The other thing to consider is whether preallocating would help you.
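If preallocation is on the table, the map approach itself can be made in-place with map! (a sketch, assuming a recent Julia where eachcol returns an AbstractVector of column views):

```julia
using LinearAlgebra

A = randn(150, 25)
B = randn(size(A)...)
out = Vector{Float64}(undef, size(A, 2))

# In-place column-wise dot products; the result vector is reused across calls.
map!(dot, out, eachcol(A), eachcol(B))
```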