I have two matrices `A` and `B`, and I need to compute the diagonal elements of the product `A*B` as fast as possible and store them in a pre-allocated vector.

What’s the fastest way to do this? I am looking for something faster than a hand-written loop, for example calling an appropriate BLAS routine, with the input restructured first if needed.
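To make the target concrete: entry `i` of the result is the dot product of row `i` of `A` with column `i` of `B`, so the full product never has to be formed. Below is a minimal sketch of that computation in Python with NumPy. The language is an assumption, since the question does not name one, and `diag_of_product` is a hypothetical helper name. It is meant as a reference for what should be computed, not as a claim about the fastest method.

```python
import numpy as np

def diag_of_product(A, B, out):
    """Write diag(A @ B) into the pre-allocated 1-D array `out`.

    Hypothetical helper for illustration: A is (m, n), B is (n, m),
    and out has length m.
    """
    # out[i] = sum_k A[i, k] * B[k, i]; the full m-by-m product is never built.
    np.einsum("ik,ki->i", A, B, out=out)
    return out

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
d = np.empty(2)          # pre-allocated result vector
diag_of_product(A, B, d)
print(d)
```

This costs on the order of `m*n` multiply-adds, versus `m*m*n` for forming the whole product and then reading off its diagonal, which is why a general matrix-matrix multiply is likely the wrong tool here.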