I have two matrices,
A and B, and I need to compute the diagonal elements of the product
A*B as fast as possible and store them in a pre-allocated vector.
What’s the fastest way to do this? By that I mean something faster than writing my own loop, e.g., calling an appropriate BLAS routine and restructuring the input if needed.
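
To be concrete, the baseline I would like to beat is roughly the plain double loop below. It is only a sketch, written in Python with NumPy for illustration: the function name `diag_of_product_loop` and the argument names are placeholders, and it assumes A is n-by-m and B is m-by-n. The point is that entry i of the result is just the dot product of row i of A with column i of B, so the full product A*B never needs to be formed.

```python
import numpy as np


def diag_of_product_loop(A, B, out):
    """Write the diagonal of A @ B into the pre-allocated vector `out`.

    Hypothetical baseline: assumes A has shape (n, m), B has shape (m, n),
    and `out` has length n. Only the n diagonal entries are computed.
    """
    n = out.shape[0]
    m = A.shape[1]
    for i in range(n):
        acc = 0.0
        # Dot product of row i of A with column i of B.
        for k in range(m):
            acc += A[i, k] * B[k, i]
        out[i] = acc
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    B = rng.standard_normal((3, 4))
    out = np.empty(4)  # the pre-allocated result vector
    diag_of_product_loop(A, B, out)
    # Sanity check against forming the full product first.
    print(np.allclose(out, np.diag(A @ B)))
```

This does n*m multiply-adds, versus n*n*m for computing the whole product and then taking its diagonal, so any faster alternative should keep that operation count.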