Fast `diag(A' * B * A)`

Sorry to not catch this all at once.

Shouldn’t this be += and initialized to D = zeros(V, size(A,2))? Or maybe use a temporary variable Di = zero(V) to do the accumulation before saving to D[i], if the compiler won’t hoist that on its own…

As written, the loop only does “work” on the final pair of j,k. The compiler may not be smart enough to have caught on, so it may not have a performance impact, but I’m suspicious that this function manages to be so much faster than even B*A despite that it should be doing “more” calculations. The bulk of that is likely the small dimension k=10 giving a meaningful opportunity to beat BLAS.

EDIT: I think there may also be an indexing bug in the expression above. It looks like something more like

D[i] += A[j,i]' * B[j,k] * A[k,i]

is correct but I derived this using different variables and might have messed up the translation.