Sorry to not catch this all at once.
Shouldn’t this be +=
and initialized to D = zeros(V, size(A,2))
? Or maybe use a temporary variable Di = zero(V)
to do the accumulation before saving to D[i]
, if the compiler won’t hoist that on its own…
As written, the loop only does “work” on the final pair of j,k
. The compiler may not be smart enough to have caught on, so it may not have a performance impact, but I’m suspicious that this function manages to be so much faster than even B*A
despite that it should be doing “more” calculations. The bulk of that is likely the small dimension k=10
giving a meaningful opportunity to beat BLAS.
EDIT: I think there may also be an indexing bug in the expression above. It looks like something more like
D[i] += A[j,i]' * B[j,k] * A[k,i]
is correct but I derived this using different variables and might have messed up the translation.