I have pairs of square matrix multiplications which share one of the matrices, for example:
AB,
CB,
where A, B and C are real square matrices. My question is: is it more efficient to compute both multiplications separately or is it more efficient to do the following?
[A; C]*B
that is, to form the rectangular matrix [A; C] and then multiply it by B.
Perhaps I wasn't clear: I do want the results of the two products separately, AB and CB. What I'm asking is whether it is more efficient to do the two conventional square-square multiplications separately, or to use a single rectangular multiplication, do the whole thing in one step, and then retrieve the two square blocks from the rectangular result.
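For concreteness, here is what the two options look like in NumPy (just a stand-in for whatever BLAS-backed language you're using); both produce the same numbers, so the question is purely about speed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))

# Option 1: two separate square-square multiplications.
AB = A @ B
CB = C @ B

# Option 2: stack A on top of C, do one rectangular
# multiplication, then split the (2n, n) result back in half.
stacked = np.vstack([A, C]) @ B
AB2, CB2 = stacked[:n], stacked[n:]

print(np.allclose(AB, AB2), np.allclose(CB, CB2))  # True True
```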
This mostly depends on the multithreading of your BLAS implementation. If you're using multiple BLAS threads, then [A; C] * B will be able to start using those threads at smaller matrix sizes than the separate multiplications would; but once you get to big enough matrix sizes, it becomes advantageous again to do the separate multiplications. E.g.:
Here we see that for N around 100 it's a bit better (on my machine) to do the concatenated multiplication, but outside of that relatively narrow range it's preferable to do the separate muls.
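The plot above isn't reproduced here, but a rough timing sketch along these lines (NumPy version; the sizes, thread counts, and crossover point are machine- and BLAS-dependent) shows how to compare the two yourself:

```python
import time
import numpy as np

def bench(f, reps=10):
    # Best-of-reps wall time: a crude but serviceable micro-benchmark.
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        f()
        best = min(best, time.perf_counter() - t0)
    return best

rng = np.random.default_rng(1)
for n in (50, 100, 500, 1000):
    A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
    AC = np.vstack([A, C])  # pre-stacked, so stacking cost isn't timed
    t_sep = bench(lambda: (A @ B, C @ B))
    t_cat = bench(lambda: AC @ B)
    print(f"n={n:5d}  separate: {t_sep:.2e}s  concatenated: {t_cat:.2e}s")
```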
The other thing, though, is what exactly you mean by doing the multiplications separately. If you just do them separately and store the results in separate locations, e.g. (A * B, C * B), then yes, that's usually faster; but if you're going to concatenate the results at the end, e.g. [A * B; C * B], then it's basically always best to do [A; C] * B.
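One related detail, at least in NumPy (other array languages may copy here): even if you do the single rectangular multiplication, splitting the result back into its two halves is just slicing, which returns zero-copy views, so the "retrieval" step itself costs essentially nothing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))

result = np.vstack([A, C]) @ B
AB, CB = result[:n], result[n:]  # basic slices: views, no data copied

print(AB.base is result, CB.base is result)  # True True
```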
Thank you for your time and effort! I need the results separately, i.e. (A*B and C*B), so, as you've shown, computing A*B and C*B is the way to go, rather than computing [A; C]*B and then retrieving the two results from the concatenated output. Thanks again!