The matrix A has orthonormal columns, but is rectangular (tall). Currently, I allocate caches C and D, and evaluate

mul!(D, A', mul!(C, B, A))

In my use case, A and B are somewhat large matrices, so I’d ideally like to avoid allocating the intermediate matrix C. I was wondering if there’s a way to use the structure of A to come up with a smarter algorithm? Unfortunately, B is a general (maybe complex) matrix, and doesn’t have an obvious structure.
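One way to shrink the intermediate without changing the arithmetic is to process the columns of A in blocks, so the scratch buffer is n×b rather than n×k. The following is only a sketch: the helper name `projected!` and the buffer `buf` are hypothetical, and the code does not exploit the orthonormality of A at all; it just trades the full C for a narrower scratch matrix.

```julia
using LinearAlgebra

# Hypothetical helper: computes D = A' * B * A one column block at a time.
# The block width is the number of columns of the scratch buffer `buf`.
function projected!(D, A, B, buf)
    k = size(A, 2)
    b = size(buf, 2)
    for j in 1:b:k
        cols = j:min(j + b - 1, k)
        Cb = view(buf, :, 1:length(cols))    # scratch for B * A[:, cols]
        mul!(Cb, B, view(A, :, cols))
        mul!(view(D, :, cols), A', Cb)       # D[:, cols] = A' * (B * A[:, cols])
    end
    return D
end

n, k = 200, 20
A = Matrix(qr(randn(ComplexF64, n, k)).Q)    # tall matrix with orthonormal columns
B = randn(ComplexF64, n, n)                  # general complex square matrix
D = Matrix{ComplexF64}(undef, k, k)
buf = Matrix{ComplexF64}(undef, n, 4)        # n×4 scratch instead of n×k
projected!(D, A, B, buf)
```

The flop count is the same as for the two nested mul! calls; only the size of the temporary changes, and very narrow blocks will likely lose some BLAS efficiency.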

By your setup, we know that B is square, but we can’t assume anything about symmetry?

Is A very “thin” relative to B? Could it be feasible to compute a truncated SVD to the number of columns of A, then transform the singular vectors by A?
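A rough sketch of that idea, assuming the SVD in question is that of B and that it is truncated to rank r: with B ≈ U S V', the product becomes A' B A ≈ (A' U) S (V' A). The function name `truncated_projection` is made up for illustration, and the result is only an approximation unless B actually has rank at most r.

```julia
using LinearAlgebra

# Hypothetical helper: approximates A' * B * A from a precomputed SVD of B,
# keeping only the leading r singular triplets.
function truncated_projection(A, F::SVD, r)
    U  = F.U[:, 1:r]
    S  = F.S[1:r]
    Vt = F.Vt[1:r, :]
    return (A' * U) * Diagonal(S) * (Vt * A)   # (k×r) * (r×r) * (r×k)
end

n, k = 200, 20
A = Matrix(qr(randn(n, k)).Q)    # tall matrix with orthonormal columns
B = randn(n, n)                  # general square matrix, no structure assumed

F = svd(B)                       # dense SVD, computed once and reused
D_approx = truncated_projection(A, F, k)

# Relative error against the exact product; for a generic B this can be large.
relerr = norm(D_approx - A' * B * A) / norm(A' * B * A)
```

Whether this pays off depends on how fast the singular values of B decay and on how many times the factorization is reused; for a general full-rank B the truncation error may well be unacceptable.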

If you need to perform the multiplication many times, and A has an SVD A = U S V' that can be truncated without altering the representation of A, then it may be worth paying the one-time cost of the SVD: you gain it back when computing A' B A, since you then multiply smaller matrices.