When you construct `Aa`, construct it transposed so that you don’t have to pass `Aa'` to `ldiv!`; I suspect that you are hitting a slow generic fallback (not LAPACK). Then, at the end, multiply `Aa' * B` (there are fast BLAS calls for multiplying by transposed matrices).
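
Roughly what I have in mind, as a minimal sketch. I don't know what your actual factorization or sizes are, so `M`, `F`, `B`, `Aa_t`, and the dimensions below are all made up for illustration; the only point is that `ldiv!` receives a plain dense `Matrix` and the adjoint appears only in the final product.

```julia
using LinearAlgebra

# Hypothetical sizes and data, not taken from your code.
n, m, k = 5, 3, 2
M = rand(n, n) + n * I      # assumed system matrix (diagonally dominant, so safely invertible)
F = lu(M)                   # assumed factorization that you pass to ldiv!
B = rand(n, k)              # assumed matrix used in the final product

# Build Aa already transposed: a plain n × m Matrix, not a lazy Adjoint wrapper.
Aa_t = rand(n, m)

# Reference result computed out of place, before Aa_t is overwritten.
expected = (M \ Aa_t)' * B

# In-place solve on a dense Matrix; this overwrites Aa_t with M \ Aa_t.
ldiv!(F, Aa_t)

# The adjoint is applied lazily here, inside the matrix product.
result = Aa_t' * B
```

Here `result` should agree with `expected` up to rounding. To check which method you are actually hitting in your own code, `@which ldiv!(F, Aa')` versus `@which ldiv!(F, Aa_t)` will show whether the adjoint argument lands in a different method.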