When you construct Aa, construct it transposed so that you can pass it to ldiv! as is, rather than passing Aa'. I suspect that the Aa' call is hitting a slow generic fallback instead of LAPACK.
Then, at the end, multiply Aa' * B (there are fast BLAS calls for multiplying by transposed matrices).
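
Here is a rough sketch of what I mean, not a drop-in replacement for your code. The names `A`, `At`, `B`, `X`, and `C` and the sizes are made up for illustration, and I am assuming real dense `Float64` matrices and an LU factorization, since I don't know exactly how your ldiv! call is set up:

```julia
using LinearAlgebra

n, k = 4, 3
A = rand(n, n) + n * I       # hypothetical stand-in for the matrix you build today
B = rand(n, k)               # hypothetical right-hand side

# Store the transposed matrix as an ordinary Matrix.
# In real code you would fill At directly instead of transposing A afterwards.
At = Matrix(transpose(A))

F = lu(At)                   # factorize the plain matrix, no lazy wrapper involved
X = copy(B)
ldiv!(F, X)                  # in place: X now holds At \ B, i.e. A' \ B for real A

# At the end, multiply by the lazy adjoint. For dense Float64 matrices this
# should go through BLAS with a transpose flag, without materializing At'.
C = similar(B)
mul!(C, At', X)              # C = At' * X, which equals A * X
```

The point is only that the lazy `'` wrapper appears in the multiplication, where it is cheap, and not in the solve.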