Unexpected poor FOR loops performance

If you’re doing O(n^3) work, and n is sufficiently large, it’s almost always worth the O(n^2) memory allocation to call a fast BLAS routine. That being said, if you are doing this over and over again in an iterative solver, you can probably pre-allocate the arrays you need once and reuse them across the BLAS/LAPACK calls.
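As a rough sketch of the pre-allocation idea (the function name `repeated_multiply!` and the normalization step are hypothetical placeholders, not anything from your code), the output buffer can be allocated once and then filled in place with `LinearAlgebra.mul!`:

```julia
using LinearAlgebra

# Hypothetical iterative loop: C is allocated once by the caller and reused.
function repeated_multiply!(C, A, B, iterations)
    for _ in 1:iterations
        mul!(C, A, B)        # C = A * B, written into the existing buffer (BLAS for dense Float64)
        B .= C ./ norm(C)    # placeholder in-place update; a real solver would do its own step here
    end
    return C
end

n = 200
A = rand(n, n)
B = rand(n, n)
C = similar(A)               # the only O(n^2) allocation, done outside the loop
repeated_multiply!(C, A, B, 10)
```

Note that `mul!` requires that `C` not alias `A` or `B`, which is why the update copies back into `B` rather than swapping references into the product.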

PS. Note that this has nothing to do with for loops or Julia, and everything to do with the tricks that are required to make matrix–matrix multiplications fast. These kinds of optimizations can be done in Julia too (e.g. see Octavian.jl), but they require a different algorithm than textbook-style triply nested loops; see the thread "Julia matrix-multiplication performance".
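For reference, this is roughly what I mean by the textbook-style version (a sketch; `naive_matmul!` is just an illustrative name). It gives the same answer as `A * B`, but it does none of the cache blocking or explicit SIMD kernels that optimized libraries rely on, so for large n it will typically be much slower:

```julia
# Textbook triple loop, ordered so the innermost index walks down columns
# (Julia arrays are column-major). Correct, but not cache-blocked.
function naive_matmul!(C, A, B)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2 && size(C) == (m, n)
    fill!(C, zero(eltype(C)))
    for j in 1:n, l in 1:k, i in 1:m
        C[i, j] += A[i, l] * B[l, j]
    end
    return C
end

A = rand(50, 40)
B = rand(40, 30)
C = naive_matmul!(zeros(50, 30), A, B)
C ≈ A * B    # same result as the BLAS-backed product, up to rounding
```

Timing both with something like BenchmarkTools.jl at a few sizes should make the gap visible; I'm not quoting numbers here since they depend heavily on the machine and on n.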
