You could. HybridArrays may give you the best of both worlds if you make them N x 3. A slice `A[n,:]` should still return an `SVector`. If everything inlines, it should still be able to SIMD across loop iterations, while giving you the convenience of expressing some operations on the vectors instead of making everything loops.
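To make the N x 3 idea concrete, here is a minimal sketch, assuming HybridArrays.jl and StaticArrays.jl are installed. The helper `row_norms2` and the variable names are made up for illustration, and whether the loop actually vectorizes depends on everything inlining, so check with `@code_llvm` or a benchmark rather than taking it on faith.

```julia
using HybridArrays, StaticArrays

# N x 3 array: the first dimension is dynamic, the second is statically sized.
N = 8
M = rand(N, 3)
A = HybridArray{Tuple{StaticArrays.Dynamic(),3}}(M)

# Hypothetical helper: squared norm of each row, written as an operation
# on a small static vector instead of an explicit inner loop over columns.
function row_norms2(A)
    out = Vector{eltype(A)}(undef, size(A, 1))
    @inbounds for n in 1:size(A, 1)
        v = A[n, :]            # slicing the static dimension gives an SVector
        out[n] = sum(abs2, v)  # length is known at compile time
    end
    return out
end

r = row_norms2(A)
```

The point is only that the loop body reads like vector code, while the memory layout stays N x 3.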
Another factor to consider is unrolling. In the example with N x 4 vs 4 x N, the N x 4 case will SIMD and be 4x unrolled (one per column). The 4 x N will only do a single operation per loop iteration.
Theoretically, when you don’t have dependencies (e.g., `s += x[i]`, where each iteration depends on the previous), your CPU should be able to execute different loop iterations in parallel via out-of-order execution + speculative execution, but in practice, I normally find some unrolling tends to help.
Maybe it’s because of better out-of-order execution, or maybe it’s because of a higher density of relevant instructions relative to overhead like incrementing and checking loop counters.
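As a rough illustration of what unrolling buys you when there is a dependency like `s += x[i]`, here is a sketch of a manually 4x unrolled sum with four independent accumulators. The function name `sum_unrolled4` is made up for this example. Note it changes the order of the floating-point additions, which is why the compiler won't do this on its own without something like `@simd`.

```julia
# Hypothetical helper: sum with four independent accumulators, so each
# addition chain only depends on its own previous value.
function sum_unrolled4(x::AbstractVector{T}) where {T}
    s1 = s2 = s3 = s4 = zero(T)
    i = firstindex(x)
    last4 = lastindex(x) - 3
    # Main loop: four elements per iteration, one loop-counter update.
    @inbounds while i <= last4
        s1 += x[i]
        s2 += x[i+1]
        s3 += x[i+2]
        s4 += x[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    # Remainder loop for lengths that are not a multiple of 4.
    @inbounds while i <= lastindex(x)
        s += x[i]
        i += 1
    end
    return s
end

x = collect(1.0:10.0)
total = sum_unrolled4(x)
```

Compared with a plain `for` loop over `s += x[i]`, each iteration here does four useful additions per counter increment and branch, and the four chains can overlap in the pipeline.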