You can:
- use column-major order, as @kristoffer.carlsson suggested (to me this is the best suggestion so far, no offense to the others)
- use LoopVectorization
- use KernelAbstractions (for GPU, which would have nudged you toward column-major order anyway)
- check type stability and use Float32
In the end, this is going to blow like a breeze (I could not resist this last one).
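
To make the column-major and type-stability points concrete, here is a rough sketch in plain Julia. The function names and the matrix size are made up for illustration, and `@inbounds @simd` from Base only stands in for the kind of loop you would hand to LoopVectorization; it is a sketch under those assumptions, not a benchmark of your code.

```julia
# Hypothetical example: sum of squares over a matrix, traversed two ways.
# Julia arrays are column major, so A[i, j] and A[i+1, j] are adjacent in memory.

function sumsq_colmajor(A::AbstractMatrix{T}) where {T}
    s = zero(T)                      # accumulator has the element type (type stable)
    @inbounds for j in axes(A, 2)    # columns in the outer loop
        @simd for i in axes(A, 1)    # rows in the inner loop: contiguous access
            s += A[i, j]^2
        end
    end
    return s
end

function sumsq_rowmajor(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @inbounds for i in axes(A, 1)    # rows in the outer loop
        for j in axes(A, 2)          # columns in the inner loop: strided access
            s += A[i, j]^2
        end
    end
    return s
end

A = rand(Float32, 2000, 2000)        # Float32 halves memory traffic compared to Float64

sumsq_colmajor(A); sumsq_rowmajor(A) # first calls compile, so do not time these
@time sumsq_colmajor(A)
@time sumsq_rowmajor(A)
```

In the REPL, `@code_warntype sumsq_colmajor(A)` is a quick way to check type stability: the accumulator should show up as a concrete `Float32`. The actual timings will depend on your machine and matrix size, so measure before drawing conclusions.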