Optimizing performance

You can:

  • use column-major access, as @kristoffer.carlsson suggested (to me this is the best suggestion so far, no offense to the others)
  • use LoopVectorization
  • use KernelAbstractions (for the GPU, which would also have hinted that you should use column-major access)
  • check type stability and use Float32
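
To make the first and last points concrete, here is a minimal sketch in plain Julia (no packages). The function names `sum_colmajor` and `sum_rowmajor` and the matrix size are made up for illustration; they are not from the original code in this thread.

```julia
# Julia arrays are stored column major, so the row index should vary fastest.

# Column-major traversal: the inner loop walks down a column (contiguous memory).
function sum_colmajor(A::AbstractMatrix{T}) where {T}
    s = zero(T)                  # accumulator has the element type, so the loop stays type stable
    for j in axes(A, 2)          # columns in the outer loop
        for i in axes(A, 1)      # rows in the inner loop
            @inbounds s += A[i, j]
        end
    end
    return s
end

# Row-major traversal: same result, but the inner loop jumps across columns (strided memory).
function sum_rowmajor(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    for i in axes(A, 1)          # rows in the outer loop
        for j in axes(A, 2)      # columns in the inner loop
            @inbounds s += A[i, j]
        end
    end
    return s
end

# Float32 halves the memory traffic compared with Float64 (illustrative size).
A = ones(Float32, 1000, 1000)

sum_colmajor(A)                  # warm up (compile) before timing
sum_rowmajor(A)
@time sum_colmajor(A)
@time sum_rowmajor(A)
```

On a large enough matrix the column-major version should usually be the faster one, but measure on your own data. To check type stability, run `@code_warntype sum_colmajor(A)` in the REPL and look for `Any` or `Union` types in the output. If you add LoopVectorization, its `@turbo` macro would be the thing to try on the inner loop, though I have not benchmarked it here.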

In the end, this is gonna blow by like a breeze (I could not refrain from this last one).
