Array indexing/slicing is slow

Allocate this once, outside the function, and pass it in as a parameter. One allocation per thread is enough.
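
A rough sketch of the pattern, in Python purely for illustration. The names `window_sums`, `process_all`, and `scratch` are hypothetical and not taken from the code under review, and the windowed sum is just a stand-in workload. The point is only that the buffer is created once per worker and handed in, instead of being created (for example by slicing) on every call.

```python
from concurrent.futures import ThreadPoolExecutor


def window_sums(data, width, scratch):
    """Sum every `width`-long window of `data`, using `scratch` as workspace.

    `scratch` is a caller-owned list of length `width`. It is overwritten on
    each window, so no temporary list is built per window (a slice such as
    data[start:start + width] would create a new list every time).
    """
    out = []
    for start in range(len(data) - width + 1):
        for offset in range(width):
            scratch[offset] = data[start + offset]  # fill the reused buffer
        out.append(sum(scratch))
    return out


def process_all(chunks, width, n_workers=2):
    """Process every chunk, giving each worker its own preallocated buffer."""
    # Allocated once, up front: one scratch buffer per worker.
    buffers = [[0] * width for _ in range(n_workers)]
    results = [None] * len(chunks)

    def worker(wid):
        scratch = buffers[wid]  # this worker's private buffer, never shared
        # Worker `wid` handles chunks wid, wid + n_workers, wid + 2*n_workers, ...
        for idx in range(wid, len(chunks), n_workers):
            results[idx] = window_sums(chunks[idx], width, scratch)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for all workers and re-raises any exception they hit.
        list(pool.map(worker, range(n_workers)))
    return results
```

Because each worker only ever touches its own entry in `buffers`, no locking is needed around the scratch space. How much this saves depends on the language and runtime of the real code, so treat it as a shape to follow rather than a measured result.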

(Also invert the order of the loops: run over i first, then j, then k.)