For an out-of-place transpose like this, neither loop order gives good cache-line utilization, since one of the two arrays is always accessed with a large stride. Instead, you generally want to “tile” (block) the loops, either with a tile size tuned to your cache or with a cache-oblivious algorithm.
(Optimizing transposition is a heavily studied problem, with a fair amount of literature and code out there if you search.)
See also e.g. “Function on matrix transpose and performance” (post #4 by stevengj) and the links in that thread.
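
To make the two options mentioned above concrete, here is a minimal sketch in C, assuming contiguous row-major arrays of `double`. The function names and the `TILE` and `BASE` constants are illustrative assumptions, not tuned values or anyone's library API; a serious implementation would benchmark the sizes and probably vectorize the inner kernel.

```c
#include <stddef.h>

/* Hypothetical sizes: TILE should be chosen so that two TILE x TILE blocks
   fit in cache; BASE is the recursion cutoff for the cache-oblivious version. */
#define TILE 32
#define BASE 16

/* Tiled (blocked) out-of-place transpose.
   src is rows x cols, dst is cols x rows, both row-major. */
void transpose_tiled(const double *src, double *dst, size_t rows, size_t cols)
{
    for (size_t ib = 0; ib < rows; ib += TILE) {
        size_t imax = ib + TILE < rows ? ib + TILE : rows;
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t jmax = jb + TILE < cols ? jb + TILE : cols;
            /* Within one tile, both src and dst touch only a few cache lines. */
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}

/* Cache-oblivious variant: recursively halve the larger dimension of the
   index range [i0, i1) x [j0, j1) until the block is small, then copy it. */
static void transpose_rec(const double *src, double *dst,
                          size_t rows, size_t cols,
                          size_t i0, size_t i1, size_t j0, size_t j1)
{
    size_t di = i1 - i0, dj = j1 - j0;
    if (di <= BASE && dj <= BASE) {
        for (size_t i = i0; i < i1; ++i)
            for (size_t j = j0; j < j1; ++j)
                dst[j * rows + i] = src[i * cols + j];
    } else if (di >= dj) {
        size_t im = i0 + di / 2;
        transpose_rec(src, dst, rows, cols, i0, im, j0, j1);
        transpose_rec(src, dst, rows, cols, im, i1, j0, j1);
    } else {
        size_t jm = j0 + dj / 2;
        transpose_rec(src, dst, rows, cols, i0, i1, j0, jm);
        transpose_rec(src, dst, rows, cols, i0, i1, jm, j1);
    }
}

void transpose_oblivious(const double *src, double *dst, size_t rows, size_t cols)
{
    transpose_rec(src, dst, rows, cols, 0, rows, 0, cols);
}
```

The tiled version needs `TILE` matched to a particular cache level, whereas the recursive version eventually produces blocks that fit in every level of the hierarchy without knowing the cache sizes; `BASE` only limits function-call overhead.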