Allocation / performance for 2d array sorts

I have not examined the code samples in depth, but this may be related to problems I ran into while re-implementing a heuristic that consisted mostly of applying sortperm!. To get the best performance, I had to gut sort! and implement a hacky _swap_permute!. This came up recently in this Discourse thread, where I link to my repository and the code in question.
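
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of loop where this matters: one index buffer is reused across sortperm! calls on the columns of a matrix. This is not the code from my repository; `order_columns!` is a hypothetical name used only for illustration, and Base's permute! stands in for a hand-written swap-based permute. Depending on the Julia version, permute! may allocate a temporary internally, so treat this as a starting point for measuring allocations rather than an allocation-free solution.

```julia
# Hypothetical helper (name invented for this sketch): sort every column of A
# in place, reusing a caller-supplied index buffer `ix`.
function order_columns!(A::AbstractMatrix, ix::Vector{Int})
    for j in axes(A, 2)
        col = view(A, :, j)   # no copy of the column
        sortperm!(ix, col)    # write the sorting permutation of `col` into `ix`
        permute!(col, ix)     # reorder `col` in place according to `ix`
    end
    return A
end

A = [3 9; 1 7; 2 8]
ix = Vector{Int}(undef, size(A, 1))   # allocated once, reused for every column
order_columns!(A, ix)
```

Wrapping a call such as `order_columns!(A, ix)` in `@allocated` after a warm-up run is one way to check how much each step still allocates on a given Julia version.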