Note that over-optimizing permutations may cause damage to your sanity, I know because I had to do it for maximum efficiency to replicate a literature heuristic.
The optimized heuristic code is here. Basically, for each Vector
that needs to be permuted frequently, I allocate a copy a single time, and then in the literal million iterations following, I copy-permute the original vector to the copy, and then swap who is the copy and who is the original in the loop scope.
I also had to gut the sort
internals, it was not pretty. But I got rid of any allocations and my final code was six times faster. I just had to keep the old code to be able to check if the results were the same, or I introduced any bugs.