Fastest way to permute Array, given some permutation

I wonder what the fastest way is to apply a permutation to an array. Three approaches that I checked are:

julia> A = rand(10000);

julia> permutation = sortperm(A);

julia> @btime $A[$permutation];
  11.700 μs (3 allocations: 78.22 KiB)

julia> @btime $A .= $A[$permutation];
  13.200 μs (3 allocations: 78.22 KiB)

julia> @btime permute!($A,$permutation);
  42.000 μs (2 allocations: 78.20 KiB)

None of these methods seem to be inplace (for the first one it is not surprising), I did however not expect the first approach to be the fastest. Any ideas what is causing this?

The permute! function acts in-place in A, but requires the allocation of an auxiliary index array. You can do it fully in-place with the undocumented Base.permute!! function, which also overwrites the permutation array (so that you can only call it once).

@btime Base.permute!!($A,p) setup=(p = copy($permutation)) evals=1;

However, it only saves about 15% on my machine.

The basic issue here, I think, is that permuting an array in-place is generally a tricky problem, involving chasing the cycles of the permutation, that takes a bunch of bookkeeping and has poor memory-access patterns for cache-line utilization. Simply making an out-of-place copy via A[permutation] is much more straightforward (and involves more consecutive memory access) and hence faster.

8 Likes

Okay thank you, then I will not feel too bad for allocating this extra memory :grinning:

Note that you gain another 20% by doing

julia> @btime @inbounds $A[$permutation];
  9.356 μs (2 allocations: 78.20 KiB)
3 Likes

Good point, that’s true!

Note that over-optimizing permutations may cause damage to your sanity, I know because I had to do it for maximum efficiency to replicate a literature heuristic.

The optimized heuristic code is here. Basically, for each Vector that needs to be permuted frequently, I allocate a copy a single time, and then in the literal million iterations following, I copy-permute the original vector to the copy, and then swap who is the copy and who is the original in the loop scope.

I also had to gut the sort internals, it was not pretty. But I got rid of any allocations and my final code was six times faster. I just had to keep the old code to be able to check if the results were the same, or I introduced any bugs.

4 Likes

This indeed looks a bit tedious :smiley:

But for a 6x speed-up it might be worth it ^^

It become so much faster only because the code stopped allocating anything this way, and it is a code that literally runs at least one million iterations before stopping, so even one allocation inside the loop caused considerable slowdown, as the rest of the loop is just permutation of small vectors.

Yes, fortunately for me, this permutation line is only called a few times, so I don’t have to optimize too much. The reason for creating this thread was more out of curiosity and because I was suprised that Base.permute! is slower :slight_smile:

Note that you can do this in-place on preallocated output using B .= getindex.(Ref(A), permutation). For example:

julia> @btime $A[$permutation];
  18.972 μs (2 allocations: 78.20 KiB)

julia> B = similar(A);

julia> @btime $B .= getindex.(Ref($A), $permutation);
  7.601 μs (0 allocations: 0 bytes)

It’s a good thought exercise to make sure you understand why A .= getindex.(Ref(A), permutation) would not work (it executes, but produces the wrong A in general).

1 Like

Yes indeed, I originally thought that this is what

would be doing (similar to x, y = y, x) until I realized that this line just creates a permuted copy of A and then replaces the array element-by element (which in hindsight seems obvious).

Also interesting that

does not seem to benefit from @inbounds :slight_smile:

Would have been nice if permute! took an optional additional array to use instead of always allocating it (or overwriting the original in the case of permute!!). I’ve a case where I repeatedly permute not-too-long arrays and this de/allocation each time can be a performance issue. Is this something that might be considered as a feature for the permute! function?

You could pass copyto!(pcopy, p) to permute!! if you want to preserve the original permutation array while using a preallocated buffer.

1 Like

@stevengj - right, I should have though of that. This way I’d be able to apply the permutation as many times as I want w/o any allocations. Thanks!