# Product of permutations without allocation

Hello,

I’m trying to write code that computes the product of permutations, as defined here, for each column of a matrix. Here are three MWEs:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= B[view(A, :, n - 1), n]
    end
    return A
end
```
```julia
function apply_perm2!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end], m in axes(B, 1)
        A[m, n] = B[A[m, n-1], n]
    end
    return A
end
```
```julia
function apply_perm3!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[1, n] = B[A[1, n-1], n]
        A[2, n] = B[A[2, n-1], n]
        A[3, n] = B[A[3, n-1], n]
    end
    return A
end
```

All three functions yield the same output, but a quick benchmark shows very different performance:

```julia
using Random, BenchmarkTools

M = 3
A = Matrix{Int}(undef, M, 1000)
for n in axes(A, 2)
    A[:, n] .= randperm(M)
end
B = deepcopy(A)
@btime apply_perm1!($A, $B) # 27.083 μs (999 allocations: 78.05 KiB)
@btime apply_perm2!($A, $B) # 5.875 μs (0 allocations: 0 bytes)
@btime apply_perm3!($A, $B) # 3.083 μs (0 allocations: 0 bytes)
```

As far as I understand, `apply_perm1!` allocates a copy of the indexed column of `B` on every iteration, which I don’t know how to avoid (writing it as `view(B, view(A, :, n - 1), n)` allocates even more and becomes 50× slower). `apply_perm3!` is also not an option, since in my actual project the size of these matrices varies.
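For reference on where the copies come from: indexing with a vector (or view) of indices always materializes a fresh array in Julia, while elementwise reads do not. A tiny made-up example:

```julia
# Small stand-in for one loop iteration of apply_perm1! (made-up 3x4 matrix).
B = reshape(collect(1:12), 3, 4)
inds = [2, 1, 3]                 # a permutation of the rows

c = B[inds, 2]                   # "fancy" indexing: materializes a new Vector
@assert c == [5, 4, 6]
@assert c isa Vector{Int}        # a fresh copy, not a view into B

# The elementwise equivalent reads B in place, with no temporary:
d = similar(c)
for (i, j) in enumerate(inds)
    d[i] = B[j, 2]
end
@assert d == c
```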


Bracket indexing in Julia is lowered to the `getindex` function, so you can avoid the allocation in `apply_perm1!` by broadcasting `getindex` directly, like this:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end
```

This has the same performance as `apply_perm2!`, but it is not as elegant.
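For completeness, a self-contained check of that variant on a tiny 2×3 example (`apply_perm1_nocopy!` is just a name I made up for it here):

```julia
# Hypothetical name for the broadcasted-getindex variant above.
function apply_perm1_nocopy!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]      # each column is a permutation of 1:2
A = copy(B)
apply_perm1_nocopy!(A, B)
@assert A == [1 2 1;
              2 1 2]   # column 3 is B[:, 3] applied to A[:, 2]
```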


Two questions:

1. Do you need the intermediate permutations, or just the final product of all the permutations?
2. Is the size-3 permutation fixed, or some other size fixed, so that it can be baked in at compile time to improve speed (using StaticArrays)?
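In case the fixed-size route is viable, here is a dependency-free sketch of what a compile-time size buys, with columns stored as `NTuple{M,Int}` standing in for StaticArrays’ `SVector` (`apply_perm_tuple!` is a hypothetical name; the `Val(M)` lets the compiler unroll the inner loop, much like the hand-written `apply_perm3!`):

```julia
# Sketch assuming the row count M is known at compile time.
function apply_perm_tuple!(A::Vector{NTuple{M,Int}}, B::Vector{NTuple{M,Int}}) where {M}
    for n in 2:length(B)
        prev = A[n-1]
        # compose: new column i is B[n] applied to the accumulated permutation
        A[n] = ntuple(i -> B[n][prev[i]], Val(M))
    end
    return A
end

B = [(1, 2, 3), (3, 1, 2)]
A = copy(B)
A[1] = (2, 1, 3)            # arbitrary first column
apply_perm_tuple!(A, B)
@assert A[2] == (1, 3, 2)   # B[2] applied elementwise to A[1]
```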

I wonder why this triples the allocations; the left and right sides by themselves wouldn’t allocate. At least when `view` (a lazy subset `getindex`) doesn’t work out, you can fall back on a verbose elementwise `getindex` in a fused loop, though in this case it was cleaner to add that layer to the existing `for` loop, as in `apply_perm2!`.

EDIT: A version with `view` shows (2997 allocations: 22.99 MiB), which is 24131 bytes per iteration across 3 allocations. That is really close to the size of `A` or `B`, suggesting there is an intermediate somewhere in `materialize!` copying the entirety of one of them. This isn’t the effect of matrix views with linear slices allocating a `Vector` pointing into the matrix’s buffer instead of copying it; that crops up on Discourse once in a while, but those allocations are too small to explain this.

Thanks for this suggestion! I had attempted to use `getindex` but couldn’t get it to work myself. In principle, either your suggestion or `apply_perm2!` is sufficient for me to move on, but I still wonder whether `apply_perm3!` can be generalized.

All the permutations are needed, as I plan to use `A` and `B` later as indices into other matrices.

Isn’t StaticArrays only preferred when there are no more than about 100 elements? Currently in my code, the matrices `A` and `B` have a fixed number of columns, but the number of rows varies.

I found this page, where a similar issue is briefly discussed, but I cannot find a “solution”.


So the `3` is not the fixed number, but the `1000` is fixed?

That is correct.

That could explain the whole phenomenon; forget what I said about `materialize!`. The answer there was worded a bit confusingly, but the point is that while the parent array may be reused efficiently, the indices are copied, as with `collect`. In your case, recomputing `view(A, :, n - 1)` copies 3 elements; repeating that 999 times almost reaches the size of `A`, and there is extra memory on top because each array carries its own overhead. It wasn’t explained why the copying must happen, and that post addresses copying of `Vector`s, not `view`s, so it still seems like an open question whether `view` indices can be computed lazily instead.

I suppose for now it’s simplest to treat “some types of indices are copied” as a limitation, but it makes me wonder whether there could be a version of `@views` that turns the bracket syntax in `A[:, n] .= B[A[:, n - 1], n]` into elementwise `getindex` calls instead of `view` calls, at least on the right-hand side. I have no idea how the left-hand expression could be handled; that uses the built-in `view`-like `Base.dotview` to specify the elements to write to.
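For the right-hand side, that rewrite can at least be done by hand today: broadcasting `getindex` over a `view` of the index column reads the indices elementwise. A sketch with a made-up name (whether it actually avoids the copy depends on broadcast’s alias analysis, so it’s worth benchmarking; the result is correct either way):

```julia
function apply_perm_bc!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        # B[A[:, n-1], n] spelled as elementwise getindex over a lazy view
        A[:, n] .= getindex.(Ref(B), view(A, :, n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]
A = copy(B)
apply_perm_bc!(A, B)
@assert A == [1 2 1;
              2 1 2]
```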

```julia
function apply_perm4!(A::Matrix{Int}, B::Matrix{Int})
    size(A) == size(B) || error("A and B must share size")
    @views A[:, 1] == B[:, 1] || error("A and B share first column??")
    foldl((n, (d, s)) -> map!(Base.Fix1(getindex, s), d, n),
          zip(eachcol(A), eachcol(B));
          init = Int[1:size(B, 1)...])
    return A
end
```

gets to about the same speed as `apply_perm3!`, but it is also much trickier to understand than a nested-`for`-loop implementation. It can provide entertainment for future readers of the code.
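To unpack the trickiest line: the `foldl` accumulator is the most recently written column, and `map!(Base.Fix1(getindex, s), d, n)` writes `d[i] = s[n[i]]`, i.e. it composes the accumulated permutation with one column of `B`. In isolation:

```julia
s = [3, 1, 2]    # one column of B (a permutation)
n = [2, 1, 3]    # the accumulated permutation so far
d = similar(n)   # the destination column of A

map!(Base.Fix1(getindex, s), d, n)   # d[i] = s[n[i]]
@assert d == [1, 3, 2]
```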

It doesn’t depend on a hard-coded `M = 3`. But really, a nested `for` loop seems to be the best idea:

```julia
function apply_perm5!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    # note: to match `apply_perm4!`, A's first column needs to match B's
    for n in axes(B, 2)[2:end], i in axes(B, 1)
        A[i, n] = B[A[i, n-1], n]
    end
    return A
end
```

Understanding this piece of code is indeed some “mental weightlifting”. Although on my desktop `apply_perm4!` is still slower than `apply_perm3!`, it does run a bit faster than the nested-loop-based `apply_perm2!`.

`apply_perm4!` is also quite inspiring, as it is the only implementation so far that can work on a GPU with a slight modification (e.g. changing `init=Int[1:size(B,1)...]` to `init=CuArray(Int[1:size(B,1)...])`); all the other implementations involve scalar indexing. Working directly with `CuArray`s `A` and `B` was my initial goal, as `A` will later be used as indices into another `CuArray`.

I totally agree with you that a nested `for` loop currently seems to be the best overall approach, but I will mark your `apply_perm4!` as the solution for how eye-opening it is.