# Product of permutations without allocation

Hello,

I’m trying to write code that computes the product of permutations, as defined here, for each column of a matrix. Here are three MWEs:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= B[view(A, :, n - 1), n]
    end
    return A
end
```
```julia
function apply_perm2!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end], m in axes(B, 1)
        A[m, n] = B[A[m, n-1], n]
    end
    return A
end
```
```julia
function apply_perm3!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[1, n] = B[A[1, n-1], n]
        A[2, n] = B[A[2, n-1], n]
        A[3, n] = B[A[3, n-1], n]
    end
    return A
end
```

All three functions yield the same output, but a quick benchmark shows very different performance:

```julia
using Random, BenchmarkTools

M = 3
A = Matrix{Int}(undef, M, 1000)
for n in axes(A, 2)
    A[:, n] .= randperm(M)
end
B = deepcopy(A)
@btime apply_perm1!($A, $B) # 27.083 μs (999 allocations: 78.05 KiB)
@btime apply_perm2!($A, $B) # 5.875 μs (0 allocations: 0 bytes)
@btime apply_perm3!($A, $B) # 3.083 μs (0 allocations: 0 bytes)
```

As far as I understand, `apply_perm1!` allocates a copy of the indexed column of `B` on every iteration, which I don’t know how to avoid (writing it as `view(B, view(A, :, n - 1), n)` allocates even more and becomes 50× slower). `apply_perm3!` is also not an option, since in my actual project the size of these matrices varies.
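For reference on where the copies come from: indexing with a vector (or view) of indices always materializes a fresh array in Julia, while elementwise reads do not. A tiny made-up example:

```julia
# Small stand-in for one loop iteration of apply_perm1! (made-up 3x4 matrix).
B = reshape(collect(1:12), 3, 4)
inds = [2, 1, 3]                 # a permutation of the rows

c = B[inds, 2]                   # "fancy" indexing: materializes a new Vector
@assert c == [5, 4, 6]
@assert c isa Vector{Int}        # a fresh copy, not a view into B

# The elementwise equivalent reads B in place, with no temporary:
d = similar(c)
for (i, j) in enumerate(inds)
    d[i] = B[j, 2]
end
@assert d == c
```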


Bracket indexing in Julia is lowered to the `getindex` function, so you can avoid the allocation in `apply_perm1!` by broadcasting `getindex` directly, like this:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end
```

This has the same performance as `apply_perm2!`, but it is not as elegant.
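For completeness, a self-contained check of that variant on a tiny 2×3 example (`apply_perm1_nocopy!` is just a name I made up for it here):

```julia
# Hypothetical name for the broadcasted-getindex variant above.
function apply_perm1_nocopy!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]      # each column is a permutation of 1:2
A = copy(B)
apply_perm1_nocopy!(A, B)
@assert A == [1 2 1;
              2 1 2]   # column 3 is B[:, 3] applied to A[:, 2]
```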


Two questions:

1. Do you need the intermediate permutations, or just the final product of all the permutations?
2. Is the size-3 permutation fixed, or some other size fixed, so that it can be baked in at compile time to improve speed (using StaticArrays)?
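In case the fixed-size route is viable, here is a dependency-free sketch of what a compile-time size buys, with columns stored as `NTuple{M,Int}` standing in for StaticArrays’ `SVector` (`apply_perm_tuple!` is a hypothetical name; the `Val(M)` lets the compiler unroll the inner loop, much like the hand-written `apply_perm3!`):

```julia
# Sketch assuming the row count M is known at compile time.
function apply_perm_tuple!(A::Vector{NTuple{M,Int}}, B::Vector{NTuple{M,Int}}) where {M}
    for n in 2:length(B)
        prev = A[n-1]
        # compose: new column i is B[n] applied to the accumulated permutation
        A[n] = ntuple(i -> B[n][prev[i]], Val(M))
    end
    return A
end

B = [(1, 2, 3), (3, 1, 2)]
A = copy(B)
A[1] = (2, 1, 3)            # arbitrary first column
apply_perm_tuple!(A, B)
@assert A[2] == (1, 3, 2)   # B[2] applied elementwise to A[1]
```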

I wonder why this triples the allocations; the left and right sides by themselves wouldn’t allocate. At least when `view` (a lazy subset `getindex`) doesn’t work out, you can fall back on a verbose elementwise `getindex` in a fused loop, though in this case it was cleaner to add that layer to the existing `for` loop, as in `apply_perm2!`.

EDIT: A version with `view` shows (2997 allocations: 22.99 MiB), which is 24131 bytes per iteration across 3 allocations. That is really close to the size of `A` or `B`, suggesting there is an intermediate somewhere in `materialize!` copying the entirety of one of them. This isn’t the effect of matrix views with linear slices allocating a `Vector` pointing into the matrix’s buffer instead of copying it; that crops up on Discourse once in a while, but those allocations are too small to explain this.

Thanks for this suggestion! I had attempted to use `getindex` but couldn’t get it to work myself. In principle, either your suggestion or `apply_perm2!` is sufficient for me to move on, but I still wonder whether `apply_perm3!` can be generalized.

All the permutations are needed, as I plan to use `A` and `B` later as indices into other matrices.

Isn’t StaticArrays only preferred when there are no more than about 100 elements? Currently in my code, the matrices `A` and `B` have a fixed number of columns, but the number of rows varies.

I found this page, where a similar issue is briefly discussed, but I cannot find a “solution”.


So the `3` is not the fixed number, but the `1000` is fixed?

That is correct.

That could explain the whole phenomenon; forget what I said about `materialize!`. The answer there was worded a bit confusingly, but the point is that while the parent array may be reused efficiently, the indices are copied, as with `collect`. In your case, recomputing `view(A, :, n - 1)` copies 3 elements; repeating that 999 times almost reaches the size of `A`, and there is extra memory on top because each array carries its own overhead. It wasn’t explained why the copying must happen, and that post addresses copying of `Vector`s, not `view`s, so it still seems like an open question whether `view` indices can be computed lazily instead.

I suppose for now it’s simplest to treat “some types of indices are copied” as a limitation, but it makes me wonder whether there could be a version of `@views` that turns the bracket syntax in `A[:, n] .= B[A[:, n - 1], n]` into elementwise `getindex` calls instead of `view` calls, at least on the right-hand side. I have no idea how the left-hand expression could be handled; that uses the built-in `view`-like `Base.dotview` to specify the elements to write to.
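For the right-hand side, that rewrite can at least be done by hand today: broadcasting `getindex` over a `view` of the index column reads the indices elementwise. A sketch with a made-up name (whether it actually avoids the copy depends on broadcast’s alias analysis, so it’s worth benchmarking; the result is correct either way):

```julia
function apply_perm_bc!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        # B[A[:, n-1], n] spelled as elementwise getindex over a lazy view
        A[:, n] .= getindex.(Ref(B), view(A, :, n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]
A = copy(B)
apply_perm_bc!(A, B)
@assert A == [1 2 1;
              2 1 2]
```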

```julia
function apply_perm4!(A::Matrix{Int}, B::Matrix{Int})
    size(A) == size(B) || error("A and B must share size")
    @views A[:, 1] == B[:, 1] || error("A and B share first column??")
    foldl((n, (d, s)) -> map!(Base.Fix1(getindex, s), d, n),
          zip(eachcol(A), eachcol(B));
          init = Int[1:size(B, 1)...])
    return A
end
```

gets to about the same speed as `apply_perm3!`, but it is also much trickier to understand than a nested-`for`-loop implementation. It can provide entertainment for future readers of the code.
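To unpack the trickiest line: the `foldl` accumulator is the most recently written column, and `map!(Base.Fix1(getindex, s), d, n)` writes `d[i] = s[n[i]]`, i.e. it composes the accumulated permutation with one column of `B`. In isolation:

```julia
s = [3, 1, 2]    # one column of B (a permutation)
n = [2, 1, 3]    # the accumulated permutation so far
d = similar(n)   # the destination column of A

map!(Base.Fix1(getindex, s), d, n)   # d[i] = s[n[i]]
@assert d == [1, 3, 2]
```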

It doesn’t depend on a hard-coded `M = 3`. But really, a nested `for` loop seems to be the best idea:

```julia
function apply_perm5!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    # note: to match `apply_perm4!`, A's first column needs to match B's
    for n in axes(B, 2)[2:end], i in axes(B, 1)
        A[i, n] = B[A[i, n-1], n]
    end
    return A
end
```

Understanding this piece of code is indeed some “mental weightlifting”. Although on my desktop `apply_perm4!` is still slower than `apply_perm3!`, it does run a bit faster than the nested-loop-based `apply_perm2!`.

`apply_perm4!` is also quite inspiring, as it is the only implementation so far that can work on a GPU with a slight modification (e.g. changing `init=Int[1:size(B,1)...]` to `init=CuArray(Int[1:size(B,1)...])`); all the other implementations involve scalar indexing. Working directly with `CuArray`s `A` and `B` was my initial goal, as `A` will later be used as indices into another `CuArray`.

I totally agree with you that a nested `for` loop currently seems to be the best overall approach, but I will mark your `apply_perm4!` as the solution for how eye-opening it is.