# Product of permutations without allocation

Hello,

I’m trying to write code that computes the product of permutations, as defined here, for each column of a matrix. Here are three MWEs:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= B[view(A, :, n - 1), n]
    end
    return A
end
```
```julia
function apply_perm2!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end], m in axes(B, 1)
        A[m, n] = B[A[m, n-1], n]
    end
    return A
end
```
```julia
function apply_perm3!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[1, n] = B[A[1, n-1], n]
        A[2, n] = B[A[2, n-1], n]
        A[3, n] = B[A[3, n-1], n]
    end
    return A
end
```

All three functions yield the same output, but a quick benchmark shows very different performance:

```julia
using Random, BenchmarkTools

M = 3
A = Matrix{Int}(undef, M, 1000)
for n in axes(A, 2)
    A[:, n] .= randperm(M)
end
B = deepcopy(A)
@btime apply_perm1!($A, $B) # 27.083 μs (999 allocations: 78.05 KiB)
@btime apply_perm2!($A, $B) # 5.875 μs (0 allocations: 0 bytes)
@btime apply_perm3!($A, $B) # 3.083 μs (0 allocations: 0 bytes)
```

As far as I understand, `apply_perm1!` allocates a copy of the indexed column of `B` on every iteration, which I don’t know how to avoid (writing it as `view(B, view(A, :, n - 1), n)` allocates even more and becomes 50× slower). `apply_perm3!` is also not an option, since in my actual project the size of these matrices varies.
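For reference on where the copies come from: indexing with a vector (or view) of indices always materializes a fresh array in Julia, while elementwise reads do not. A tiny made-up example:

```julia
# Small stand-in for one loop iteration of apply_perm1! (made-up 3x4 matrix).
B = reshape(collect(1:12), 3, 4)
inds = [2, 1, 3]                 # a permutation of the rows

c = B[inds, 2]                   # "fancy" indexing: materializes a new Vector
@assert c == [5, 4, 6]
@assert c isa Vector{Int}        # a fresh copy, not a view into B

# The elementwise equivalent reads B in place, with no temporary:
d = similar(c)
for (i, j) in enumerate(inds)
    d[i] = B[j, 2]
end
@assert d == c
```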


Bracket indexing in Julia is lowered to the `getindex` function, so you can avoid the allocation in `apply_perm1!` by broadcasting `getindex` directly, like this:

```julia
function apply_perm1!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end
```

This has the same performance as `apply_perm2!`, but it is not as elegant.
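For completeness, a self-contained check of that variant on a tiny 2×3 example (`apply_perm1_nocopy!` is just a name I made up for it here):

```julia
# Hypothetical name for the broadcasted-getindex variant above.
function apply_perm1_nocopy!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        A[:, n] .= getindex.((B,), getindex.((A,), 1:size(A, 1), n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]      # each column is a permutation of 1:2
A = copy(B)
apply_perm1_nocopy!(A, B)
@assert A == [1 2 1;
              2 1 2]   # column 3 is B[:, 3] applied to A[:, 2]
```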


Two questions:

1. Do you need the intermediate permutations, or just the final product of all the permutations?
2. Is the size-3 permutation fixed, or some other size fixed, so that it can be baked in at compile time to improve speed (using StaticArrays)?
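In case the fixed-size route is viable, here is a dependency-free sketch of what a compile-time size buys, with columns stored as `NTuple{M,Int}` standing in for StaticArrays’ `SVector` (`apply_perm_tuple!` is a hypothetical name; the `Val(M)` lets the compiler unroll the inner loop, much like the hand-written `apply_perm3!`):

```julia
# Sketch assuming the row count M is known at compile time.
function apply_perm_tuple!(A::Vector{NTuple{M,Int}}, B::Vector{NTuple{M,Int}}) where {M}
    for n in 2:length(B)
        prev = A[n-1]
        # compose: new column i is B[n] applied to the accumulated permutation
        A[n] = ntuple(i -> B[n][prev[i]], Val(M))
    end
    return A
end

B = [(1, 2, 3), (3, 1, 2)]
A = copy(B)
A[1] = (2, 1, 3)            # arbitrary first column
apply_perm_tuple!(A, B)
@assert A[2] == (1, 3, 2)   # B[2] applied elementwise to A[1]
```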

I wonder why this triples the allocations; the left and right sides by themselves wouldn’t allocate. At least when `view` (a lazy subset `getindex`) doesn’t work out, you can fall back on a verbose elementwise `getindex` in a fused loop, though in this case it was cleaner to add that layer to the existing `for` loop, as in `apply_perm2!`.

EDIT: A version with `view` shows (2997 allocations: 22.99 MiB), which is 24131 bytes per iteration across 3 allocations. That is really close to the size of `A` or `B`, suggesting there is an intermediate somewhere in `materialize!` copying the entirety of one of them. This isn’t the effect of matrix views with linear slices allocating a `Vector` pointing into the matrix’s buffer instead of copying it; that crops up on Discourse once in a while, but those allocations are too small to explain this.

Thanks for this suggestion! I had attempted to use `getindex` but couldn’t get it to work myself. In principle, either your suggestion or `apply_perm2!` is sufficient for me to move on, but I still wonder whether `apply_perm3!` can be generalized.

All the permutations are needed, as I plan to use `A` and `B` later as indices into other matrices.

Isn’t StaticArrays only preferred when there are no more than about 100 elements? Currently in my code, the matrices `A` and `B` have a fixed number of columns, but the number of rows varies.

I found this page, where a similar issue is briefly discussed, but I cannot find a “solution”.


So the `3` is not the fixed number, but the `1000` is fixed?

That is correct.

That could explain the whole phenomenon; forget what I said about `materialize!`. The answer there was worded a bit confusingly, but the point is that while the parent array may be reused efficiently, the indices are copied, as with `collect`. In your case, recomputing `view(A, :, n - 1)` copies 3 elements; repeating that 999 times almost reaches the size of `A`, and there is extra memory on top because each array carries its own overhead. It wasn’t explained why the copying must happen, and that post addresses copying of `Vector`s, not `view`s, so it still seems like an open question whether `view` indices can be computed lazily instead.

I suppose for now it’s simplest to treat “some types of indices are copied” as a limitation, but it makes me wonder whether there could be a version of `@views` that turns the bracket syntax in `A[:, n] .= B[A[:, n - 1], n]` into elementwise `getindex` calls instead of `view` calls, at least on the right-hand side. I have no idea how the left-hand expression could be handled; that uses the built-in `view`-like `Base.dotview` to specify the elements to write to.
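For the right-hand side, that rewrite can at least be done by hand today: broadcasting `getindex` over a `view` of the index column reads the indices elementwise. A sketch with a made-up name (whether it actually avoids the copy depends on broadcast’s alias analysis, so it’s worth benchmarking; the result is correct either way):

```julia
function apply_perm_bc!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    for n in axes(B, 2)[2:end]
        # B[A[:, n-1], n] spelled as elementwise getindex over a lazy view
        A[:, n] .= getindex.(Ref(B), view(A, :, n - 1), n)
    end
    return A
end

B = [1 2 2;
     2 1 1]
A = copy(B)
apply_perm_bc!(A, B)
@assert A == [1 2 1;
              2 1 2]
```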

```julia
function apply_perm4!(A::Matrix{Int}, B::Matrix{Int})
    size(A) == size(B) || error("A and B must share size")
    @views A[:, 1] == B[:, 1] || error("A and B share first column??")
    foldl((n, (d, s)) -> map!(Base.Fix1(getindex, s), d, n),
          zip(eachcol(A), eachcol(B));
          init = Int[1:size(B, 1)...])
    return A
end
```

gets to about the same speed as `apply_perm3!`, but it is also much trickier to understand than a nested-`for`-loop implementation. It can provide entertainment for future readers of the code.
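To unpack the trickiest line: the `foldl` accumulator is the most recently written column, and `map!(Base.Fix1(getindex, s), d, n)` writes `d[i] = s[n[i]]`, i.e. it composes the accumulated permutation with one column of `B`. In isolation:

```julia
s = [3, 1, 2]    # one column of B (a permutation)
n = [2, 1, 3]    # the accumulated permutation so far
d = similar(n)   # the destination column of A

map!(Base.Fix1(getindex, s), d, n)   # d[i] = s[n[i]]
@assert d == [1, 3, 2]
```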

It doesn’t depend on a hard-coded `M = 3`. But really, a nested `for` loop seems to be the best idea:

```julia
function apply_perm5!(A::AbstractMatrix{<:Integer}, B::AbstractMatrix{<:Integer})
    # note: to match `apply_perm4!`, A's first column needs to match B's
    for n in axes(B, 2)[2:end], i in axes(B, 1)
        A[i, n] = B[A[i, n-1], n]
    end
    return A
end
```

Understanding this piece of code is indeed some “mental weightlifting”. Although on my desktop `apply_perm4!` is still slower than `apply_perm3!`, it does run a bit faster than the nested-loop-based `apply_perm2!`.

`apply_perm4!` is also quite inspiring, as it is the only implementation so far that can work on a GPU with a slight modification (e.g. changing `init=Int[1:size(B,1)...]` to `init=CuArray(Int[1:size(B,1)...])`); all the other implementations involve scalar indexing. Working directly with `CuArray`s `A` and `B` was my initial goal, as `A` will later be used as indices into another `CuArray`.

I totally agree with you that a nested `for` loop currently seems to be the best overall approach, but I will mark your `apply_perm4!` as the solution for how eye-opening it is.