Terribly unoptimized combination of multiset_permutations() and Iterators.product()

Since I got great feedback on my first post, I turn to you again with another question. In my larger code base I want to iterate over a special kind of binary matrices.
First fact about the binary matrices, they have exactly one 1 in each column, so I modeled them as integer rowVectors, one entry for each column, the integer is equal to the row where the 1 sits.
Second fact, I can split the matrix into blocks, and for each block I know the number of 1s for each row.
(I think it’s not necessary to go into more detail about the mathematical specification of those matrices, but I can if somebody wants to know more, to maybe use a whole different approach).

To iterate over all of them, I gather the vector indicating the row sums for each block in the tuple vBlocks, create an iterator of all multiset_permutations() that fulfill those row sums for each block, and then get all combinations of those using Iterators.product(), the code for this is

using Combinatorics
using BenchmarkTools


function getCBConfigsForPDVTripleOptTest(vecR, vecB)
    vBlocks = (rComp .* vecB for rComp in vecR)
    
    blockIterators = [createBlockIteratorOpt(v) for v in vBlocks]

    for rowVectorBlocks in Iterators.product(blockIterators...)
        # rest of code operates allocation free in this for loop    
    end
end


function createBlockIteratorOpt(rowCounts::Vector{Int})
    n = sum(rowCounts);
    permVector = zeros(Int, n,)
    col = 0;
    for row in eachindex(rowCounts)
        for l in 1:rowCounts[row]
            permVector[col + l] = row;
        end
        col += rowCounts[row];
    end
    
    return multiset_permutations(permVector, length(permVector))
end

The performance is just terrible, in general I look at cases where sum(vecR) and sum(vecB) are maximum 6 but for vecR = [2,1,1] and vecB = [2,1,1] I already get

@btime getCBConfigsForPDVTripleOptTest(vecR, vecB)
  10.180 ms (363617 allocations: 23.11 MiB)

and the allocations just explode for more higher cases, making those cases unable to run. I had an implementation that used binomial choosing of column indices instead of multiset permutations of a row vector, but it was performing even worse (and combinatorially they should be the same)
Once again I really appreciate any help :slight_smile:

Here is a first attempt using a custom implementation of multiset permutations based on Pandita’s algorithm and SmallCollections.jl. There are still allocations, but much fewer:

julia> using Chairmarks

julia> @b getCBConfigsForPDVTripleOptTest($vecR, $vecB)
1.193 ms (323 allocs: 10.391 KiB)
Code
using SmallCollections

const N = 16

const SV = SmallVector{N,Int8}
const MSV = MutableSmallVector{N,Int8}

vecR = SV([2,1,1])
vecB = SV([2,1,1])

# multiset permutations

import Base: iterate, eltype, IteratorSize, SizeUnknown

function swap!(v::AbstractVector, i, j)
    @inbounds v[i], v[j] = v[j], v[i]
    v
end

struct MultiSetPermutations
    v::SV
end

@inline function iterate(mp::MultiSetPermutations)
    @assert issorted(mp.v)
    isempty(mp.v) ? nothing : (mp.v, MSV(mp.v))
end

@inline function iterate(mp::MultiSetPermutations, w::MutableSmallVector)
    # Pandita's algorithm
    u = fixedvector(w)
    d = first(SmallCollections.popfirst(u)) - u
    k = findlast(>(Int8(0)), d)
    k === nothing && return nothing
    @inbounds l = findlast(>(u[k]), u)::Int
    swap!(w, k, l)
    reverse!(w, k+1, length(w))
    SmallVector(w), w
end

eltype(mp::MultiSetPermutations) = SV

IteratorSize(mp::MultiSetPermutations) = SizeUnknown()  # TODO: compute size

# jonbir

using Base.Cartesian: @nloops, @ntuple

@generated function do_work(vecs::Vararg{AbstractVector,N}) where N
    quote
        s = 0
        @nloops($N, w, k -> createBlockIteratorOpt(vecs[k]),
            begin
                # do something
                t = @ntuple $N w
                s += sum(map(first, t))
            end)
        s
    end
end

function getCBConfigsForPDVTripleOptTest(vecR, vecB)
    vBlocks = vecR .* Ref(vecB)
    do_work(vBlocks...)
end

function createBlockIteratorOpt(rowCounts)
    n = sum(rowCounts);
    permVector = zeros(MSV, n)
    col = 0;
    for (row, c) in enumerate(rowCounts)
        fill!(view(permVector, col+1:col+c), row)
        col += c;
    end
    MultiSetPermutations(permVector)
end

Currently all vectors must have length at most N = 16, including the multiset vectors permVector. You may have to increase this for longer vecR and vecB.

If your algorithm is feasible at all for longer vectors, then you certainly need multi-threading because the number of multiset permutations you are looping over grows extremely fast.

2 Likes

Ok, you’re a wizard.
I will try to understand your code, just for now: Does my code go where you wrote # do something, or in getCBConfigsForPDVTripleOptTest after do_work(vBlocks...)?

I felt like Matt Parker (funny maths youtuber), when he wrote code to get all combinations of 5 five-letter words that share no letter, and viewers improved his code by A LOT. When I wrote mine, I always had the feeling that there is the same room for improvement. :smile:

Do you want to know the larger problem that led to my unoptimized code? No worries, I’m not expecting anything, from your writing I just thought that maybe you are curious.

Your code goes to the # do something section. Currently I’m adding up the first elements of all multipermutations (loop variables). You can access the individual loop variables as w_1, w_2 and so on, see @nloops. With @ntuple you can get all loop variables as a tuple.

Feel free to tell me more about your problem, maybe via private messages to avoid derailing this thread.

I suspect part of the slowness and many allocations of your original is from type instability and dynamic dispatch resulting from Iterators.product(blockIterators...) not knowing the length of blockIterators at compile time (because it’s a vector) and so this iterator and every value from it is of unknown type.

This could be a good spot for a function barrier. It might be as simple as replacing this for rowVectorBlocks in Iterators.product(blockIterators...) with foreach(Iterators.product(blockIterators...)) do rowVectorBlocks, with the foreach serving to provide the function barrier.

Using ntuple can be another way around these difficulties, but is probably a little trickier to write.

2 Likes

This eliminates 1/3 of the allocations and cuts run time roughly in half.

1 Like