Terribly unoptimized combination of multiset_permutations() and Iterators.product()

jonbir · July 14, 2025, 7:31pm

Since I got great feedback on my first post, I turn to you again with another question. In my larger code base I want to iterate over a special kind of binary matrices.
First fact about the binary matrices, they have exactly one 1 in each column, so I modeled them as integer rowVectors, one entry for each column, the integer is equal to the row where the 1 sits.
Second fact, I can split the matrix into blocks, and for each block I know the number of 1s for each row.
(I think it’s not necessary to go into more detail about the mathematical specification of those matrices, but I can if somebody wants to know more, to maybe use a whole different approach).

To iterate over all of them, I gather the vector indicating the row sums for each block in the tuple vBlocks, create an iterator of all multiset_permutations() that fulfill those row sums for each block, and then get all combinations of those using Iterators.product(), the code for this is

using Combinatorics
using BenchmarkTools


function getCBConfigsForPDVTripleOptTest(vecR, vecB)
    vBlocks = (rComp .* vecB for rComp in vecR)
    
    blockIterators = [createBlockIteratorOpt(v) for v in vBlocks]

    for rowVectorBlocks in Iterators.product(blockIterators...)
        # rest of code operates allocation free in this for loop    
    end
end


function createBlockIteratorOpt(rowCounts::Vector{Int})
    n = sum(rowCounts);
    permVector = zeros(Int, n,)
    col = 0;
    for row in eachindex(rowCounts)
        for l in 1:rowCounts[row]
            permVector[col + l] = row;
        end
        col += rowCounts[row];
    end
    
    return multiset_permutations(permVector, length(permVector))
end

The performance is just terrible, in general I look at cases where sum(vecR) and sum(vecB) are maximum 6 but for vecR = [2,1,1] and vecB = [2,1,1] I already get

@btime getCBConfigsForPDVTripleOptTest(vecR, vecB)
  10.180 ms (363617 allocations: 23.11 MiB)

and the allocations just explode for more higher cases, making those cases unable to run. I had an implementation that used binomial choosing of column indices instead of multiset permutations of a row vector, but it was performing even worse (and combinatorially they should be the same)
Once again I really appreciate any help

matthias314 · July 14, 2025, 11:36pm

Here is a first attempt using a custom implementation of multiset permutations based on Pandita’s algorithm and SmallCollections.jl. There are still allocations, but much fewer:

julia> using Chairmarks

julia> @b getCBConfigsForPDVTripleOptTest($vecR, $vecB)
1.193 ms (323 allocs: 10.391 KiB)

Code

using SmallCollections

const N = 16

const SV = SmallVector{N,Int8}
const MSV = MutableSmallVector{N,Int8}

vecR = SV([2,1,1])
vecB = SV([2,1,1])

# multiset permutations

import Base: iterate, eltype, IteratorSize, SizeUnknown

function swap!(v::AbstractVector, i, j)
    @inbounds v[i], v[j] = v[j], v[i]
    v
end

struct MultiSetPermutations
    v::SV
end

@inline function iterate(mp::MultiSetPermutations)
    @assert issorted(mp.v)
    isempty(mp.v) ? nothing : (mp.v, MSV(mp.v))
end

@inline function iterate(mp::MultiSetPermutations, w::MutableSmallVector)
    # Pandita's algorithm
    u = fixedvector(w)
    d = first(SmallCollections.popfirst(u)) - u
    k = findlast(>(Int8(0)), d)
    k === nothing && return nothing
    @inbounds l = findlast(>(u[k]), u)::Int
    swap!(w, k, l)
    reverse!(w, k+1, length(w))
    SmallVector(w), w
end

eltype(mp::MultiSetPermutations) = SV

IteratorSize(mp::MultiSetPermutations) = SizeUnknown()  # TODO: compute size

# jonbir

using Base.Cartesian: @nloops, @ntuple

@generated function do_work(vecs::Vararg{AbstractVector,N}) where N
    quote
        s = 0
        @nloops($N, w, k -> createBlockIteratorOpt(vecs[k]),
            begin
                # do something
                t = @ntuple $N w
                s += sum(map(first, t))
            end)
        s
    end
end

function getCBConfigsForPDVTripleOptTest(vecR, vecB)
    vBlocks = vecR .* Ref(vecB)
    do_work(vBlocks...)
end

function createBlockIteratorOpt(rowCounts)
    n = sum(rowCounts);
    permVector = zeros(MSV, n)
    col = 0;
    for (row, c) in enumerate(rowCounts)
        fill!(view(permVector, col+1:col+c), row)
        col += c;
    end
    MultiSetPermutations(permVector)
end

Currently all vectors must have length at most N = 16, including the multiset vectors permVector. You may have to increase this for longer vecR and vecB.

If your algorithm is feasible at all for longer vectors, then you certainly need multi-threading because the number of multiset permutations you are looping over grows extremely fast.

jonbir · July 15, 2025, 9:25am

Ok, you’re a wizard.
I will try to understand your code, just for now: Does my code go where you wrote # do something, or in getCBConfigsForPDVTripleOptTest after do_work(vBlocks...)?

I felt like Matt Parker (funny maths youtuber), when he wrote code to get all combinations of 5 five-letter words that share no letter, and viewers improved his code by A LOT. When I wrote mine, I always had the feeling that there is the same room for improvement.

Do you want to know the larger problem that led to my unoptimized code? No worries, I’m not expecting anything, from your writing I just thought that maybe you are curious.

matthias314 · July 15, 2025, 12:18pm

Your code goes to the # do something section. Currently I’m adding up the first elements of all multipermutations (loop variables). You can access the individual loop variables as w_1, w_2 and so on, see @nloops. With @ntuple you can get all loop variables as a tuple.

Feel free to tell me more about your problem, maybe via private messages to avoid derailing this thread.

mikmoore · July 15, 2025, 2:29pm

I suspect part of the slowness and many allocations of your original is from type instability and dynamic dispatch resulting from Iterators.product(blockIterators...) not knowing the length of blockIterators at compile time (because it’s a vector) and so this iterator and every value from it is of unknown type.

This could be a good spot for a function barrier. It might be as simple as replacing this for rowVectorBlocks in Iterators.product(blockIterators...) with foreach(Iterators.product(blockIterators...)) do rowVectorBlocks, with the foreach serving to provide the function barrier.

Using ntuple can be another way around these difficulties, but is probably a little trickier to write.

matthias314 · July 15, 2025, 3:57pm

This eliminates 1/3 of the allocations and cuts run time roughly in half.

Jean_Michel · July 16, 2025, 1:29pm

Note that Combinatorics is sometimes not very optimized:

julia> @btime collect(Combinatorics.multiset_permutations([1,1,2],3))
  534.354 ns (44 allocations: 1.91 KiB)
3-element Vector{Vector{Int64}}:
 [1, 1, 2]
 [1, 2, 1]
 [2, 1, 1]

while

julia> @btime Combinat.arrangements([1,1,2],3)
  216.401 ns (18 allocations: 736 bytes)
3-element Vector{Vector{Int64}}:
 [1, 1, 2]
 [1, 2, 1]
 [2, 1, 1]

type or paste code here

matthias314 · July 16, 2025, 3:10pm

Combinatorics is sometimes not very optimized

If you only iterate over multiset_permutations instead of collecting the elements, then it’s not so bad for larger vectors:

julia> n = 10;

julia> @b sum(first, Combinatorics.multiset_permutations(1:$n, $n))
382.919 ms (14515255 allocs: 996.682 MiB, 15.39% gc time, without a warmup)

julia> @b sum(first, Combinat.arrangements(1:$n, $n))
943.316 ms (7257636 allocs: 570.243 MiB, 47.96% gc time, without a warmup)

(Combinat.arrangements returns a Vector, so it has to do extra work.)

Jean_Michel · July 27, 2025, 4:47pm

Indeed an iterator is useful. I have added one to Combinat. Now on my laptop:

julia> @btime sum(first, Combinatorics.multiset_permutations(1:10, 10))
  469.434 ms (14515255 allocations: 996.68 MiB)
19958400

julia> @btime sum(first, Combinat.Arrangements(1:10,10))
  187.047 ms (7257610 allocations: 498.34 MiB)
19958400

matthias314 · July 29, 2025, 11:42pm

I have turned the code posted earlier into a new function multiset_permutations for SmallCombinatorics.jl. The speed-up compared to Combinat.jl is less pronounced than for permutations:

julia> using SmallCollections, SmallCombinatorics, Combinatorics, Combinat, Chairmarks

julia> n = 10;

julia> @b sum(first, Combinatorics.multiset_permutations(1:$n, $n))
407.145 ms (14515255 allocs: 996.682 MiB, 26.80% gc time, without a warmup)

julia> @b sum(first, Combinat.Arrangements(1:$n, $n))
133.457 ms (7257610 allocs: 498.341 MiB, 11.75% gc time)

julia> v = SmallVector{16,Int8}(1:n);

julia> @b sum(first, SmallCombinatorics.multiset_permutations($v))
72.570 ms (3 allocs: 80 bytes)

Topic		Replies	Views
Combinatorics.permutations creating a lot of allocations Performance question	6	148	July 14, 2025
Product of permutations without allocation General Usage performance	11	481	December 12, 2023
Allocations when iterating Combinatorics.Combinations General Usage	8	208	December 11, 2023
Nested for loops vs Iterators.product performance Performance question	16	256	September 16, 2024
Minimizing allocations for custom powerset iterator Performance question , memory-allocation , iterators	7	521	April 14, 2022

Terribly unoptimized combination of multiset_permutations() and Iterators.product()

Related topics