Fully Unique Combinations from Multiple Arrays

Hello!

I have a model with n terms and the ability to remove k of them. However each of the k removal options can only act on an arbitrary list of model terms. I want a memory-efficient way to generate every unique combination without any repetitions. For example:

Given vectors a = [1;2;3] and b=[3;4], I would like to get the output of:
[1, 3] [1,4] [2, 3] [2, 4] [3, 4]

I have found that filter!(allunique,unique,(Iterators.product(a,b))) does produce the desired result but I’m worried about the memory consumption since my actual problem involves ~40 terms and 12 removals. Combinatorics.combinations(unique([a;b]),2) gets close but includes [1, 2] in the result. Is there a way to get fitler!() to work the Combinations or skip generating such results in the first place?

Thank you in advance

Hi,

You could probably use a simple for loop (equivalent), like

julia> [(x, y) for y in b, x in a if x != y]
5-element Vector{Tuple{Int64, Int64}}:
 (1, 3)
 (1, 4)
 (2, 3)
 (2, 4)
 (3, 4)

or more generally

julia> as = (a, b);

julia> [x for x in Iterators.product(as...) if allunique(x)]  # (Presumably you could add a couple of (Iterators.)reverse to get the same order as above)
5-element Vector{Tuple{Int64, Int64}}:
 (1, 3)
 (2, 3)
 (1, 4)
 (2, 4)
 (3, 4)

which doesn’t first collect any of the product.

julia> @btime filter!(allunique, unique(Iterators.product($a, $b)));
  204.437 ns (7 allocations: 752 bytes)

julia> @btime [x for x in Iterators.product($as...) if allunique(x)];
  58.460 ns (3 allocations: 224 bytes)

Edit: Or use collect(Iterators.filter(...)) instead of filter!(..., unique(...)):

julia> @btime collect(Iterators.filter(allunique, Iterators.product($as...)))
  46.216 ns (2 allocations: 192 bytes)
5-element Vector{Tuple{Int64, Int64}}:
 (1, 3)
 (2, 3)
 (1, 4)
 (2, 4)
 (3, 4)

(or don’t collect if you don’t have to).


Edit2: I was assuming here that the entries of as (a and b) are already internally unique (all(allunique, as)). If so, then also allunique(Iterators.product(as...)) is satisfied, so unique(Iterators.product(as...)) is equivalent to collect(Iterators.product(as...)), but slower. If not, then uniquifying the as (uas = unique.(as)) will be more efficient than the same for the product (unique(Iterators.product(as...))).

You can also take a look at the package SmallCombinatorics.jl.

I’m not sure that I fully understand the problem. How many lists do you have and how many elements does each list have roughly? And does the order of the elements in the output lists matter? For instance, what output would you get in your example above if b is the same list as a?

EDIT: I now think that given n collections c_1 to c_n you are looking for all sets of size n in the collection of all sets that are of the form \{x_1,\dots,x_n\} with x_1\in c_1, …, x_n\in c_n.

So I have 12 arrays of lengths ranging from 5 to 17 filled with integers ranging from 1 to 41. Each array has only unique elements but some of the elements are shared between arrays. I would like each of the outputs to get 1 element from each array and if there are two elements that only occur in a single array, I don’t want them together. I do not believe order will matter.

The output arrays will be used as indices for solving different sets of linear systems of equations and I’m trying to wasting computational time on combinations that won’t be solvable. I appreciate everyone’s thoughts thus far and hope this helps clarify what I’m hoping for

You should write down simple code that does what you want, albeit slowly, together with small examples, input and output copy-pasted from the REPL.

Then, I’d like to know the output size you desire. (if your output happens to be ~1e15 many combinations, then a memory efficient iteration won’t help you.)

Your explanation is extremely unclear and your purported “definition by implementation” does not work:

julia> a = [1;2;3]; b=[3;4]; filter!(allunique,unique,(Iterators.product(a,b)))
ERROR: MethodError: no method matching filter!(::typeof(allunique), ::typeof(unique), ::Base.Iterators.ProductIterator{Tuple{Vector{Int64}, Vector{Int64}}})
The function `filter!` exists, but no method is defined for this combination of argument types.

An example that I want to see included is a=[1,2];b=[1,2];: Do you want [[1,2], [2,1]] or do you want [[1,2]]?

I apologize, I don’t currently have code that works for my goal, even slowly and for the typo in my initial post. I had meant to type:

julia> a = [1;2;3]; b=[3;4]; filter!(allunique,unique(Iterators.product(a,b)))

However, I’ve realized even this breaks if I introduce c=[1;2;4].
I would want something like the following:

julia> a = [1;2;3]; b=[3;4]; c=[1;2;4]; f(a,b,c)
4-element Vector{Tuple{Int64, Int64, Int64}}:
 (1, 2, 3)
 (1, 3, 4)
 (1, 2, 4)
 (2, 3, 4)
julia> a = [1;2]; b=[1;2]; f(a,b)
1-element Vector{Tuple{Int64,Int64}}:
 (1, 2)

Just using combinations, there are more than 30e6 but my hope is that more robust combinatorics and cleverness can reduce this number. These are the actual arrays for my problem:

a = [3,4,6];
b = [19,20,39,40];
c = [12,17,23];
d = [13,15,16,17,21,24,25];
e = [12,13,15,16,17,25,26];
f = [29,30,38,41];
g = [13,14,15,16,17];
h = [23,24,26,29,41];
i = [23,24,25,29,30];
j = [32,34,35,36,41];
k = [31,32,33,34,35,40,41];
l = [31,32,33,34,36,39,40,41];

Thank you for the continued help and thoughts.

Do you mean something functionally equivalent to

julia> unique(([sort(SVector(x)) for x in Iterators.product([1;2;3], [3;4], [1;2;4]) if allunique(x)]))
4-element Vector{SVector{3, Int64}}:
 [1, 2, 3]
 [1, 2, 4]
 [1, 3, 4]
 [2, 3, 4]

julia> unique(([sort(SVector(x)) for x in Iterators.product([1;2], [1;2]) if allunique(x)]))
1-element Vector{SVector{2, Int64}}:
 [1, 2]

(but more (memory) efficient)?

Yes! Thank you! This produces the desired results for subsets of my problem but is too expensive for my full problem

This is not particularly efficient in terms of runtime since we still first generate all possible combinations, and then only keep the valid ones, but for memory consumption this should be fine:

using StaticArrays

function generate(as)
    N = length(as)
    T = eltype(eltype(as))
    s = Set{SVector{N, T}}()
    _generate!(s, Iterators.product(as...))  
    # Function boundary for type stability 
    # (We will have to recompile for every value of N though, but this is relatively fast)
   return s
end

function _generate!(s::Set{SVector{N, T}}, c) where {N, T}
    for x in c
        allunique(x) || continue
        sx = sort(SVector{N, T}(x))
        push!(s, sx)
    end
    return s
end
julia> as = [a, b, c, d, e, f, g, h, i, j, k, l]  # (or tuple (of (S)Vectors), or ...)
12-element Vector{Vector{Int64}}:
 [3, 4, 6]
 [19, 20, 39, 40]
 [12, 17, 23]
 [13, 15, 16, 17, 21, 24, 25]
 [12, 13, 15, 16, 17, 25, 26]
 [29, 30, 38, 41]
 [13, 14, 15, 16, 17]
 [23, 24, 26, 29, 41]
 [23, 24, 25, 29, 30]
 [32, 34, 35, 36, 41]
 [31, 32, 33, 34, 35, 40, 41]
 [31, 32, 33, 34, 36, 39, 40, 41]

julia> @time generate(as)
  7.380655 seconds (2.21 M allocations: 871.420 MiB, 0.85% gc time, 7.67% compilation time)
Set{SVector{12, Int64}} with 2707644 elements:
  [6, 13, 17, 19, 21, 25, 26, 32, 36, 38, 40, 41]
  [4, 14, 15, 23, 24, 26, 29, 30, 33, 36, 40, 41]
  [6, 15, 20, 23, 25, 26, 29, 30, 31, 33, 35, 38]
  [4, 14, 16, 21, 23, 24, 25, 29, 34, 35, 36, 40]
  [3, 13, 14, 17, 20, 24, 25, 26, 29, 34, 39, 40]
  [4, 12, 16, 17, 20, 21, 23, 30, 32, 36, 38, 41]
  [6, 12, 17, 19, 23, 24, 29, 30, 35, 38, 39, 41]
  [4, 12, 13, 16, 24, 26, 29, 34, 36, 38, 39, 40]
  [6, 12, 14, 16, 25, 26, 29, 30, 31, 35, 39, 41]
  [3, 15, 23, 24, 25, 30, 33, 35, 36, 38, 39, 41]
  [6, 12, 13, 15, 21, 23, 30, 34, 35, 38, 39, 40]
  [4, 13, 14, 15, 17, 19, 23, 25, 32, 36, 38, 40]
  [4, 12, 14, 16, 17, 20, 26, 30, 31, 33, 38, 41]
  [3, 12, 13, 23, 25, 26, 31, 32, 38, 39, 40, 41]
  [3, 13, 15, 20, 21, 23, 25, 26, 31, 33, 35, 41]
  [6, 12, 13, 15, 17, 26, 30, 33, 34, 35, 39, 41]
  [4, 12, 16, 17, 20, 23, 25, 26, 29, 32, 33, 34]
  [3, 14, 15, 19, 21, 23, 25, 32, 33, 34, 38, 41]
  [6, 13, 14, 17, 20, 21, 23, 24, 29, 31, 35, 36]
  [4, 16, 17, 21, 24, 25, 32, 35, 36, 38, 40, 41]
  [3, 12, 16, 17, 20, 23, 24, 29, 33, 34, 38, 41]
  [3, 14, 17, 21, 23, 24, 25, 31, 34, 35, 38, 40]
  [4, 13, 14, 17, 21, 25, 29, 32, 36, 38, 39, 40]
  [6, 12, 14, 16, 17, 20, 25, 29, 34, 35, 38, 41]
  [6, 14, 17, 19, 21, 23, 26, 29, 31, 35, 38, 41]
  ⋮

julia> @time s = generate(as);  # Compilation only adds many small allocations
  6.904371 seconds (68 allocations: 759.835 MiB, 0.95% gc time)

julia> (length(s) * sizeof(eltype(s))) / 2^20  # Minimal size of output, in MiB
247.8922119140625

julia> Base.summarysize(s) / 2^20  # Including Set overhead
388.00011444091797

So we are (in total) allocating about three times what we actually want to output.

I think you should should use an appropriate representation (one such combination is a 64 bit number), and then store the thing and reuse it.

For example:

julia> items = [a,b,c,d,e,f,g,h,i,j,k,l]; 

julia> function gen!(res, abc, idx, current)
       if idx > length(abc) 
           push!(res, current)
           return
       end
       choices = abc[idx]
       for other in choices
           (current & 1<<other) != 0 && continue
           gen!(res, abc, idx+1, current | (1<<other))
       end
       end
julia> function gen(abc)
       res = Set{UInt64}()
       current = UInt(0)
       idx = 1
       gen!(res, abc, idx, current)
       res2 = collect(res)
       sort!(res2)
       res2
       end
julia> sizeof(gen(items))
21661152
julia> @time gen(items);
  0.935008 seconds (64 allocations: 111.821 MiB, 9.18% gc time)

Thats mere 21 megabytes, and runs in < 1 second for me, with ~ 2.7e6 elements. Just keep the resulting thing in a global const.

That’s what the type SmallBitSet from SmallCollections.jl does. Everything looks like sets, and the bit manipulations are under the hood. This may make the code more readable.

using SmallCollections

const S = SmallBitSet{UInt64}

Base.hash(s::S, h::UInt) = fasthash(s, h)  # faster hash (not compatible with `AbstractSet`)

function gen!(res, abc, idx, current)
    if idx > length(abc)
        push!(res, current)
        return
    end
    choices = abc[idx]
    for other in choices
        new = push(current, other)  # add element to set
        new != current && gen!(res, abc, idx+1, new)
    end
end

function gen(abc)
     res = Set{S}()
     gen!(res, abc, 1, S())
     res
end

items = S.([a,b,c,d,e,f,g,h,i,j,k,l])

(I skip the conversion from Set to Vector.) The run time is as before:

julia> @b gen($items)  # foobar_lv2 (with result as a set)
1.453 s (55 allocs: 70.509 MiB, 0.34% gc time, without a warmup)

julia> @b gen($items)  # this version
1.411 s (55 allocs: 70.505 MiB, 0.29% gc time, without a warmup)