Here’s a problem a colleague came to me with; they’ve got an R implementation that takes a day to run something and I said “just re-write it in Julia, I’m sure it’ll take minutes”.

A relatively simple re-write was indeed ~400x faster on my machine, but given the original problem is pretty large, and it’s a fairly short, easily describable problem of the type that many around here love I thought I’d see what Discourse thinks.

The problem is this:

There are P options, p of which can be selected by an individual, for a total of Q = \binom{P}{p} possible selections. We are given n<Q actual selections which have been made, and are looking to answer the question: for each q \in Q, what is the number of actual selections n_i which matches i = 0, 1, ..., p choices in q?

Here’s an example problem and my current implementation:

Problem setup:

```
using Chairmarks, Combinatorics, StaticArrays, StatsBase
P = 22 # Number of options
p = 5 # Number of selected options
Q = binomial(P, p) # Number of possible selections
all_selections_iterator = multiset_combinations(1:P, p) # Iterator over all combinations
# Dummy data for actual selections - assume ~1/4 are actually picked
actual_selections = unique([sort(sample(1:P, p, replace = false)) for _ ∈ 1:round(Int, Q ÷ 5)])
n = length(actual_selections)
```

My solution is to define a binary matrix D which has one row for each actual selection n and P columns, with each column holding a one in a row in which that particular option is chosen for a given selection (row). This is:

```
D = zeros(Int32, n, P)
for (i, c) ∈ pairs(actual_selections)
D[i, c] .= Int32(1)
end
```

Then the function which performs the actual work is this:

```
function all_matches(D, all_selections)
out = zeros(eltype(first(all_selections)), length(all_selections), length(first(all_selections))+1)
i = 0
@inbounds for combination ∈ all_selections
i += 1
for j ∈ 1:size(D, 1)
hits = zero(eltype(out))
for digit ∈ combination
hits += D[j, digit]
end
out[i, hits+1] += one(eltype(out))
end
end
out
end
```

which I think is pretty self explanatory: allocate a matrix `out`

of zeros of size (n \times p), then iterate through each possible selection q \in Q and for each q go through each actual selection n and increment the j-th column of `out`

by one if there are 0\leq j \leq p matching digits in the pair (n,q).

I’ve played around a bit with this and found that to my dismay using the `Combinatorics`

iterator is prohibitively slow (addings millions of allocations and a c. 100x runtime hit). The fastest I’ve come up with is to collect the iterator into a vector of `SVector`

s and run:

```
as_sv32 = [SVector{5}(Int32.(x)) for x ∈ collect(all_selections_iterator)]
@b all_matches($D, $as_sv32)
# 441.267 ms (2 allocs: 617.297 KiB, without a warmup)
```

which again isn’t too shabby and about a 400x speedup over the existing code, but still not quite enough to run it comfortably for larger problem sizes (in reality we can have P up to 60 and p up to 10).

This strikes me as the kind of problem where one might make headway with a number of different approaches (multithreading, things like LoopVectorization, low-level optimizations which I don’t know anything about) but I’m not sure what the most promising avenue for further optimization is - any ideas appreciated!