[Nerdsnipe warning] Speed up short vector comparisons to beat R

Here’s a problem a colleague came to me with; they’ve got an R implementation that takes a day to run something and I said “just re-write it in Julia, I’m sure it’ll take minutes”.

A relatively simple re-write was indeed ~400x faster on my machine, but the original problem is pretty large, and since it’s a fairly short, easily describable problem of the type that many around here love, I thought I’d see what Discourse thinks.

The problem is this:

There are P options, p of which can be selected by an individual, for a total of Q = \binom{P}{p} possible selections. We are given n < Q actual selections which have been made, and are looking to answer the question: for each possible selection q, how many of the actual selections match exactly i choices in q, for i = 0, 1, \ldots, p?
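To make the question concrete, here is a single hand-worked pair (my own toy numbers, not from the original post): the actual selection {1, 2, 3, 6, 7} matches the possible selection q = {1, 2, 3, 4, 5} in exactly i = 3 choices:

julia> length(intersect([1, 2, 3, 4, 5], [1, 2, 3, 6, 7])) # i = 3 matching choices
3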

Here’s an example problem and my current implementation:

Problem setup:

using Chairmarks, Combinatorics, StaticArrays, StatsBase

P = 22 # Number of options
p = 5  # Number of selected options
Q = binomial(P, p) # Number of possible selections

all_selections_iterator = multiset_combinations(1:P, p) # Iterator over all combinations

# Dummy data for actual selections - assume ~1/5 are actually picked
actual_selections = unique([sort(sample(1:P, p, replace = false)) for _ ∈ 1:(Q ÷ 5)])
n = length(actual_selections)

My solution is to define a binary matrix D with one row for each of the n actual selections and P columns, where entry (i, j) is one if option j is chosen in actual selection i. This is:

D = zeros(Int32, n, P)
for (i, c) ∈ pairs(actual_selections)
    D[i, c] .= Int32(1)
end

Then the function which performs the actual work is this:

function all_matches(D, all_selections)
	out = zeros(eltype(first(all_selections)), length(all_selections), length(first(all_selections))+1)

	i = 0
	@inbounds for combination ∈ all_selections
		i += 1
		for j ∈ 1:size(D, 1)
			hits = zero(eltype(out))
			for digit ∈ combination
				hits += D[j, digit]
			end
			out[i, hits+1] += one(eltype(out))
		end
	end

	out
end

which I think is pretty self-explanatory: allocate a matrix out of zeros of size Q \times (p+1), then iterate through each possible selection q and, for each q, go through each actual selection and increment column j+1 of the corresponding row of out by one, where 0 \leq j \leq p is the number of matching digits in the pair.
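Put differently, the inner loop computes the intersection size for each pair, so on small inputs the result can be cross-checked against a naive (much slower) reference along these lines (my own untested sketch):

# naive reference: for every possible selection q and every i, count the
# actual selections sharing exactly i elements with q
all_poss = collect(all_selections_iterator)
ref = [count(s -> length(intersect(q, s)) == i, actual_selections)
       for q in all_poss, i in 0:p]
all_matches(D, all_poss) == ref # expected: true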

I’ve played around a bit with this and found, to my dismay, that using the Combinatorics iterator is prohibitively slow (adding millions of allocations and a c. 100x runtime hit). The fastest I’ve come up with is to collect the iterator into a vector of SVectors and run:

as_sv32 = [SVector{5}(Int32.(x)) for x ∈ collect(all_selections_iterator)]
@b all_matches($D, $as_sv32)
# 441.267 ms (2 allocs: 617.297 KiB, without a warmup)

which again isn’t too shabby and about a 400x speedup over the existing code, but still not quite enough to run it comfortably for larger problem sizes (in reality we can have P up to 60 and p up to 10).
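For a sense of how this scales (my own back-of-envelope, not from the thread): the inner loop body runs once per (possible, actual) pair, so the total work grows like Q \cdot n \cdot p, and Q itself explodes quickly:

julia> binomial(60, 10) # Q at the largest problem size mentioned
75394027566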

This strikes me as the kind of problem where one might make headway with a number of different approaches (multithreading, things like LoopVectorization, low-level optimizations which I don’t know anything about) but I’m not sure what the most promising avenue for further optimization is - any ideas appreciated!


I stole this from one of the stale PRs to Combinatorics.jl. Does this help?

@inline function Base.iterate(c::Combinatorics.Combinations, s = [min(c.t - 1, i) for i in 1:c.t])
    if c.t == 0 # special case to generate 1 result for t==0
        isempty(s) && return (s, [1])
        return
    end
    for i in c.t:-1:1
        s[i] += 1
        if s[i] > (c.n - (c.t - i))
            continue
        end
        for j in i+1:c.t
            s[j] = s[j-1] + 1
        end
        break
    end
    s[1] > c.n - c.t + 1 && return
    (s, s)
end

That doesn’t seem to change anything unfortunately - maybe I’m doing it wrong? I see

 57.974889 seconds (963.13 M allocations: 14.382 GiB, 0.44% gc time, 0.37% compilation time)

after simply evaluating that method definition, or after import Base: iterate and then evaluating the definition again.

If I switch the D matrix’s eltype to Bool I get a ~33% shorter runtime. If P <= 64 we can simply encode the bits into an integer and get another ~40% reduction in runtime:

function to_binary(choices)
    mapreduce(x->UInt(1) << (x-1), |, choices, init=UInt(0))
end

# https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
function next_lexicographic(x::T) where T
    t = x | (x-one(T))
    return (t+one(T)) | (((~t & -~t) - one(T)) >> (trailing_zeros(x) + one(T)))
end

function all_matches2(choices_binary, P,p)
	out = zeros(Int, binomial(P,p), p+1)

    current_selection = (UInt(1) << p) - UInt(1)
	for i in axes(out,1)
		for choice in choices_binary
			out[i, count_ones(current_selection & choice)+1] += one(eltype(out))
		end
        current_selection = next_lexicographic(current_selection)
	end

	out
end
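To illustrate the two helpers with hand-checked values (my own examples): to_binary sets bit x-1 for each chosen option x, and next_lexicographic steps to the next-larger bit pattern with the same popcount.

julia> bitstring(to_binary([1, 3, 5]))[end-7:end] # bits 0, 2 and 4 set
"00010101"

julia> bitstring(next_lexicographic(UInt8(0b00111))) # next 3-bit pattern after {1,2,3}
"00001011"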

Results:

julia> @b all_matches($D, $as_sv32) # eltype(D) == Int32
328.113 ms (2 allocs: 1.206 MiB, without a warmup)

julia> @b all_matches($D, $as_sv32) # eltype(D) == Bool
204.319 ms (2 allocs: 1.206 MiB, without a warmup)

julia> @b all_matches2($(to_binary.(actual_selections)), $P,$p)
120.640 ms (2 allocs: 1.206 MiB)

Note that this gives a different ordering.

Edit: In principle this screams for some threading because each q is essentially independent. However, we generate one selection after the next, which makes it a serial process…
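One possible way around the serial walk (my own untested sketch, not from this post): unrank via the combinatorial number system, so that each thread can start its own next_lexicographic run at an arbitrary position.

# Hypothetical helper: return the rank-th (0-based) p-element bitmask in
# increasing integer order, i.e. the order next_lexicographic walks.
# Requires rank < binomial(P, p).
function unrank_combination(rank::Integer, p::Integer)
    x = UInt(0)
    for i in p:-1:1
        c = i - 1                       # smallest candidate: binomial(i-1, i) == 0
        while binomial(c + 1, i) <= rank
            c += 1
        end
        rank -= binomial(c, i)          # consume this combinadic digit
        x |= UInt(1) << c
    end
    return x
end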


Interesting, thank you - it looks like things here are potentially CPU-dependent?

I’m getting a 3x slowdown when using a BitMatrix for D, and the all_matches2 solution only shaves less than 10% off the runtime (420ms vs 460ms) on my machine.

Multithreading should be easily doable on my original approach where all_selections is materialized, right? Basically just chunk up out and all_selections into n chunks for n threads and combine after? I’m currently trying to work out how to do this with OhMyThreads which I’d been meaning to look at anyway.
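For what it’s worth, here is a minimal untested sketch of that with OhMyThreads (tforeach is the only API assumed); since each iteration writes only to its own row of out, the tasks need no locking:

using OhMyThreads: tforeach

function all_matches_omt(D, all_selections)
    out = zeros(Int, length(all_selections), length(first(all_selections)) + 1)
    tforeach(eachindex(all_selections)) do i
        combination = all_selections[i]
        for j in 1:size(D, 1)
            hits = 0
            for digit in combination
                hits += D[j, digit]
            end
            out[i, hits+1] += 1 # row i is only touched by iteration i
        end
    end
    return out
end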

For P=60 and p=10 there are ~75 billion elements in Q, so the answer here is a ~75B × 11 matrix (one column per match count 0..10), right? Are all of those ~800B numbers useful? (That’s over 3TB as a dense 32-bit integer matrix.) I wonder if there is a different question that is equally useful to your downstream task that can scale better (e.g. in terms of n rather than Q, since n sounds like it will be limited by an actual process in the world).

A BitMatrix is not a Matrix{Bool}! I just changed the construction of D slightly:

D = zeros(Bool, n, P) # note the Bool instead of Int32
for (i, c) ∈ pairs(actual_selections)
    D[i, c] .= true
end

Ah yes of course - with that change it runs at roughly the same speed as the fastest solution for me, i.e. I don’t see any difference.

60/10 is rather extreme, we’d hopefully not have both 60 and 10 at the same time.

But yes essentially the downstream task depends on the entire distribution of each of the columns of the output matrix, so I don’t see a way of shortcutting things, although I agree the whole thing seems rather wasteful at first sight.


To me it is very surprising that the difference is so small between my solution and your code, because my code has substantially fewer memory accesses. Sure, they will all be in the cache, but still…

For a fair comparison you should benchmark all the parts you actually run, like the construction of D and the materialization of as_sv32. Did you consider that as well?
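For instance (my own sketch, using the same Chairmarks @b as above), the setup steps can be timed separately:

julia> @b collect($all_selections_iterator) # materializing the iterator

julia> @b [SVector{5}(Int32.(x)) for x ∈ $(collect(all_selections_iterator))] # SVector conversion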

You could try basic multithreading like this:

function all_matches(D, all_selections)
	out = zeros(Int, length(all_selections), length(first(all_selections))+1)

	@inbounds Threads.@threads for i in eachindex(all_selections)
        combination = all_selections[i]
		for j ∈ 1:size(D, 1)
			hits = zero(eltype(out))
			for digit ∈ combination
				hits += D[j, digit]
			end
			out[i, hits+1] += one(eltype(out))
		end
	end
	out
end

This speeds up the computation for me but the scaling is not very good. I get 153ms with 8 threads vs. 290ms single-threaded.


I think I’d do

julia> function mkbitmatrix(selections)
           n = length(selections)
           P = 64
           # edit: original posting had an error with length(first(selections))
           res = falses(P, n)
           for (i, c) in enumerate(selections)
               res[c, i] .= true
           end
           res
       end

julia> act = mkbitmatrix(actual_selections); poss = mkbitmatrix(all_selections_iterator);

julia> function matchme(actual_selections, possible_selections, p)
           out = zeros(Int32, p+1, size(possible_selections, 2))
           @inbounds for i = 1:length(possible_selections.chunks)
               a = possible_selections.chunks[i]
               for j = 1:length(actual_selections.chunks)
                   b = actual_selections.chunks[j]
                   hits = count_ones(a & b)
                   out[hits+1, i] += 1
               end
           end
           out
       end
matchme (generic function with 3 methods)

julia> @b matchme($act, $poss, $p)
103.208 ms (3 allocs: 617.328 KiB)

julia> function matchmeT(actual_selections, possible_selections, p)
           out = zeros(Int32, p+1, size(possible_selections, 2))
           @inbounds Threads.@threads for i = 1:length(possible_selections.chunks)
               a = possible_selections.chunks[i]
               for j = 1:length(actual_selections.chunks)
                   b = actual_selections.chunks[j]
                   hits = count_ones(a & b)
                   out[hits+1, i] += 1
               end
           end
           out
       end
matchmeT (generic function with 1 method)

julia> @b matchmeT($act, $poss, $p)
12.736 ms (85 allocs: 626.016 KiB)

For comparison:

julia> out = matchme(act, poss, p);
julia> @b all_matches($D, $as_sv32)
118.044 ms (3 allocs: 617.328 KiB)

julia> out2=all_matches(D, as_sv32);

julia> out' == out2
true

julia> sum(out)*1.0
1.2561318e8

So unfortunately we’re not really faster, and the thing takes roughly one nanosecond per increment of out. This ain’t bad, but not great either.
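As a quick sanity check on that figure (my arithmetic): each (possible, actual) pair triggers exactly one increment, so sum(out) = n \cdot Q \approx 1.26 \times 10^8, and 103 ms for 1.26 \times 10^8 increments is roughly 0.8 ns each.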

What I think is going wrong is that we’re bound by latency in the store queue for updates to out. So the appropriate way would be to do something like

julia> function matchme(actual_selections, possible_selections, ::Val{p}) where p
           out = zeros(Int32, p+1, size(possible_selections, 2))
           tmp = MVector{p+1, Int32}(undef)
           @inbounds for i = 1:length(possible_selections.chunks)
               a = possible_selections.chunks[i]
               tmp .= 0
               for j = 1:length(actual_selections.chunks)
                   b = actual_selections.chunks[j]
                   hits = count_ones(a & b)
                   tmp[hits+1] += 1
               end
               out[:, i] .= tmp
           end
           out
       end

i.e. have a temporary accumulator for the hits inside a register instead of hitting memory / the store queue.

The above doesn’t work, unfortunately. You probably need to bit-pack somehow.


This is what I get simply replacing Threads.@threads by Polyester.@batch:

julia> @b all_matches_serial($D, $as_sv32)
100.989 ms (2 allocs: 1.206 MiB)

julia> @b all_matches_threads($D, $as_sv32)
50.541 ms (53 allocs: 1.211 MiB)

julia> @b all_matches_batch($D, $as_sv32)
22.147 ms (2 allocs: 1.206 MiB)

(10 threads here)

And this seems slightly faster than @threads, by chunking the workload and making the access to out more local:

julia> using ChunkSplitters, Base.Threads # Base.Threads for @spawn

julia> function all_matches_chunks(D, all_selections; nchunks=2*Threads.nthreads())
           out = zeros(Int, length(all_selections), length(first(all_selections))+1)
           @sync for inds in chunks(eachindex(all_selections); n = nchunks)
               out_chunk = zeros(Int, length(first(all_selections))+1)
               @spawn for i in inds
                   fill!(out_chunk, 0) # reset per selection; without this, counts accumulate across i
                   combination = all_selections[i]
                   for j ∈ 1:size(D, 1)
                       hits = zero(eltype(out))
                       for digit ∈ combination
                           hits += D[j, digit]
                       end
                       out_chunk[hits+1] += one(eltype(out))
                   end
                   out[i,:] .= out_chunk
               end
           end
           return out
       end
all_matches_chunks (generic function with 1 method)

julia> @b all_matches_chunks($D, $as_sv32; nchunks=4*Threads.nthreads())
33.443 ms (292 allocs: 1.233 MiB)

julia> @b all_matches_chunks($D, $as_sv32; nchunks=10*Threads.nthreads())
27.726 ms (713 allocs: 1.273 MiB)

julia> @b all_matches_chunks($D, $as_sv32; nchunks=100*Threads.nthreads())
25.216 ms (7015 allocs: 1.883 MiB)

Constructing D seems very cheap (1-2 seconds even for substantially larger problems, say P=50, p=6). Materializing the all_selections_iterator is a fair bit slower (I get ~1 min for the above size), but the all_matches bit should have hours of runtime for those larger sizes, so this is still a rounding error I believe.

Muhaha, I found the fast accumulator:

julia> function matchmeX(actual_selections, possible_selections, ::Val{p}) where p
           out = zeros(Int32, p+1, size(possible_selections, 2))
           mask = ~((-1%UInt64) << 10)  # low 10 bits: one packed counter field
           @inbounds for i = 1:length(possible_selections.chunks)
               a = possible_selections.chunks[i]
               # blocks of 1023 iterations so no 10-bit counter can overflow
               for jc = 1:1023:length(actual_selections.chunks)
                   tmp = 0
                   for j = jc:min(jc+1023-1, length(actual_selections.chunks))
                       b = actual_selections.chunks[j]
                       hits = count_ones(a & b)
                       tmp += (1 << ((hits*10) & 63))  # bump the hits-th packed counter
                   end #j
                   for k = 0:p  # spill the packed counters into out
                       out[k+1, i] += (tmp >>> (k*10)) & mask
                   end
               end
           end
           out
       end
julia> @b matchmeX($act, $poss, $Val(p))
32.279 ms (3 allocs: 617.328 KiB)

julia> t= matchmeX(act, poss, Val(p));

julia> t'==out2
true

That’s single-threaded. Multi-threaded becomes

julia> function matchmeXT(actual_selections, possible_selections, ::Val{p}) where p
           out = zeros(Int32, p+1, size(possible_selections, 2))
           mask = ~((-1%UInt64) << 10)
           @inbounds Threads.@threads for i = 1:length(possible_selections.chunks)
               a = possible_selections.chunks[i]
               for jc = 1:1023:length(actual_selections.chunks)
                   tmp = 0
                   for j = jc:min(jc+1023-1, length(actual_selections.chunks))
                       b = actual_selections.chunks[j]
                       hits = count_ones(a & b)
                       tmp += (1 << ((hits*10) & 63))
                   end #j
                   for k = 0:p
                       out[k+1, i] += (tmp >>> (k*10)) & mask
                   end
               end
           end
           out
       end
matchmeXT (generic function with 1 method)

julia> @b matchmeXT($act, $poss, $Val(p))
3.858 ms (85 allocs: 626.016 KiB)

You will need to play around a little with the bit-packing.

PS. For larger p, you will need to accumulate in UInt128 or effectively UInt256. Alas, bit-shifting such large registers is not so fast anymore. Maybe a lookup table indexed by hits? That becomes a vector shuffle. At some point you’ll need to decide whether you target GPU, AVX2, AVX-512, or ARM (and which one…).

For larger actual_selections that don’t fit into L1d anymore, you will need to apply some blocking (similar as matmul).

I think you don’t care about N > 64?


That will take me a while to get to grips with… but thank you!

I’m unfortunately stumbling before I even get there, your first function:

function mkbitmatrix(selections)
    n = length(selections)
    P = length(first(selections))
    res = falses(P,n)
    for (i,c) in enumerate(selections)
        res[c,i] .= true
    end
    res
end

errors for me. This is essentially the construction of my D matrix above, isn’t it? I might misunderstand, but I think P = length(first(selections)) doesn’t work - it gives p instead of P (apologies for the less than ideal variable naming here); P is a parameter of the problem that can’t necessarily be inferred from any set of selections (as they might not include the “last” option).

EDIT: something like this works for me:

function mkbitmatrix(selections, P)
    n = length(selections)
    res = falses(n, P)
    for (i,c) in enumerate(selections)
        res[i, c] .= true
    end
    res
end

(although now I might have messed up the dimensions of the matrix with respect to the later bits of your code)

If by N here you mean P (i.e. the number of total choices that can be made) then yes, that should be 60 at most I believe, and more often in the region of 40-50, with p (number of actual choices selected) between 3 and 6.

Apologies! I somehow copy-pasted the wrong version, fixed above.

I hardcoded P=64 in the mkbitmatrix.

This means that each selection is encoded as UInt64. Otherwise this is the same as the D matrix.

Due to the extra padding, we don’t need to do as much work on counting.

Ideally we’d use 32 bit per selection, not 64 – but that would be slightly annoying and you ultimately care about 32 < P <=64, right?


Yes correct, P will almost always be larger than 32.

This is actually a very interesting point - if I wanted to spec out a machine that basically solves this problem over an over, what would I go for? I wouldn’t have thought of GPUs as the right hardware for this?

Here is another stab at this nerdsnipe:

using Chairmarks, Combinatorics, StatsBase

function list_to_compressed(pred::F, list) where {F<:Function}
    [(foldl((r,e)->(r |= UInt(1)<<(e-1); r), v; init=zero(UInt)),i) for (i,v) in pairs(list) if pred(v)]
end

function list_to_compressedinverted(list, P)
    [list_to_compressed(Base.Fix1(in, i), list) for i in 1:P]
end

@inline function compressed_intersect(val, l1, l2)
    t = l1 & l2
    trailing_zeros(t) < val-1 && return nothing
    return count_ones(t)
end

function shadow_counts(L1, L2, maxelement, maxintersect)
    P, p, n, Q = maxelement, maxintersect, length(L1), length(L2)
    Q1inv = list_to_compressedinverted(L1, P)
    Q2inv = list_to_compressedinverted(L2, P)
    res = zeros(Int, Q, p+1)

    for i in 1:P
        for j in Q1inv[i]
            for k in Q2inv[i]
                s = compressed_intersect(i, j[1], k[1])
                isnothing(s) && continue
                res[k[2], s+1] += 1
            end
        end
    end
    for r in eachrow(res)
        r[1] = n - sum(r)
    end
    res
end

function nerd_problem3(P, p, alpha = 0.20)
    Q = binomial(P, p)
    all_selections_iterator = multiset_combinations(1:P, p)
    actual_selections = unique([sort(sample(1:P, p, replace = false)) for _ ∈ 1:round(Int, alpha*Q)])
    all_selections = collect(all_selections_iterator)
    n = length(actual_selections)
    shadow_counts(actual_selections, all_selections, P, p)
end

with these functions:

julia> nerd_problem3(22, 5, 0.2)
26334×6 Matrix{Int64}:
 22728  2176  1172  244  14  0
 22722  2167  1173  253  19  0
 22751  2113  1224  229  17  0
     ⋮                       ⋮
 22649  2193  1249  223  20  0
 22679  2159  1247  237  11  1
 22676  2148  1253  243  13  1

returns the matrix. Locally, it benchmarked better than some alternatives, but it is best tested on the OP’s machine.

The idea is to use bit fiddling with 64-bit UInts. A crucial bit is the decomposition of the overlapping subsets according to the lowest value of their intersection: since any pair of intersecting sets has such a lowest common element, this yields a partition. We expand each input list into P filtered, bit-compressed lists, which are shorter because they are filtered (P * length(list) entries in total across all of them), and this allows considering only the cross products of the per-element lists for each element in 1:P.

The code might be a better explanation than the above paragraph.
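For instance, a tiny hand-check of the double-counting guard (my own example, using the definitions above):

julia> s1 = UInt(1)<<0 | UInt(1)<<2 # {1, 3} as a bitmask
0x0000000000000005

julia> s2 = UInt(1)<<0 | UInt(1)<<1 | UInt(1)<<2 # {1, 2, 3}
0x0000000000000007

julia> compressed_intersect(1, s1, s2) # counted under the lowest common element 1
2

julia> compressed_intersect(3, s1, s2) # skipped: already counted under element 1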

BTW the algorithm is very parallelizable. Both list_to_compressedinverted and the outer for loop can be multithreaded. For further optimization, the all_selections can be generated with an iterator and never materialized (even the expanded/filtered lists are easy to generate with an iterator).
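A sketch of what threading the outer loop could look like (my own untested code, not the author’s; different i can hit the same cell of res, so each task gets its own accumulator which is merged at the end):

using ChunkSplitters, Base.Threads

function shadow_counts_threaded(L1, L2, maxelement, maxintersect)
    P, p, n = maxelement, maxintersect, length(L1)
    Q1inv = list_to_compressedinverted(L1, P)
    Q2inv = list_to_compressedinverted(L2, P)
    tasks = map(chunks(1:P; n = nthreads())) do inds
        @spawn begin
            part = zeros(Int, length(L2), p + 1) # per-task accumulator
            for i in inds, j in Q1inv[i], k in Q2inv[i]
                s = compressed_intersect(i, j[1], k[1])
                isnothing(s) && continue
                part[k[2], s+1] += 1
            end
            part
        end
    end
    res = sum(fetch, tasks)
    for r in eachrow(res)
        r[1] = n - sum(r) # zero-overlap counts by complement, as above
    end
    res
end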
