Computing pairwise Sørensen–Dice coefficient between many BitVectors

Hi all, I’m trying to match (or ideally beat) the performance of a Python/C library (anonlink) that computes the Sørensen–Dice coefficient, but so far the only way I can match its speed is by calling the same C code to carry out the inner loop.

The end goal is an efficient CUDA implementation of this rather simple computation, but I figured that if I can’t get close to the C speed, I’m probably missing something fundamental.
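
For reference, the coefficient for two bit vectors a and b is just

D(a, b) = 2 * popcount(a .& b) / (popcount(a) + popcount(b))

i.e. twice the popcount of the bitwise AND, divided by the sum of the two individual popcounts.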

My fastest pure-Julia version is about 20× slower than the C version:

using StatsBase
using SparseArrays

popcount(ba::BitVector) = count(ba)
popcount(bitvectors::Vector{BitVector}) = count.(bitvectors)
and(b1::BitVector, b2::BitVector) = b1 .& b2


function inner_dice_cmp(b1::BitVector, b2::BitVector, popcount_b1::Int64, popcount_b2::Int64)::Float64
    2.0 * popcount(and(b1, b2)) / (popcount_b1 + popcount_b2)
end


"""
Pairwise Sørensen-Dice Coefficient with thresholding.
"""
function filtered_pairwise_dice_comparison!(results::R, A::V, B::V, threshold::Float64) where {V<:AbstractVector{BitArray{1}}, R<:AbstractArray{Float64, 2}}
    popcount_a_i::Int64 = 0
    A_popcounts = popcount(A)
    B_popcounts = popcount(B)

    for i ∈ 1:size(A,1)
        b1 = A[i]
        popcount_a_i = A_popcounts[i]

        for j ∈ 1:size(B,1)
            d = inner_dice_cmp(b1, B[j], popcount_a_i, B_popcounts[j])
            if d >= threshold
                @inbounds results[i, j] = d
            end
        end
    end
    return results
end

function benchmark_pairwise(pairwise_function; narrays = 5000)
    
    array_bytes = 128

    function generate_random_bitarrays(narrays, array_bytes)
        bitarrays = [BitArray(rand(Bool, array_bytes * 8)) for i in 1:narrays];
    end

    function perturb!(bitarray, modifications=512)   
        indices_to_modify = sample(1:length(bitarray), modifications, replace=false, ordered=true)
        bitarray[indices_to_modify] = rand(Bool, modifications)
    end

    population = generate_random_bitarrays(2 * narrays, array_bytes)

    tiny = sample(population, 100)
    as = sample(population, narrays)
    bs = sample(population, narrays)
    
    for b in bs
        perturb!(b)
    end
    
    println(size(as), size(as[1]))
    # Call the function with a tiny dataset to ensure it is compiled
    pairwise_function(tiny, tiny)

    time_elapsed = @elapsed results = pairwise_function(as, bs)
    
    println("Total comparisons: $(length(as)*length(bs))")
    println("time elapsed: ", time_elapsed)
    println("Comparison rate: $(length(as)*length(bs)/(1e6*time_elapsed)) million/second")

end

@info "Benchmarking pure Julia sparse vector implementation"
function f5(a, b)
    results = spzeros(Float64, length(a), length(b))
    filtered_pairwise_dice_comparison!(results, a, b, 0.95)
    return results
end
benchmark_pairwise(f5, narrays=10000);

On my machine I get around 9 million comparisons per second from that code, and 100 to 120 million/s with Python+C (pip install anonlink; python -m anonlink.benchmark).

I’d appreciate any hints to speed this up. An early CUDA version is currently running at around 800 million comparisons/second without data transfer, and about 200 million/second including copying the (sparse) results back to the host.

Hi, I think the and() method for BitVectors allocates a new BitVector, which could be avoided with a fold or a similar non-allocating pattern.

Basically, combine count(a .& b) into a single call that works without allocating an intermediate result vector.
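
For example, a rough sketch of that idea (untested, and relying on the internal chunks field of BitVector rather than any public API) might look like:

# Sum count_ones over the underlying UInt64 chunks of the two vectors, so no
# intermediate BitVector is allocated. Assumes both vectors have the same length;
# BitVector keeps the unused trailing bits of the last chunk zeroed, so the count is exact.
function popcount_and(b1::BitVector, b2::BitVector)
    c1, c2 = b1.chunks, b2.chunks
    s = 0
    @inbounds for k in eachindex(c1, c2)
        s += count_ones(c1[k] & c2[k])
    end
    return s
end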

It’s not super-easy to write a faster popcountand(b, c) which combines these: mapreduce(&, +, b, c) is 30× slower. But if you have a buffer to write into (which you could re-use?), that helps:

julia> @btime count(b .& c) setup = (b=randn(10^6).>0; c=randn(10^6).>0);
  18.190 μs (3 allocations: 122.23 KiB)

julia> @btime count(d .= b .& c) setup = (b=randn(10^6).>0; c=randn(10^6).>0; d=similar(c));
  11.858 μs (0 allocations: 0 bytes)

Oh nice, preallocating a BitVector for the and gives a decent speedup: I’m now seeing 23M/s (about 1/4 of the Python + C speed).
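
For context, a buffer-reusing inner comparison along the lines of the count(d .= b .& c) pattern above might look like this (sketch only):

# Buffer-reusing variant of the inner comparison. `buffer` must have the same
# length as the bit vectors being compared and is overwritten on every call,
# so it cannot be shared between concurrent comparisons.
function inner_dice_cmp!(buffer::BitVector, b1::BitVector, b2::BitVector,
                         popcount_b1::Int, popcount_b2::Int)
    buffer .= b1 .& b2
    return 2.0 * count(buffer) / (popcount_b1 + popcount_b2)
end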

Perhaps more of the computation in the innermost loop can be vectorized?

It’s tricky, because ideally you also want to do the summation in the inner loop to avoid allocations, but that means we can’t have concurrency there.
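
To illustrate the trade-off: once the inner comparison is fully non-allocating (for example the popcount_and sketch above), the outer loop over A no longer shares any scratch state, so it could in principle be threaded. One caveat is that writing into a SparseMatrixCSC from several threads is not safe, so this rough, unbenchmarked sketch assumes a dense results matrix sized length(A) × length(B):

# Sketch only: assumes the popcount_and helper from above and a dense results
# matrix, so each (i, j) writes its own cell and no scratch buffer is shared.
function threaded_filtered_dice!(results::Matrix{Float64}, A, B, threshold)
    A_popcounts = count.(A)
    B_popcounts = count.(B)
    Threads.@threads for i in eachindex(A)
        b1 = A[i]
        pa = A_popcounts[i]
        for j in eachindex(B)
            d = 2.0 * popcount_and(b1, B[j]) / (pa + B_popcounts[j])
            if d >= threshold
                @inbounds results[i, j] = d
            end
        end
    end
    return results
end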