Computing pairwise Sørensen–Dice coefficient between many BitVectors

Brian_Thorne · August 26, 2020, 10:31am

Hi all, I’m trying to match (or ideally beat) the performance of a Python/C library (anonlink) that computes the Sørensen–Dice coefficient but I’m struggling to match the performance - except when calling the same C code to carry out the inner loop.

The end goal is an efficient cuda implementation of this rather simple computation, but I figured if I can’t get close to matching the C speed I’m probably missing something fundamental.

My fastest Julia version is ~20 times slower than C speed:

using StatsBase
using SparseArrays

popcount(ba::BitVector) = count(ba)
popcount(bitvectors::Vector{BitVector}) = count.(bitvectors)
and(b1::BitVector, b2::BitVector) = b1 .& b2


function inner_dice_cmp(b1::BitVector, b2::BitVector, popcount_b1::Int64, popcount_b2::Int64)::Float64
    2.0 * popcount(and(b1, b2)) / (popcount_b1 + popcount_b2)
end


"""
Pairwise Sørensen-Dice Coefficient with thresholding.
"""
function filtered_pairwise_dice_comparison!(results::R, A::V, B::V, threshold::Float64) where {V<:AbstractVector{BitArray{1}}, R<:AbstractArray{Float64, 2}}
    popcount_a_i::Int64 = 0
    A_popcounts = popcount(A)
    B_popcounts = popcount(B)

    for i ∈ 1:size(A,1)
        b1 = A[i]
        popcount_a_i = A_popcounts[i]

        for j ∈ 1:size(B,1)
            d = inner_dice_cmp(b1, B[j], popcount_a_i, B_popcounts[j])
            if d >= threshold
                @inbounds results[i, j] = d
            end
        end
    end
    return results
end

function benchmark_pairwise(pairwise_function; narrays = 5000)
    
    array_bytes = 128

    function generate_random_bitarrays(narrays, array_bytes)
        bitarrays = [BitArray(rand(Bool, array_bytes * 8)) for i in 1:narrays];
    end

    function perturb!(bitarray, modifications=512)   
        indicies_to_modify = sample(1:length(bitarray), modifications, replace=false, ordered=true)
        bitarray[indicies_to_modify] = rand(Bool, modifications)
    end

    population = generate_random_bitarrays(2 * narrays, array_bytes)

    tiny = sample(population, 100)
    as = sample(population, narrays)
    bs = sample(population, narrays)
    
    for b in bs
        perturb!(b)
    end
    
    println(size(as), size(as[1]))
    # Call the function with a tiny dataset to ensure it is compiled
    pairwise_function(tiny, tiny)

    time_elapsed = @elapsed results = pairwise_function(as, bs)
    
    println("Total comparisons: $(length(as)*length(bs))")
    println("time elapsed: ", time_elapsed)
    println("Comparison rate: $(length(as)*length(bs)/(1e6*time_elapsed)) million/second")

end

@info "Benchmarking pure Julia sparse vector implementation"
function f5(a, b)
    results = spzeros(Float64, length(a), length(b))
    filtered_pairwise_dice_comparison!(results, a, b, 0.95)
    return results
end
benchmark_pairwise(f5, narrays=10000);

On my machine I get around 9 million comparisons per second from that code, and 100 to 120 million/s with Python+C (pip install anonlink; python -m anonlink.benchmark).

Appreciate any hints to speed it up. Early CUDA version is currently running around 800 million/second without data transfer, and 200 million/second including copying (sparse) data back to host.

josuagrw · August 26, 2020, 1:03pm

Hi, I think the and() method for BitVectors allocates a new bitvector which could be avoided with a fold method or similar non-allocating pattern

basically combining count(a .& b) into one call that works without allocating an intermediate result vector

mcabbott · August 26, 2020, 4:12pm

It’s not super-easy to write a faster popcountand(b,c) which combines these, mapreduce(&, +, b, c) is 30x slower. But if you have a buffer to write into (which you could re-use?) that helps:

julia> @btime count(b .& c) setup = (b=randn(10^6).>0; c=randn(10^6).>0);
  18.190 μs (3 allocations: 122.23 KiB)

julia> @btime count(d .= b .& c) setup = (b=randn(10^6).>0; c=randn(10^6).>0; d=similar(c));
  11.858 μs (0 allocations: 0 bytes)

Brian_Thorne · August 26, 2020, 7:35pm

Oh nice - preallocating a BitVector for the and gets a decent speed up - now seeing 23M/s (1/4 the speed of Python + C).

Perhaps more of the computation in the innermost loop can be vectorized?

josuagrw · August 27, 2020, 3:33pm

It’s tricky because ideally you also want to do the summation in the inner loop to avoid allocations – but that means we cannot have concurrency there

Topic		Replies	Views
Hand written loop slower than bitVector broadcast Performance question	15	708	January 18, 2024
BitVector vs Vector{Bool} as default on comparison operations Performance	8	9182	November 2, 2020
Optimization: How to make sure XOR is performed in chunks New to Julia	33	2309	July 6, 2022
Performance hacking `(&)(a::BitArray, b::BitArray)` Performance	14	228	October 22, 2024
Faster conversion from boolean `SparseMatrixCSC` to `BitMatrix` Performance	11	197	November 5, 2024

Computing pairwise Sørensen–Dice coefficient between many BitVectors

Related topics