`issubset(::BitSet,::BitSet)` is slow and allocates unnecessarily

The implementation of issubset between two BitSet: https://github.com/JuliaLang/julia/blob/3909294e78d223c740b7e7a4d00eb2d4771055a7/base/bitset.jl#L419
first intersects the sets. This unnecessarily allocates memory to compute an entirely new BitSet for the intersection. It ended up costing me 30% CPU time on my workload.

It seems that on ‘standard’ sizes of BitSet, the following implementation is often 6x faster and has 0 allocations.

function issubset_fast(s::BitSet,t::BitSet)
    s == t && return true
    s.offset < t.offset && return false
    length(s.bits) > length(t.bits) && return false
    length(s) > length(t) && return false
    return all(e -> e ∈ t, s)
end

s1 = BitSet([2i for i=1:100]);
s2 = BitSet([7i+70 for i=1:15]);
s3 = BitSet([20i+70 for i=1:6]);
s4 = BitSet([2i for i=1:99]);

@btime issubset($s1, $s1);
@btime issubset_fast($s1, $s1); 
@btime issubset($s2, $s1);
@btime issubset_fast($s2, $s1); 
@btime issubset($s3, $s1);
@btime issubset_fast($s3, $s1); 
@btime issubset($s4, $s1);
@btime issubset_fast($s4, $s1); 

Giving:

  85.817 ns (3 allocations: 160 bytes)
  8.865 ns (0 allocations: 0 bytes)
  86.080 ns (3 allocations: 160 bytes)
  14.888 ns (0 allocations: 0 bytes)
  86.146 ns (3 allocations: 160 bytes)
  19.714 ns (0 allocations: 0 bytes)
  88.973 ns (3 allocations: 160 bytes) 
  142.452 ns (0 allocations: 0 bytes)

On the last issubset(s4, s1) the implementation in Base is still faster, I suppose it has been optimized for this case.

Still, I’m hoping that someone who understands the internals of BitSet better than me can make a general-purpose non-allocating version that’s faster on all cases?

3 Likes

maybe something like this?

using BenchmarkTools

function issubset_fast(a::BitSet, b::BitSet)
    n = length(a.bits)
    shift = b.offset - a.offset
    i, j = shift, shift + length(b.bits)

    f(a, b) = a == a & b
    return (
        all(@inbounds iszero(a.bits[i]) for i in 1:min(n, i)) &&
        all(@inbounds f(a.bits[i], b.bits[i - shift]) for i in max(1, i+1):min(n, j)) &&
        all(@inbounds iszero(a.bits[i]) for i in max(1, j+1):n))
end

s1 = BitSet([2i for i=1:100]);
s2 = BitSet([7i+70 for i=1:15]);
s3 = BitSet([20i+70 for i=1:6]);
s4 = BitSet([2i for i=1:99]);

@btime issubset($s1, $s1);
@btime issubset_fast($s1, $s1);
@btime issubset($s2, $s1);
@btime issubset_fast($s2, $s1);
@btime issubset($s3, $s1);
@btime issubset_fast($s3, $s1);
@btime issubset($s4, $s1);
@btime issubset_fast($s4, $s1);

On my machine:

  133.868 ns (3 allocations: 160 bytes)
  8.346 ns (0 allocations: 0 bytes)
  128.104 ns (3 allocations: 160 bytes)
  5.216 ns (0 allocations: 0 bytes)
  123.255 ns (3 allocations: 160 bytes)
  7.014 ns (0 allocations: 0 bytes)
  128.310 ns (3 allocations: 160 bytes)
  8.346 ns (0 allocations: 0 bytes)

Though the speedup observed here might be somewhat unrepresentative.

1 Like