Iterate and intersect on BitSet vs Int

I noticed that looping over a BitSet is significantly slower (~10X) than the equivalent loop over a Vector{Int64}. This creates an undesirable performance tradeoff, since intersect on BitSets is significantly faster than on Ints. I’m wondering whether there is a strategy to get the best of both worlds.

To put this in context, the usage is for building histograms in a tree algorithm. The loop goes over row indices, and a value is accumulated into the corresponding bin of the histogram. This is where row indices as a BitSet perform poorly:

Data setup:

using BenchmarkTools
using StatsBase: sample

nrows = 80_000
ncols = 100
nbins = 32

id1_int = sample(1:nrows, nrows, replace=false, ordered=true)
id2_int = sample(1:nrows, Int(nrows/2), replace=false, ordered=false)

id1_bit = BitSet(sample(1:nrows, nrows, replace=false, ordered=true));
id2_bit = BitSet(sample(1:nrows, Int(nrows/2), replace=false, ordered=false));

hist = zeros(32)
x_bin = sample(UInt8.(1:nbins), nrows, replace=true, ordered=false)
value = rand(nrows)

The histogram loop:

function hist_sum(x::Vector{S}, hist::Vector{T}, set::I, value::Vector{T}) where {S,T,I}
    hist .*= 0
    for i in set
        hist[x[i]] += value[i]
    end
end

@btime hist_sum($x_bin, $hist, $id1_int, $value)
@btime hist_sum($x_bin, $hist, $id1_bit, $value)
  76.900 μs (0 allocations: 0 bytes)
  445.399 μs (0 allocations: 0 bytes)

But the issue is that the intersection of two sets (used to split the indices into the two child nodes) shows the opposite performance:

function inter(x1, x2)
    res = intersect(x1, x2)
    return res
end

@btime inter($id1_int, $id2_int)
@btime inter($id1_bit, $id2_bit)
  8.987 ms (36 allocations: 2.69 MiB)
  910.000 ns (4 allocations: 10.03 KiB)

Any hint as to how to get good performance on both the iteration and the intersection? Converting the BitSet to a Vector{Int} for the loop does improve iteration performance, but it also results in allocations that impair the overall performance.

I suspect that what you are running into is simply part of the design: for BitSet, the representation is packed bits, so you can do all binary operations (union, intersect, setdiff) very fast, but operations that require access to individual bits are slower because one has to unpack them.
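A minimal sketch (not Julia’s actual implementation, just an illustration of the packed representation) of why chunk-wise set operations are so cheap:

```julia
# Each UInt64 chunk of a BitSet packs membership for 64 consecutive integers,
# so intersecting 64 candidate elements costs a single bitwise AND.
chunk_a = UInt64(0b1011)   # set containing offsets {0, 1, 3}
chunk_b = UInt64(0b0110)   # set containing offsets {1, 2}
chunk_a & chunk_b          # == 0b0010, i.e. the intersection {1}

# BitSet's intersect does essentially this over all the packed chunks:
intersect(BitSet([1, 3, 64]), BitSet([3, 64, 200]))  # == BitSet([3, 64])
```

Iteration, by contrast, has to scan each chunk bit by bit to recover the individual indices, which is where the slowdown comes from.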

Choosing between representations is something you should do after benchmarking with “typical” data you expect. Since your code is generic for the set, this should be easy to do.

Incidentally, you don’t need type parameters S,T,I for hist_sum, and you could use an AbstractVector or duck typing for the other arguments.
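For illustration, a duck-typed sketch of the same function without explicit type parameters (hist_sum2! is a hypothetical name; the compiler specializes on the concrete argument types either way):

```julia
# Same behavior as hist_sum above, but with AbstractVector arguments and
# no `where` clause; `set` can be any iterable of indices.
function hist_sum2!(x::AbstractVector, hist::AbstractVector, set, value::AbstractVector)
    fill!(hist, zero(eltype(hist)))
    for i in set
        hist[x[i]] += value[i]
    end
    return hist
end

hist_sum2!([1, 2, 1], zeros(2), 1:3, [0.5, 1.0, 2.0])  # -> [2.5, 1.0]
```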


Sidenote. This is pretty odd:

hist .*= 0

This is equal to

hist .= hist .* 0

Why multiply each element with zero instead of just setting it to zero: hist .= 0? If you’re trying to optimize performance this seems very counterproductive. On my computer it’s 100-1000 times slower!

julia> using Random  # for bitrand

julia> @btime r .= 0 setup=(r = bitrand(10^5));
  126.072 ns (0 allocations: 0 bytes)

julia> @btime r .*= 0 setup=(r = bitrand(10^5));
  83.230 μs (1 allocation: 4.19 KiB)

There is a slightly different version which, with your data, reduces significantly the performance gap for sets:

function hist_sum!(x::Vector, hist::Vector, set, value::Vector)
    hist .= 0
    for i in 1:last(set)
        if i in set
            hist[x[i]] += value[i]
        end
    end
end

But your sample data doesn’t seem typical, as id1_bit is simply the set of numbers 1:nrows.


On top of that, multiplication isn’t guaranteed to set the elements to 0:

julia> (-Inf, Inf, NaN) .* 0
(NaN, NaN, NaN)

For the record, there was room for optimizing BitSet iteration; I just made a PR to Julia. It still doesn’t yield performance as good as the function in my last answer for your example, though.


Thanks, I was unaware of the packed/unpacked notion, so I couldn’t understand the reason for the iteration speed difference.

I had come to use the parametric approach for function arguments; in this case, I would have used AbstractVector{T}. Is such an approach considered bad design, or does it just unnecessarily restrict the function’s applicability?

Effectively, the .= would be sufficient in the case of a plain vector. However, in my actual use case, hist is a Vector{SVector{L, Float64}}, and in that case .= doesn’t do it. Also, comparing .*= and .=, the performance difference doesn’t appear as dramatic (at least in my use case):

function zeroise1(hist)
    hist .*= 0.0
end

function zeroise2(hist)
    hist .= 0.0
end

hist_v = zeros(32)
@btime zeroise1($hist_v)
@btime zeroise2($hist_v)
  6.199 ns (0 allocations: 0 bytes)
  4.199 ns (0 allocations: 0 bytes)

Effectively, in the actual application, id1_int could be non-contiguous, making the 1:length(set) approach non-applicable.

Awesome! This would bring significant speed improvements while avoiding the GC slowdowns that came with the allocations from converting the BitSet to Int.

You should be able to use

hist .= zero(eltype(hist)) 

in that case.

The difference appears to be particularly dramatic for bitarrays.

function zeroise3(hist)
    hist .= [zero(eltype(hist))]
end

using StaticArrays

hist_s = zeros(SVector{2, Float64}, 32)
@btime zeroise1($hist_s)
@btime zeroise3($hist_s)
  18.837 ns (0 allocations: 0 bytes)
  38.849 ns (1 allocation: 96 bytes)

I had to add the [] for it to work. In the case of Vector{SVector}, it seems like .*= remains the most effective.

I didn’t test it, but you definitely should not use []: that causes an allocation. Use Ref() or a one-element tuple (x,) instead.

There’s just no way .*= could be the fastest, that makes no sense.

Maybe try fill!(hist, zero(eltype(hist))) or some variation on that (I’m not on my computer right now).
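Putting the suggestions together, a sketch of the allocation-free alternatives for a Vector{SVector} (assuming the StaticArrays package is available):

```julia
using StaticArrays

hist_s = fill(SVector(1.0, 2.0), 32)

# fill! writes the zero element directly, with no broadcasting machinery
fill!(hist_s, zero(eltype(hist_s)))

# Alternatively, wrap the scalar in Ref so broadcast treats it as
# a zero-dimensional container instead of iterating over the SVector
hist_s .= Ref(zero(eltype(hist_s)))
```

Both avoid the temporary one-element array that the [] version allocates.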

Regarding the slowness of BitSet iteration, we forgot that BitSet exists when we micro-optimized the logical indexing code (probably my fault). We should open an issue or make a PR for this.

The logical indexing code is practically hand-crafted assembly, and is faster because it avoids TZCNT inside the dependency chain (only BLSR is in there), so that all the TZCNT operations are done in parallel.

julia> id2_int = sample(1:nrows, Int(nrows/2), replace=false, ordered=false);
julia> id2_bit = BitSet(id2_int);
julia> id2_bv = falses(nrows); for i in id2_bit id2_bv[i]=true; end;
julia> id2_li = Base.LogicalIndex(id2_bv);

julia> @btime hist_sum($x_bin, $hist, $id2_int, $value)
  149.624 μs (0 allocations: 0 bytes)
julia> @btime hist_sum($x_bin, $hist, $id2_bit, $value)
  420.125 μs (0 allocations: 0 bytes)
julia> @btime hist_sum($x_bin, $hist, $id2_li, $value)
  114.102 μs (0 allocations: 0 bytes)

Edit: @rfourquet already made the PR, yay!