I noticed that looping on a BitSet resulted in a significant slowdown (~10X) compared to the equivalent loop on Int64. This is resulting in undesirable performance tradeoff as intersect on the BitSets is significantly faster than on Int. I’m wondering if there wouldn’t be a strategy to get best of both world.

To put in context, the usage is for building histograms for usage in a tree algorithm. The loop goes over row indices and a value is stored in a corresponding bin in the histogram. This is where row indices as BitSet performs not good:

Data setup:

```
using BenchmarkTools
using StatsBase: sample
nrows = 80_000
ncols = 100
nbins = 32
id1_int = sample(1:nrows, nrows, replace=false, ordered=true)
id2_int = sample(1:nrows, Int(nrows/2), replace=false, ordered=false)
id1_bit = BitSet(sample(1:nrows, nrows, replace=false, ordered=true));
id2_bit = BitSet(sample(1:nrows, Int(nrows/2), replace=false, ordered=false));
hist = zeros(32)
x_bin = sample(UInt8.(1:nbins), nrows, replace=true, ordered=false)
value = rand(nrows)
```

The histogram loop:

```
function hist_sum(x::Vector{S}, hist::Vector{T}, set::I, value::Vector{T}) where {S,T,I}
hist .*= 0
for i in set
hist[x[i]] += value[i]
end
return
end
@btime hist_sum($x_bin, $hist, $id1_int, $value)
@btime hist_sum($x_bin, $hist, $id1_bit, $value)
76.900 μs (0 allocations: 0 bytes)
445.399 μs (0 allocations: 0 bytes)
```

But issue is that the intersect of two ensembles (used to split the indices into the two child nodes) resulted in opposite performance:

```
function inter(x1, x2)
res = intersect(x1, x2)
return res
end
@btime inter($id1_int, $id2_int)
@btime inter($id1_bit, $id2_bit)
8.987 ms (36 allocations: 2.69 MiB)
910.000 ns (4 allocations: 10.03 KiB)
```

Any hint as to how to get performance on both the iteration and the intersect? Converting the BitSet to Int for the loop does improve performance, but also results in allocations that impairs the perormance.