# Why is BitArray so slow?

I was playing around with flipping the signs of numbers in an array, and I was surprised that doing so with BitArrays is much slower. I think the code below is pretty clear:

``````
using BenchmarkTools
using Random

a = rand(-1:2:1, 26, 11)
c = bitrand(26, 11)
b = 2 * c .- 1

function loop(a, b)
    for i in eachindex(a)
        if b[i] == 1
            a[i] *= -1
        end
    end
end

function loop2(a, c)
    for i in eachindex(a)
        if c[i] == true
            a[i] *= -1
        end
    end
end

loop(a, b)
loop2(a, c)

@btime loop(a, b);
@btime loop2(a, c);
``````

`loop(a, b)` gives me about 55 ns, and `loop2(a, c)` gives about 200 ns.

Why would this happen? And what is the absolute fastest way to do this loop?

I guess this comes down to a speed difference between using `Int64`s and any other kind of integer, including single bits.

Unless you are memory bound (which usually isn't the case for non-vectorized loops), `BitArray` will almost always be slower. You can do better if you operate directly on the underlying array.

AFAICT the `BitArray` type doesn't provide a public generic interface to operate on the underlying array. Fortunately, there aren't many operations that can benefit from this since there are only two values each element can take. The particular thing you want to do should be doable with `map!(!, a, a)`.
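As a small illustration (using `c` for the `BitVector`, as in the original post), flipping every bit in place this way goes through a specialized chunk-wise `BitArray` method rather than touching bits one at a time:

``````julia
using Random

c = bitrand(1000)
old = copy(c)

# map!(!, c, c) has a specialized BitArray method that negates
# whole UInt64 chunks at a time instead of reading individual bits.
map!(!, c, c)

@assert c == .!old
``````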


I cannot reproduce. That being said, CMOV is your friend:

``````
julia> N = 10_000; a = rand(-1:2:1, N); c = bitrand(N); b = 2 * c .- 1; d = collect(c);

julia> function loop3(a, c)
           @simd for i in eachindex(a)
               @inbounds a[i] = ifelse(c[i], -a[i], a[i])
           end
           nothing
       end

julia> @btime loop(a, b);
  56.643 μs (0 allocations: 0 bytes)

julia> @btime loop2(a, c);
  61.410 μs (0 allocations: 0 bytes)

julia> @btime loop3(a, c);
  13.170 μs (0 allocations: 0 bytes)

julia> @btime loop3(a, d);
  4.960 μs (0 allocations: 0 bytes)
``````

FWIW, `@simd` is illegal with `BitArray`.


Actually, after https://github.com/JuliaLang/julia/pull/27670, `@simd` should be fine; `@simd ivdep` isn't.

Parallelization is really only bad if you're assigning into BitArrays, and even then we've changed the semantics of `@simd` such that it doesn't perform the bad transform we have run into in the past.

BitArrays are a huge win when you can exploit their Int64 storage and not need to worry about picking out individual bits. Broadcasting now does this with the basic logic operators when possible. Or it can also be a slight win if the size difference is enough to keep you in cache. In any case, it's an easy place to benchmark and profile between the two implementations if it's particularly important.
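A quick sketch of that win (sizes here are illustrative): broadcasting a logic operator over two `BitVector`s combines the packed `UInt64` chunks directly, covering 64 elements per word-sized operation:

``````julia
using Random

a = bitrand(10_000)
b = bitrand(10_000)

# .& on two BitVectors operates on the underlying UInt64 chunks, so
# roughly 157 word-sized ANDs cover all 10_000 elements.
c = a .& b

@assert c isa BitVector
@assert c == map(&, a, b)
``````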


Theyβre also a huge win if you can fit the (parts of the) array (you need) entirely into cache, where you might not be able to otherwise. Weβve seen in LightGraphs that the improvements in performance due to keeping bitarrays in cache outweigh the performance improvements for a vector of booleans for very large graphs where weβre keeping track of whether a vertex has been visited.


Here is another odd performance result from an alternative function. The following `loop3(a, b)` doesn't use conditions at all, but relies on accessing all the indices of `a` and `b` in memory order. For small `a` and `b`, this is slower than `loop(a, b)`, which uses conditions, but as `a` and `b` become large enough, they have the same performance. Is accessing an array element in memory order more expensive than a conditional?

``````
function loop3(a, b)
    for i in eachindex(a)
        a[i] *= -b[i]
    end
    a
end
``````

This is great, thanks! But it's still more than 4x faster using an array of `Int64`s.

Sorry to revive an old topic. But has any of this changed in more recent versions of Julia?
Do we expect BitArray to be slow and should use them only if we are memory constrained?

This looks related. Why is the conversion from BitArray so much slower than from Array{Bool}?

``````
julia> @benchmark convert(Array{Float32}, b) setup=(b=Array(bitrand(28,28,128));)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.867 μs … 591.739 μs  ┊ GC (min … max): 0.00% … 74.02%
 Time  (median):     35.097 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.258 μs ±  31.908 μs  ┊ GC (mean ± σ):  5.89% ±  6.67%

  30.9 μs         Histogram: frequency by time         53.9 μs <

 Memory estimate: 392.06 KiB, allocs estimate: 2.

julia> @benchmark convert(Array{Float32}, b) setup=(b=bitrand(28,28,128);)
BenchmarkTools.Trial: 8586 samples with 1 evaluation.
 Range (min … max):  515.307 μs …  1.291 ms  ┊ GC (min … max): 0.00% … 40.74%
 Time  (median):     556.879 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   576.900 μs ± 66.623 μs  ┊ GC (mean ± σ):  0.63% ±  3.89%

  515 μs        Histogram: log(frequency) by time       850 μs <

 Memory estimate: 392.06 KiB, allocs estimate: 2.
``````

BitArrays pack bits into `UInt64` chunks AFAIK. I suspect that the performance hit is from unpacking.
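A rough sketch of that unpacking cost: every scalar read has to locate a `UInt64` word and mask out a single bit. The `getbit` function below is a hypothetical re-implementation for illustration, and `chunks` is an internal field of `BitArray`, not public API:

``````julia
# Hypothetical re-implementation of scalar BitArray indexing, to show
# the extra work per element compared with loading one byte from a
# Vector{Bool}. The `chunks` field is BitArray internals, not public API.
function getbit(chunks::Vector{UInt64}, i::Int)
    i1, i2 = divrem(i - 1, 64)               # which word, which bit offset
    return (chunks[i1 + 1] >> i2) & 0x1 == 0x1
end

b = trues(100); b[5] = false
@assert getbit(b.chunks, 5) == false
@assert getbit(b.chunks, 6) == true
``````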

Basically, if the size of your vector/array of boolean values is not an issue in and of itself, use `Vector{Bool}` where each byte holds one boolean value. If the size is an issue and 1/8th the size is preferable, use `BitVector` where each byte holds eight boolean values and accept the time overhead.
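The size difference is easy to check; a small sketch using `Base.summarysize`:

``````julia
v = fill(true, 10_000)   # Vector{Bool}: one byte per element
b = trues(10_000)        # BitVector: one bit per element, packed into UInt64s

# The packed representation is roughly 8x smaller.
@assert Base.summarysize(b) < Base.summarysize(v) ÷ 4
``````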

`BitArray`s are extremely fast for operations where unpacking is unnecessary, such as `count`/`sum`, `.!`, etc. It's not just about saving memory.


Saving space is also closely related to speed because the less space your data needs, the fewer cache lines are needed to store it, and the fewer fetches from main memory are needed.


In this particular case, the bitarray algorithm isn't SIMDing. You can get it to by throwing an `@simd ivdep` at it, and then the performance is more similar, but I'm not 100% certain that's safe (see above how we had to remove this in some places specifically because of BitArrays). There's probably a better way of writing it than this naive loop.
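As a sketch of what such a hand-written loop might look like (`tofloat!` is a made-up name, and the `@simd ivdep` carries exactly the safety caveat mentioned above):

``````julia
using Random

# Hypothetical hand-rolled conversion loop. @simd ivdep asserts the
# BitArray loads are independent, which lets the compiler vectorize --
# but see the caveat above about whether that is actually safe here.
function tofloat!(out::Array{Float32}, b::BitArray)
    @simd ivdep for i in eachindex(b)
        @inbounds out[i] = Float32(b[i])
    end
    return out
end

b = bitrand(4, 4)
out = Array{Float32}(undef, size(b))
tofloat!(out, b)
@assert out == Float32.(b)
``````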


The problem is that broadcasting ops like:

``````julia> randn(10) .< 0
10-element BitVector:
1
1
1
1
1
0
0
1
0
1
``````

return a `BitArray` by default, not `Array{Bool}`.
So you are forced to do a conversion afterwards. This hurts if the conversion is slow.

But, as mentioned, `BitArray` is not, in general, slow. In many cases it is blazing fast:

``````
x = rand(1000)
ind = randn(1000) .< 0.5
indbool = Vector{Bool}(ind)

1.7.0> @btime $x[$ind];
  788.060 ns (1 allocation: 5.62 KiB)

1.7.0> @btime $x[$indbool];
  1.290 μs (1 allocation: 5.62 KiB)

1.7.0> @btime count($ind);
  6.000 ns (0 allocations: 0 bytes)

1.7.0> @btime count($indbool);
  22.088 ns (0 allocations: 0 bytes)

1.7.0> @btime .!($ind);
  42.020 ns (2 allocations: 224 bytes)

1.7.0> @btime .!($indbool);
  548.958 ns (3 allocations: 4.41 KiB)
``````

But you need to avoid accessing individual elements.


Yep. But matrix multiplies with a `BitArray` are slow (they hit the generic fallback), so you have to convert the BitArray into something BLAS likes.
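An illustrative sketch of that conversion: multiplying through the `BitMatrix` directly uses the generic matmul fallback, while converting to `Float64` first pays one allocation and then dispatches to BLAS:

``````julia
using Random, LinearAlgebra

c = bitrand(200, 200)
x = randn(200)

# c * x goes through the generic (slow) fallback; Float64.(c) * x
# allocates a dense matrix up front but the multiply then hits BLAS.
y_generic = c * x
y_blas    = Float64.(c) * x

@assert y_generic ≈ y_blas
``````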