I was playing around with flipping the signs of entries in an array, and I was surprised that doing so with a BitArray is much slower. I think the code below is pretty clear:
using BenchmarkTools
using Random
a = rand(-1:2:1, 26, 11)
c = bitrand(26, 11)
b = 2 * c .- 1
function loop(a, b)
    for i in eachindex(a)
        if b[i] == 1
            a[i] *= -1
        end
    end
end

function loop2(a, c)
    for i in eachindex(a)
        if c[i]
            a[i] *= -1
        end
    end
end
loop(a, b)
loop2(a, c)
@btime loop(a, b);
@btime loop2(a, c);
loop(a, b) gives me about 55 ns, and loop2(a, c) gives about 200 ns.
Why would this happen? And what is the absolute fastest way to do this loop?
Unless you are memory bound (which usually isn't the case for non-vectorized loops), a BitArray will almost always be slower. You can do better if you operate directly on the underlying array.
AFAICT the BitArray type doesn't provide a public generic interface for operating on the underlying array. Fortunately, there aren't many operations that can benefit from this, since each element can only take two values. The particular thing you want to do should be doable with map!(!, a, a).
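To illustrate the map!(!, a, a) suggestion: applied to a BitArray, map! with ! negates every bit in place, and Base specializes this path to work on the packed 64-bit chunks rather than bit by bit. A minimal sketch:

```julia
using Random

c = bitrand(1000)
orig = copy(c)

# Negate every bit of the BitArray in place; Base's map!
# specialization for BitArrays processes the underlying
# UInt64 chunks 64 bits at a time.
map!(!, c, c)

@assert c == .!orig
@assert all(c .!= orig)
```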
Parallelization is really only bad if you're assigning into BitArrays, and even then we've changed the semantics of @simd such that it doesn't perform the bad transform we have run into in the past.
BitArrays are a huge win when you can exploit their Int64 storage and don't need to worry about picking out individual bits. Broadcasting now does this with the basic logic operators when possible. It can also be a slight win if the size difference is enough to keep you in cache. In any case, it's an easy place to benchmark and profile between the two implementations if it's particularly important.
They're also a huge win if you can fit the (parts of the) array (you need) entirely into cache, where you might not be able to otherwise. We've seen in LightGraphs that, for very large graphs where we track whether each vertex has been visited, the gains from keeping BitArrays in cache outweigh the per-element performance advantage of a vector of booleans.
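The point about broadcasting the basic logic operators can be seen directly: broadcast over BitArrays is specialized to combine the packed UInt64 chunks, so each operation handles 64 elements at once. A small sketch:

```julia
using Random

c1 = bitrand(10_000)
c2 = bitrand(10_000)

# These broadcasts operate chunkwise on the packed storage,
# combining 64 boolean values per machine operation.
both    = c1 .& c2
either  = c1 .| c2
flipped = .!c1

@assert both[1]   == (c1[1] & c2[1])
@assert either[1] == (c1[1] | c2[1])
@assert flipped[1] == !c1[1]
```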
Yet another odd performance result from an alternative function. The following loop3(a, b) doesn't use a conditional at all, but relies on accessing all the indices of a and b in memory order. For small a and b, this is slower than loop(a, b), which uses a conditional, but once a and b become large enough they have the same performance. Is accessing an array element in memory order more expensive than a conditional?
function loop3(a, b)
    for i in eachindex(a)
        a[i] *= -b[i]
    end
    a
end
Sorry to revive an old topic. But has any of this changed in more recent versions of Julia?
Do we expect BitArray to be slow, and should we use it only when memory constrained?
Basically, if the size of your vector/array of boolean values is not an issue in and of itself, use Vector{Bool}, where each byte holds one boolean value. If the size is an issue and 1/8th the footprint is worthwhile, use BitVector, where each byte holds eight boolean values, and accept the time overhead.
Saving space is also closely related to speed because the less space your data needs, the fewer cache lines are needed to store it, and the fewer fetches from main memory are needed.
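The 8x size difference is easy to check. A sketch (note that chunks is an internal, undocumented field of BitArray, used here only to show the packed storage):

```julia
v = Vector{Bool}(undef, 10_000)   # one byte per element
b = BitVector(undef, 10_000)      # one bit per element, packed into UInt64s

# Vector{Bool}: 10_000 bytes of element storage.
@assert sizeof(v) == 10_000

# BitVector: ceil(10_000 / 64) = 157 UInt64 chunks = 1_256 bytes.
# (b.chunks is an internal field; shown only for illustration.)
@assert length(b.chunks) == cld(10_000, 64)
@assert length(b.chunks) * sizeof(UInt64) == 1_256
```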
In this particular case, the BitArray algorithm isn't SIMDing. You can get it to by throwing an @simd ivdep at it (and then the performance is more similar), but I'm not 100% certain that's safe (see above how we had to remove this in some places specifically because of BitArrays). There's probably a better way of writing it than this naive loop:
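A sketch of what the @simd ivdep variant might look like (loop2_simd! is a hypothetical name; per the caveat above, ivdep asserts no loop-carried dependencies and its safety with BitArray reads is not guaranteed). Using ifelse instead of a branch also tends to be friendlier to vectorization:

```julia
function loop2_simd!(a, c::BitArray)
    # ivdep promises the compiler there are no dependencies
    # between iterations -- unverified for BitArrays, see above.
    @inbounds @simd ivdep for i in eachindex(a)
        # Branchless sign flip: negate a[i] where c[i] is set.
        a[i] = ifelse(c[i], -a[i], a[i])
    end
    a
end

a = [1, 2, 3, 4]
c = BitVector([true, false, true, false])
loop2_simd!(a, c)
@assert a == [-1, 2, -3, 4]
```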