Why is mul! so fast with BitVector?

Consider the following example code:

using LinearAlgebra, BenchmarkTools

n = 200
xb = BitVector(rand(Bool, n))
xf = Float32.(xb)
W = randn(Float32,n,n)
y = zeros(Float32,n);

function mymul!(y, A, x)
    y .= 0
    @inbounds for j = 1:length(x)
        @simd for i = 1:length(y)
            #y[i] = muladd(A[i,j], x[j], y[i])
            y[i] += A[i,j] * x[j]
        end
    end
end

I get the following benchmark timings:

@btime mymul!($y, $W, $xf);   # 4.542 μs (0 allocations: 0 bytes)
@btime mul!($y, $W, $xf);     # 7.541 μs (0 allocations: 0 bytes)
@btime mymul!($y, $W, $xb);   # 15.586 μs (0 allocations: 0 bytes)  (why?)
@btime mul!($y, $W, $xb);     # 7.864 μs (0 allocations: 0 bytes)

So why is mymul! so slow when it is given a BitVector, compared to mul! from LinearAlgebra?

Note that this question is very similar to "Why mul! is so fast?". In that thread I focused on Float64 arguments to keep things simple. Here I am asking about a difference that shows up specifically with BitVector arguments. I expect the reason to be different from the one in that thread, because of the way the Bools are stored as bits in a BitVector. Also note that I didn't find such a big difference when using Float64 instead of Float32, and I'm not sure how relevant that is.
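For reference, the bit packing mentioned above can be inspected directly. This is only a minimal sketch: `chunks` is an internal field of BitArray (an implementation detail, not public API), and `getbit` is a hypothetical helper written here for illustration. It shows that reading one element means picking a 64-bit word, then shifting and masking.

```julia
n = 200
xb = BitVector(rand(Bool, n))

# `chunks` is an internal field of BitArray (implementation detail, not public API):
# the Bools are packed 64 per UInt64 word.
chunks = xb.chunks
@assert eltype(chunks) == UInt64
@assert length(chunks) == cld(n, 64)   # 200 bits fit in 4 words

# Hypothetical helper that mirrors what an indexed read has to do:
# select the word, shift the wanted bit down, mask it out.
function getbit(chunks::Vector{UInt64}, j::Integer)
    word = chunks[((j - 1) >> 6) + 1]
    return (word >> ((j - 1) & 63)) & 0x01 == 0x01
end

# The manual extraction agrees with ordinary indexing for every element.
@assert all(getbit(chunks, j) == xb[j] for j in 1:n)
```

So each `x[j]` on a BitVector involves more work than a plain load from a Vector{Float32}, before the Bool is even converted to Float32.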

It’s probably the conversion of the BitVector elements to Float32 inside the inner loop. Here’s another way to do it.

function mymul!(y, A, x::BitVector)
    y .= 0
    for j in 1:length(x)
        @inbounds x[j] && (y .+= @view A[:,j])
    end
end

Quite a bit faster.
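If you want to keep a single generic method instead of a BitVector-specific one, another option is to read and convert `x[j]` once per column, outside the `@simd` loop. The sketch below is an illustration only: `mymul_hoisted!` is a made-up name, and I have not benchmarked it, so whether it closes the gap would need to be checked with `@btime` on your machine.

```julia
using LinearAlgebra

# Hypothetical variant of mymul!: x[j] is read and converted once per column,
# so the inner loop only touches A and y.
function mymul_hoisted!(y, A, x)
    size(A) == (length(y), length(x)) || throw(DimensionMismatch("size(A) must be (length(y), length(x))"))
    fill!(y, zero(eltype(y)))
    @inbounds for j in axes(A, 2)
        xj = convert(eltype(y), x[j])   # one read (and one conversion) per column
        @simd for i in axes(A, 1)
            y[i] = muladd(A[i, j], xj, y[i])
        end
    end
    return y
end

n = 200
xb = BitVector(rand(Bool, n))
W = randn(Float32, n, n)
y = zeros(Float32, n)

# Check against mul! on the Float32 copy of x.
yref = mul!(zeros(Float32, n), W, Float32.(xb))
mymul_hoisted!(y, W, xb)
@assert y ≈ yref
```

Unlike the `x[j] && ...` version, this does the multiply for every column, including those where `x[j]` is false, but it works unchanged for Float32 or Float64 input vectors.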
