Xor performance

Elrod · October 12, 2019, 8:11pm

Wrap your code in three backticks “```”.

If you want to optimize code working on bits, I think it would be a lot easier to just work with unsigned integers than BitArrays.
If you’re working on bits, a lot of things are slow (like accessing an individual element), but many can take extreme advantage of SIMD (like xor). When optimizing code, it’s easier not to have that abstracted away. With UInts, we can write a simple for loop, and it will be fast.

Anyway, some setup:

using BenchmarkTools, Random
Random.seed!(1234);

vetbit = bitrand(1,64);
matbit = bitrand(6400,1);

res_bits = BitArray(undef, 6400, 64);
res_bits2 = similar(res_bits);

We can extract the chunks from the BitArrays; these chunks are vectors of UInt.

res_bits2_chunks = reshape(res_bits2.chunks, (100, 64));
vetbit_chunks = vetbit.chunks;
matbit_chunks = matbit.chunks;

And write a few loops working on the unsigned integers:

function broadcast_xor_bitrow_with_bitcol!( C, a, b )
    @inbounds for i in eachindex(a)
        ai = a[i]
        for bit in 0:8sizeof(ai)-1
            m = (ai & (one(UInt) << bit)) == zero(UInt) ? zero(UInt) : zero(UInt) - one(UInt)
            col = 1 + (i-1)*8sizeof(ai) + bit
            for j in 1:size(C,1)
                C[j, col] = m ⊻ b[j]
            end
        end
    end
    C
end

Of course, this code assumes everything is a multiple of 8sizeof(UInt) (is, 64 on a 64 bit machine). So you may need to modify your real code (eg, add some checks and early stopping), but this was true in your convenient example.
The line:
m = (ai & (one(UInt) << bit)) == zero(UInt) ? zero(UInt) : zero(UInt) - one(UInt)
checks to see if the bit numbered bit (bit == 0 is the rightmost) from ai equals zero, and then assigns to m either zero(UInt), or zero(UInt) - one(UInt)

julia> bitstring(zero(UInt))
"0000000000000000000000000000000000000000000000000000000000000000"

julia> bitstring(zero(UInt) - one(UInt))
"1111111111111111111111111111111111111111111111111111111111111111"

Then it uses these to xor at least 64 bits at a time. At least this much, because on my computer this code is actually vectorized further by the compiler to do 2048 bits at a time by using 4x 512-bit SIMD instructions per loop iteration.

Now, we can test it:

julia> res_bits .= vetbit .⊻ matbit;

julia> broadcast_xor_bitrow_with_bitcol!(res_bits2_chunks, vetbit_chunks, matbit_chunks);

julia> all(res_bits .== res_bits2)
true

We get the same answer!
Now, benchmark:

julia> @btime $res_bits .= $vetbit .⊻ $matbit;
  910.931 μs (1 allocation: 4.19 KiB)

julia> @btime broadcast_xor_bitrow_with_bitcol!($res_bits2_chunks, $vetbit_chunks, $matbit_chunks);
  745.148 ns (0 allocations: 0 bytes)

Over 1,200x faster!

Of course, it would be nice if broadcasting handled this automatically.