Optimization: How to make sure XOR is performed in chunks

Right, of course:

julia> x = Ref(rand(UInt64));

julia> y = Ref(rand(UInt64));

julia> @btime $x[] ⊻ $y[];
  1.750 ns (0 allocations: 0 bytes)

julia> x = Ref(rand(UInt128));

julia> y = Ref(rand(UInt128));

julia> @btime x[] ⊻ $y[];
  31.367 ns (2 allocations: 64 bytes)

@lesshaste, the important bit here for you is that using UInt128 is slower than using a BitVector.
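
For reference, here is a minimal sketch of what "XOR in chunks" could look like when the 128 bits are stored as two 64-bit words rather than as one UInt128. The names `xor_chunks!` and `to_u128` are hypothetical helpers made up for this illustration, not anything from Base or from the code discussed in this thread:

```julia
# Hypothetical helper: XOR two equal-length vectors of 64-bit words into dest,
# one UInt64 chunk at a time.
function xor_chunks!(dest::Vector{UInt64}, a::Vector{UInt64}, b::Vector{UInt64})
    # eachindex with several arrays throws if their indices do not match
    @inbounds for i in eachindex(dest, a, b)
        dest[i] = a[i] ⊻ b[i]
    end
    return dest
end

# Hypothetical helper: view two 64-bit words (low word first) as one UInt128.
to_u128(w::Vector{UInt64}) = UInt128(w[1]) | (UInt128(w[2]) << 64)

a = rand(UInt64, 2)   # 128 bits as two 64-bit chunks
b = rand(UInt64, 2)
c = xor_chunks!(similar(a), a, b)

# The chunked result agrees with a single 128-bit XOR.
@assert to_u128(c) == to_u128(a) ⊻ to_u128(b)
```

To time it, something like `@btime xor_chunks!($c, $a, $b)` should work, assuming BenchmarkTools is loaded. Every variable should be interpolated with `$`: a non-interpolated global is looked up as an untyped global inside the benchmark, which typically adds overhead and allocations to the measurement.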