Xor performance

Hello. I am developing a method that uses a compact representation of bitwise arrays instead of using boolean or even float vectors. One of the advantages, besides compression, is the xor operation that operates efficiently in the cpu architecture. However, using julia, I have some questions regarding performance and how to improve. I will leave here some observations and below the code with the times:
1). * Operation between a vector and a float matrix is ​​more efficient than an x ​​operation with a bit vector and a bit matrix
2) an operation * between two float arrays is more efficient than an operation xor a bit array and a bit array
3) To speed up xor operation between a bit vector and a bit array, it is best to replicate the array size vector and do the operation. However, replication wastes time and memory.

These issues are making it impossible for me to do coding and operation by bit arrays. Although more compact, it gets slower orders of magnitude.

Code:
‘’’
using BenchmarkTools
using Random
rng = MersenneTwister(1234)

vetbit = bitrand(rng,1,64) # 1 x 64
vetmatbit = repeat(vetbit,6400,1) # 6400 x 64
matbit = bitrand(rng,6400,64) # 6400 x 64

vetbool = rand(rng,Bool,1,64) # 1 x 64
vetmatbool = repeat(vetbool,6400,1) # 6400 x 64
matbool = rand(rng,Bool,6400,64) # 6400 x 64

matfloat = rand(rng,Float32,6400,64) # 6400x64
vetfloat = rand(rng,Float32,1,64) # 1x64

mat2float = rand(rng,Float32,64,64) # 1x64

println(“xor and bitarrays********”)
println(“vetbit xor matbit time:”)
@btime vetbit .⊻ matbit #res: 6400 x 64
println(“matbit xor matbit time:”)
@btime vetmatbit .⊻ matbit #res: 6400 x 64
println(“xor and boolarrays********”)
println(“vetbool xor matbool time:”)
@btime vetbool .⊻ matbool #res: 6400 x 64
println(“matbool xor matbool time:”)
@btime vetmatbool .⊻ matbool #res: 6400 x 64
println("************ .* and float32 arrays********************")
println(“vetfloat .* matfloat:”)
@btime vetfloat .* matfloat #res: 6400 x 64
println("************ * and float32 arrays********************")
println(“matfloat * mat2float:”)
@btime matfloat * mat2float #res: 6400 x 64
‘’’
Execution times:

xor and bitarrays********
vetbit xor matbit time:
1.196 ms (6 allocations: 54.36 KiB)
matbit xor matbit time:
4.603 μs (5 allocations: 50.17 KiB)
xor and boolarrays********
vetbool xor matbool time:
526.435 μs (6 allocations: 54.36 KiB)
matbool xor matbool time:
408.657 μs (6 allocations: 54.36 KiB)
************ .* and float32 arrays********************
vetfloat .* matfloat:
161.289 μs (4 allocations: 1.56 MiB)
************ * and float32 arrays********************
matfloat * mat2float:
475.677 μs (2 allocations: 1.56 MiB)

Wrap your code in three backticks “```”.

If you want to optimize code working on bits, I think it would be a lot easier to just work with unsigned integers than BitArrays.
If you’re working on bits, a lot of things are slow (like accessing an individual element), but many can take extreme advantage of SIMD (like xor). When optimizing code, it’s easier not to have that abstracted away. With UInts, we can write a simple for loop, and it will be fast.

Anyway, some setup:

using BenchmarkTools, Random
Random.seed!(1234);

vetbit = bitrand(1,64);
matbit = bitrand(6400,1);

res_bits = BitArray(undef, 6400, 64);
res_bits2 = similar(res_bits);

We can extract the chunks from the BitArrays; these chunks are vectors of UInt.

res_bits2_chunks = reshape(res_bits2.chunks, (100, 64));
vetbit_chunks = vetbit.chunks;
matbit_chunks = matbit.chunks;

And write a few loops working on the unsigned integers:

function broadcast_xor_bitrow_with_bitcol!( C, a, b )
    @inbounds for i in eachindex(a)
        ai = a[i]
        for bit in 0:8sizeof(ai)-1
            m = (ai & (one(UInt) << bit)) == zero(UInt) ? zero(UInt) : zero(UInt) - one(UInt)
            col = 1 + (i-1)*8sizeof(ai) + bit
            for j in 1:size(C,1)
                C[j, col] = m ⊻ b[j]
            end
        end
    end
    C
end

Of course, this code assumes everything is a multiple of 8sizeof(UInt) (is, 64 on a 64 bit machine). So you may need to modify your real code (eg, add some checks and early stopping), but this was true in your convenient example.
The line:
m = (ai & (one(UInt) << bit)) == zero(UInt) ? zero(UInt) : zero(UInt) - one(UInt)
checks to see if the bit numbered bit (bit == 0 is the rightmost) from ai equals zero, and then assigns to m either zero(UInt), or zero(UInt) - one(UInt)

julia> bitstring(zero(UInt))
"0000000000000000000000000000000000000000000000000000000000000000"

julia> bitstring(zero(UInt) - one(UInt))
"1111111111111111111111111111111111111111111111111111111111111111"

Then it uses these to xor at least 64 bits at a time. At least this much, because on my computer this code is actually vectorized further by the compiler to do 2048 bits at a time by using 4x 512-bit SIMD instructions per loop iteration.

Now, we can test it:

julia> res_bits .= vetbit .⊻ matbit;

julia> broadcast_xor_bitrow_with_bitcol!(res_bits2_chunks, vetbit_chunks, matbit_chunks);

julia> all(res_bits .== res_bits2)
true

We get the same answer!
Now, benchmark:

julia> @btime $res_bits .= $vetbit .⊻ $matbit;
  910.931 μs (1 allocation: 4.19 KiB)

julia> @btime broadcast_xor_bitrow_with_bitcol!($res_bits2_chunks, $vetbit_chunks, $matbit_chunks);
  745.148 ns (0 allocations: 0 bytes)

Over 1,200x faster!

Of course, it would be nice if broadcasting handled this automatically.

9 Likes

wow, I just have to thank you for the quick solution and code suggestion. This optimization will make the method viable. Interestingly, I will not need a mixture of Fortran or C with julia. Thnaks.

5 Likes

I think this kind of implementation could turn a package or even in the future be embedded in the xor operator.