Maybe it differs by CPU, depending on whether the AVX instructions are worth it. It was quite a lot faster for me.
Yes, you would need to iterate if you have more than 128 bits. Luckily there is also UInt128 for n < 129.
It is (basically) the same as long as you use broadcast assignment (.=) as well. But doing a single xor on a pair of UInt64 values is faster still (I'm assuming it has something to do with arrays being allocated on the heap instead of the stack?). Some timings:
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.309 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.021 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64);
julia> y = rand(UInt64);
julia> @btime $x ⊻ $y;
0.019 ns (0 allocations: 0 bytes)
julia> x = rand(UInt128);
julia> y = rand(UInt128);
julia> @btime $x ⊻ $y
0.019 ns (0 allocations: 0 bytes)
EDIT: for the record, @inbounds made the timings worse for the BitVector and Vector{UInt64} cases on my computer as well.
EDIT2: The real question I have is why broadcast!(xor, z, x, y) is slower than z .= xor.(x, y):
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime broadcast!(⊻, $z, $x, $y);
18.687 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime broadcast!(⊻, $z, $x, $y);
16.280 ns (0 allocations: 0 bytes)
Okay, but generally you have more samples, or rather an unknown number of them, I would think.
As for the sub-nanosecond timings, those are not real; when that happens, the compiler is eliding the entire computation.
Right, of course:
julia> x = Ref(rand(UInt64));
julia> y = Ref(rand(UInt64));
julia> @btime $x[] ⊻ $y[];
1.750 ns (0 allocations: 0 bytes)
julia> x = Ref(rand(UInt128));
julia> y = Ref(rand(UInt128));
julia> @btime x[] ⊻ $y[];
31.367 ns (2 allocations: 64 bytes)
@lesshaste the important bit here for you is that using UInt128 is slower than BitVector.
Another trick to see the real timings is to use the setup feature of BenchmarkTools:
julia> @benchmark a ⊻ b setup=((a,b)=rand(UInt,2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.699 ns (0.00% GC)
median time: 1.800 ns (0.00% GC)
mean time: 1.769 ns (0.00% GC)
maximum time: 11.200 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
If you don’t plan on having more than 64 bits, then it’s certainly faster to work with UInt64 values directly. You can easily make a BitVector64 type that acts like a 64-element boolean vector but is represented as a single UInt64 value. It may or may not be worth it to have something that behaves in a vector-like fashion; BitVectors have more overhead since they are arbitrary-length.
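For illustration, a minimal sketch of such a wrapper could look like the following (BitVector64 and hamming64 are hypothetical names, not an existing type):

# Hypothetical sketch: a 64-element boolean vector stored in a single
# UInt64, so it is immutable and can stay in registers / on the stack.
struct BitVector64 <: AbstractVector{Bool}
    bits::UInt64
end

Base.size(::BitVector64) = (64,)
Base.getindex(v::BitVector64, i::Int) = (v.bits >> (i - 1)) % Bool

# Bitwise operations reduce to single machine instructions:
Base.xor(a::BitVector64, b::BitVector64) = BitVector64(xor(a.bits, b.bits))
hamming64(a::BitVector64, b::BitVector64) = count_ones(xor(a.bits, b.bits))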
You forgot a crucial part:
julia> @btime x[] ⊻ $y[];
55.262 ns (3 allocations: 96 bytes)
julia> @btime $x[] ⊻ $y[];
1.792 ns (0 allocations: 0 bytes)
Perhaps a StaticBitVector type that contains a tuple of Int64 values would be the best of both worlds. If someone has the time and is working on this anyway, implementing it would be a great addition.
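A minimal sketch of that idea (hypothetical names; I use UInt64 chunks, as BitVector does, rather than Int64):

# Hypothetical sketch: a fixed-size bit vector stored as a tuple of
# chunks, so it is stack-friendly like a StaticArray.
struct StaticBitVector{N} <: AbstractVector{Bool}
    chunks::NTuple{N,UInt64}
end

Base.size(::StaticBitVector{N}) where {N} = (64 * N,)

function Base.getindex(v::StaticBitVector, i::Int)
    chunk, bit = divrem(i - 1, 64)
    (v.chunks[chunk + 1] >> bit) % Bool
end

# map over a tuple is unrolled by the compiler and allocation-free:
Base.xor(a::StaticBitVector{N}, b::StaticBitVector{N}) where {N} =
    StaticBitVector(map(xor, a.chunks, b.chunks))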
An alternative implementation idea is to use the Bits package, which already has a BitVector1 type that wraps a single integer and is supposed to have an interface similar to BitVector. It’s on my radar to develop this type a bit and make it more useful. Then it could be used with integer types of any bit size to emulate a kind of “static bit-vector”.
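For reference, basic usage looks something like this (going from my reading of the package README; the LSB-first indexing convention is an assumption to verify):

using Bits  # https://github.com/rfourquet/Bits.jl

x = 0b1011
v = bits(x)   # BitVector1: an immutable AbstractVector{Bool} view of x's bits
v[1], v[4]    # (true, true), assuming index 1 is the least-significant bit
count(v)      # 3 set bits, via the generic AbstractVector interface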
A bit late to the party, but I just had a similar use case today. The fastest method I could find was working on UInt8 instead of BitArrays, using count_ones and LoopVectorization:
using LoopVectorization

# Hamming distance: popcount of the elementwise XOR.
function hamming_distance(h1, h2)
    s = 0
    @avx for i = 1:length(h1)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
using Random  # randstring

# hash_length is the hash size in bytes, as defined for the use case
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))
julia> @btime hamming_distance($h1, $h2)
11.813 ns (0 allocations: 0 bytes)
The method above (based on BitVectors of the same data) is slower for me:
# Copy the raw bytes of v into the chunks of a fresh BitVector,
# so each byte becomes 8 bits of the result.
function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)  # 8 bits per byte
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end
h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)
julia> @btime hamming($h1b, $h2b, $buf)
16.933 ns (0 allocations: 0 bytes)
Is bitwise Hamming distance on the GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster than (or even as fast as) the CPU.
Edit: BTW, I was able to beat the performance of a ~150 LOC Cython implementation with fewer than 30 lines of Julia code.
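For reference, the most direct CUDA.jl formulation I'm aware of is a fused map-reduce over the data (an untested sketch; with hashes this small, kernel launch overhead alone would likely dwarf the ~12 ns CPU time):

using CUDA

# Untested sketch: elementwise xor + popcount fused into a sum reduction.
hamming_gpu(h1::CuArray, h2::CuArray) =
    mapreduce((a, b) -> count_ones(xor(a, b)), +, h1, h2)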
Your approach looks good to me.
Quite late to the party, but I find that @lungben’s approach is 2x faster with

h1_64 = reinterpret(Int, h1)  # view the bytes as 64-bit integers; length must be a multiple of 8 bytes
h2_64 = reinterpret(Int, h2)
@btime hamming_distance($h1_64, $h2_64)
For a total of a 20x speedup over the OP, replacing @lesshaste’s hamming function with:
using LoopVectorization

function hamming(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @avx for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
I get a crossover, where @tturbo performs better than @turbo (@avx), when bits1 and bits2 are each ~1792 chunks long (64 * (1024+512+256) bits). This is likely a cache line effect: div(2*1792, 512) == 7.
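A sketch of the threaded variant, for anyone who wants to reproduce the crossover (the same chunk-wise kernel as above, only the macro changes):

using LoopVectorization

function hamming_tturbo(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @tturbo for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end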