Optimization: How to make sure XOR is performed in chunks

If you compute the Hamming distance between two 64-bit vectors represented as two UInt64 values, that is one XOR and one popcount assembly instruction. In total that most likely takes less than a nanosecond.
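For concreteness, a sketch of that single-word case (my own example, not code from the thread):

julia> hamming64(a::UInt64, b::UInt64) = count_ones(a ⊻ b);   # one xor, one popcnt

julia> hamming64(0x000000000000000f, 0x00000000000000ff)
4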

Ok, that makes sense. But that is, AFAIK, exactly what bitvectors do.

If you look at the assembly generated for the hamming function, you can see it is doing something much more expensive.
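One quick way to check that yourself (a sketch; hamming here stands in for the BitVector-based function from earlier in the thread, so adjust the call to its actual signature):

julia> using Random

julia> x, y = bitrand(64), bitrand(64);

julia> @code_native hamming(x, y)                    # loops over chunks, with loads and branches

julia> f(a, b) = count_ones(a ⊻ b);

julia> @code_native f(rand(UInt64), rand(UInt64))    # essentially just xor + popcnt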

That is odd, because:

julia> a = bitrand(64*3);

julia> a.chunks
3-element Array{UInt64,1}:
 0x223aff951d3eabc4
 0x3f59f423778c8580
 0xdcf4b11767dcb777

Bitvectors are just UInt64s under the hood. That is what’s supposed to make them fast.


To quote someone else:

"If you use a UInt64 instead of a BitVector, a few things happen:

  • UInts are stored directly in the arrays instead of pointers being stored.
  • No iteration is needed in hamming; it boils down to only an xor and a popcnt.
  • find_min can vectorize, running hamming in parallel."

I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of UInt64. Iteration is needed if you have an array of UInt64s, surely.

I don’t see why they are different. Something to do with broadcasting?

Maybe it differs by CPU whether the AVX is worth it. It was quite a lot faster for me.

Yes, you would need to iterate if you have more than 128 bits. Luckily there is also UInt128 for n < 129.

It is (basically) the same as long as you use broadcast assignment (.=) as well. But doing a single xor on a pair of UInt64 values is faster still (I’m assuming it has something to do with arrays being allocated on the heap instead of the stack?). Some timings:

julia> x = bitrand(64);

julia> y = bitrand(64);

julia> z = similar(x);

julia> @btime $z .= $x .⊻ $y;
  9.309 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64, 1);

julia> y = rand(UInt64, 1);

julia> z = similar(x);

julia> @btime $z .= $x .⊻ $y;
  9.021 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64);

julia> y = rand(UInt64);

julia> @btime $x ⊻ $y;
  0.019 ns (0 allocations: 0 bytes)

julia> x = rand(UInt128);

julia> y = rand(UInt128);

julia> @btime $x ⊻ $y
  0.019 ns (0 allocations: 0 bytes)

EDIT: for the record, @inbounds made the timings worse for the BitVector and Vector{UInt64} cases on my computer as well.

EDIT2: The real question I have is why using broadcast!(xor, z, x, y) is slower than z .= xor.(x, y):

julia> x = bitrand(64);

julia> y = bitrand(64);

julia> z = similar(x);

julia> @btime broadcast!(⊻, $z, $x, $y);
  18.687 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64, 1);

julia> y = rand(UInt64, 1);

julia> z = similar(x);

julia> @btime broadcast!($⊻, $z, $x, $y);
  16.280 ns (0 allocations: 0 bytes)

Okay, but generally you have more, or rather, an unknown number of samples, I would think.

As for the sub-nanosecond timings, those are not real; when that happens, it’s the compiler eliding the entire computation.

Right, of course:

julia> x = Ref(rand(UInt64));

julia> y = Ref(rand(UInt64));

julia> @btime $x[] ⊻ $y[];
  1.750 ns (0 allocations: 0 bytes)

julia> x = Ref(rand(UInt128));

julia> y = Ref(rand(UInt128));

julia> @btime x[] ⊻ $y[];
  31.367 ns (2 allocations: 64 bytes)

@lesshaste the important bit here for you is that using UInt128 is slower than BitVector.

Another trick to see the real timings is to use the setup feature of BenchmarkTools:

julia> @benchmark a ⊻ b setup=((a,b)=rand(UInt,2))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.699 ns (0.00% GC)
  median time:      1.800 ns (0.00% GC)
  mean time:        1.769 ns (0.00% GC)
  maximum time:     11.200 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

If you don’t plan on having more than 64 bits, then it’s certainly faster to work with UInt64 values directly. You can easily make a BitVector64 type that acts like a 64-element boolean vector but is represented as a single UInt64 value. It may or may not be worth it to have something that behaves in a vector-like fashion. BitVectors have more overhead since they are arbitrary-length.
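A minimal sketch of what such a type could look like (my own illustration, not code from this thread or an existing package):

struct BitVector64 <: AbstractVector{Bool}
    bits::UInt64
end

Base.size(::BitVector64) = (64,)

function Base.getindex(v::BitVector64, i::Int)
    @boundscheck checkbounds(v, i)
    return (v.bits >> (i - 1)) & 0x1 == 0x1   # bit i, 1-based
end

# The Hamming distance is again a single xor + popcount.
hamming(a::BitVector64, b::BitVector64) = count_ones(a.bits ⊻ b.bits)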


You forgot a crucial part:

julia> @btime x[] ⊻ $y[];
  55.262 ns (3 allocations: 96 bytes)

julia> @btime $x[] ⊻ $y[];
  1.792 ns (0 allocations: 0 bytes)

Perhaps a StaticBitVector type that contains a tuple of Int64 values would be the best of both worlds. If someone has the time and is working on this anyway, implementing it would be a great addition.
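A rough sketch of that idea (my own, hedged; I use UInt64 chunks rather than Int64, and a real implementation would need the full AbstractVector interface):

struct StaticBitVector{N}
    chunks::NTuple{N,UInt64}   # fixed number of 64-bit chunks, stored inline (no heap allocation)
end

# The chunk count is a compile-time constant, so this can unroll into straight-line code.
hamming(a::StaticBitVector{N}, b::StaticBitVector{N}) where {N} =
    sum(count_ones.(a.chunks .⊻ b.chunks))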

An alternative implementation idea is to use the Bits package, which already has a BitVector1 type that wraps a single integer and is supposed to have a similar interface to BitVector. It’s on my radar to develop this type a bit and make it more useful. Then it could be used with integer types of any bit size to emulate a kind of “static bit-vector”.


A bit late to the party, but I just had a similar use case today.
The fastest method I could find was working on UInt8 values instead of BitArrays, using count_ones and LoopVectorization:

using Random, LoopVectorization

function hamming_distance(h1, h2)
    s = 0
    # XOR the chunks and count differing bits; @avx lets the loop vectorize
    @avx for i = 1:length(h1)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end

# hash_length is the hash size in bytes (defined elsewhere in my code)
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))

julia> @btime hamming_distance($h1, $h2)
  11.813 ns (0 allocations: 0 bytes)

The method from earlier in the thread (based on BitVectors) is slower for me:

function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)   # one output bit per input bit
    # copy the raw bytes straight into the BitVector's UInt64 chunk storage
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end

h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)

julia> @btime hamming($h1b, $h2b, $buf)
  16.933 ns (0 allocations: 0 bytes)

Is bitwise Hamming Distance on GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster (or even the same speed) as on CPU.

Edit: BTW, I was able to beat the performance of a ~150 LOC Cython implementation with less than 30 lines of Julia code :wink:


Your approach looks good to me.

Quite late to the party, but I find that @lungben’s approach is 2x faster with

h1_64 = reinterpret(Int, h1)   # requires length(h1) to be a multiple of 8 bytes
h2_64 = reinterpret(Int, h2)
@btime hamming_distance($h1_64, $h2_64)

For a total of 20x speedup over the OP by replacing @lesshaste’s hamming function with

using LoopVectorization
function hamming(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @avx for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end

I get a crossover with bits1 and bits2 each at ~1792 bits (64 * (1024+512+256)), where @tturbo performs better than @turbo (@avx).
This is likely a cache-line effect: div(2*1792, 512) == 7