Optimization: How to make sure XOR is performed in chunks

A bit late to the party, but I just had a similar use case today.
The fastest method I could find works on `UInt8` vectors instead of `BitArray`s, combining `count_ones` with LoopVectorization:

using LoopVectorization  # provides the @avx vectorization macro

function hamming_distance(h1, h2)
    s = 0
    @avx for i = 1:length(h1)  # assumes length(h1) == length(h2)
        s += count_ones(xor(h1[i], h2[i]))  # popcount of the differing bits
    end
    s
end

using Random  # for randstring

# hash_length is the hash size in bytes
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))

julia> @btime hamming_distance($h1, $h2)
  11.813 ns (0 allocations: 0 bytes)
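For comparison, here is a dependency-free variant (my own sketch, not benchmarked in the original post): on machines with a hardware popcount instruction, `count_ones` lowers to `popcnt`, so a plain `@inbounds @simd` loop should get close to the LoopVectorization version.

```julia
# Plain-Julia baseline without LoopVectorization (hypothetical comparison).
# @inbounds skips bounds checks; @simd permits vectorizing the reduction.
function hamming_distance_base(h1, h2)
    s = 0
    @inbounds @simd for i in eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
```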

The BitVector-based method posted earlier in this thread (the `hamming(h1b, h2b, buf)` variant) is slower for me:

function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)  # 8 bits per byte
    # copy the raw bytes of v into the BitVector's UInt64 chunk storage
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end

h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)

julia> @btime hamming($h1b, $h2b, $buf)
  16.933 ns (0 allocations: 0 bytes)
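An alternative that avoids the temporary buffer entirely is to XOR the underlying `UInt64` chunks of the two `BitVector`s directly. A minimal sketch (my own, assuming equal-length inputs):

```julia
# Hamming distance on the packed UInt64 chunks of two BitVectors,
# with no intermediate buffer. Assumes h1b and h2b have equal length;
# any trailing bits past length(h1b) are zero in both chunk arrays,
# so they do not contribute to the count.
function hamming_chunks(h1b::BitVector, h2b::BitVector)
    c1, c2 = h1b.chunks, h2b.chunks
    s = 0
    @inbounds for i in eachindex(c1, c2)
        s += count_ones(xor(c1[i], c2[i]))
    end
    s
end
```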

Is a bitwise Hamming distance on the GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster than (or even as fast as) the CPU version.

Edit: by the way, I was able to beat the performance of a ~150-LOC Cython implementation with fewer than 30 lines of Julia code :wink:
