Optimization: How to make sure XOR is performed in chunks

If you compute the Hamming distance between two 64-bit vectors represented as two UInt64 values, that is one XOR and one popcount assembly instruction. In total that most likely takes less than a nanosecond.
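For concreteness, a sketch of that single-word case (my own example, not code from the thread):

julia> hamming64(a::UInt64, b::UInt64) = count_ones(a ⊻ b);   # one xor, one popcnt

julia> hamming64(0x000000000000000f, 0x00000000000000ff)
4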

Ok, that makes sense. But that is, AFAIK, exactly what bitvectors do.

If you look at the assembly generated for the hamming function, you can see it is doing something much more expensive.
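One quick way to check that yourself (a sketch; hamming here stands in for the BitVector-based function from earlier in the thread, so adjust the call to its actual signature):

julia> using Random

julia> x, y = bitrand(64), bitrand(64);

julia> @code_native hamming(x, y)                    # loops over chunks, with loads and branches

julia> f(a, b) = count_ones(a ⊻ b);

julia> @code_native f(rand(UInt64), rand(UInt64))    # essentially just xor + popcnt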

That is odd, because:

julia> a = bitrand(64*3);

julia> a.chunks
3-element Array{UInt64,1}:
 0x223aff951d3eabc4
 0x3f59f423778c8580
 0xdcf4b11767dcb777

Bitvectors are just UInt64s under the hood. That is what’s supposed to make them fast.


To quote someone else:

"If you use a UInt64 instead of a BitVector, a few things happen:

  • UInts are stored directly in the arrays instead of pointers being stored.
  • No iteration is needed in hamming; it boils down to only an xor and a popcnt.
  • find_min can vectorize, running hamming in parallel."

I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of UInt64. Iteration is needed if you have an array of UInt64s, surely.

I don’t see why they are different. Something to do with broadcasting?

Maybe it differs by CPU whether the AVX is worth it. It was quite a lot faster for me.

Yes, you would need to iterate if you have more than 128 bits. Luckily there is also UInt128 for n < 129.

It is (basically) the same as long as you use broadcast assignment (.=) as well. But doing a single xor on a pair of UInt64 values is faster still (I’m assuming it has something to do with arrays being allocated on the heap instead of the stack?). Some timings:

julia> x = bitrand(64);

julia> y = bitrand(64);

julia> z = similar(x);

julia> @btime $z .= $x .⊻ $y;
  9.309 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64, 1);

julia> y = rand(UInt64, 1);

julia> z = similar(x);

julia> @btime $z .= $x .⊻ $y;
  9.021 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64);

julia> y = rand(UInt64);

julia> @btime $x ⊻ $y;
  0.019 ns (0 allocations: 0 bytes)

julia> x = rand(UInt128);

julia> y = rand(UInt128);

julia> @btime $x ⊻ $y
  0.019 ns (0 allocations: 0 bytes)

EDIT: for the record, @inbounds made the timings worse for the BitVector and Vector{UInt64} cases on my computer as well.

EDIT2: The real question I have is why using broadcast!(xor, z, x, y) is slower than z .= xor.(x, y):

julia> x = bitrand(64);

julia> y = bitrand(64);

julia> z = similar(x);

julia> @btime broadcast!(⊻, $z, $x, $y);
  18.687 ns (0 allocations: 0 bytes)

julia> x = rand(UInt64, 1);

julia> y = rand(UInt64, 1);

julia> z = similar(x);

julia> @btime broadcast!($⊻, $z, $x, $y);
  16.280 ns (0 allocations: 0 bytes)

Okay, but generally you have more, or rather, an unknown number of samples, I would think.

As for the sub-nanosecond timings, those are not real; when that happens, it’s the compiler eliding the entire computation.

Right, of course:

julia> x = Ref(rand(UInt64));

julia> y = Ref(rand(UInt64));

julia> @btime $x[] ⊻ $y[];
  1.750 ns (0 allocations: 0 bytes)

julia> x = Ref(rand(UInt128));

julia> y = Ref(rand(UInt128));

julia> @btime x[] ⊻ $y[];
  31.367 ns (2 allocations: 64 bytes)

@lesshaste the important bit here for you is that using UInt128 is slower than BitVector.

Another trick to see the real timings is to use the setup feature of BenchmarkTools:

julia> @benchmark a ⊻ b setup=((a,b)=rand(UInt,2))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.699 ns (0.00% GC)
  median time:      1.800 ns (0.00% GC)
  mean time:        1.769 ns (0.00% GC)
  maximum time:     11.200 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

If you don’t plan on having more than 64 bits, then it’s certainly faster to work with UInt64 values directly. You can easily make a BitVector64 type that acts like a 64-element boolean vector but is represented as a single UInt64 value. It may or may not be worth it to have something that behaves in a vector-like fashion. BitVectors have more overhead since they are arbitrary-length.
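A minimal sketch of what such a type could look like (my own illustration, not code from this thread or an existing package):

struct BitVector64 <: AbstractVector{Bool}
    bits::UInt64
end

Base.size(::BitVector64) = (64,)

function Base.getindex(v::BitVector64, i::Int)
    @boundscheck checkbounds(v, i)
    return (v.bits >> (i - 1)) & 0x1 == 0x1   # bit i, 1-based
end

# The Hamming distance is again a single xor + popcount.
hamming(a::BitVector64, b::BitVector64) = count_ones(a.bits ⊻ b.bits)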


You forgot a crucial part:

julia> @btime x[] ⊻ $y[];
  55.262 ns (3 allocations: 96 bytes)

julia> @btime $x[] ⊻ $y[];
  1.792 ns (0 allocations: 0 bytes)

Perhaps a StaticBitVector type that contains a tuple of Int64 values would be the best of both worlds. If someone has the time and is working on this anyway, implementing it would be a great addition.
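A rough sketch of that idea (my own, hedged; I use UInt64 chunks rather than Int64, and a real implementation would need the full AbstractVector interface):

struct StaticBitVector{N}
    chunks::NTuple{N,UInt64}   # fixed number of 64-bit chunks, stored inline (no heap allocation)
end

# The chunk count is a compile-time constant, so this can unroll into straight-line code.
hamming(a::StaticBitVector{N}, b::StaticBitVector{N}) where {N} =
    sum(count_ones.(a.chunks .⊻ b.chunks))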

An alternative implementation idea is to use the Bits package, which already has a BitVector1 type that wraps a single integer and is supposed to have a similar interface to BitVector. It’s on my radar to develop this type a bit and make it more useful. Then it could be used with integer types of any bit size to emulate a kind of “static bit-vector”.


A bit late to the party, but I just had a similar use case today.
The fastest method I could find was working on UInt8 values instead of BitArrays, using count_ones and LoopVectorization:

using Random, LoopVectorization

function hamming_distance(h1, h2)
    s = 0
    # XOR the chunks and count differing bits; @avx lets the loop vectorize
    @avx for i = 1:length(h1)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end

# hash_length is the hash size in bytes (defined elsewhere in my code)
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))

julia> @btime hamming_distance($h1, $h2)
  11.813 ns (0 allocations: 0 bytes)

The method from earlier in the thread (based on BitVectors) is slower for me:

function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)   # one output bit per input bit
    # copy the raw bytes straight into the BitVector's UInt64 chunk storage
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end

h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)

julia> @btime hamming($h1b, $h2b, $buf)
  16.933 ns (0 allocations: 0 bytes)

Is bitwise Hamming Distance on GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster (or even the same speed) as on CPU.

Edit: BTW, I was able to beat the performance of a ~150 LOC Cython implementation with less than 30 lines of Julia code :wink:


Your approach looks good to me.

Quite late to the party, but I find that @lungben’s approach is 2x faster with

h1_64 = reinterpret(Int, h1)   # requires length(h1) to be a multiple of 8 bytes
h2_64 = reinterpret(Int, h2)
@btime hamming_distance($h1_64, $h2_64)

For a total of 20x speedup over the OP by replacing @lesshaste’s hamming function with

using LoopVectorization
function hamming(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @avx for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end

I get a crossover with bits1 and bits2 each at ~1792 bits (64 * (1024+512+256)), where @tturbo performs better than @turbo (@avx).
This is likely a cache-line effect: div(2*1792, 512) == 7