Maybe it differs by CPU, depending on whether the AVX instructions are worth it. It was quite a lot faster for me.
Yes, you would need to iterate if you have more than 128 bits. Luckily there is also UInt128 for n < 129.
It is (basically) the same as long as you use broadcast assignment (.=) as well. But doing a single xor on a pair of UInt64 values is faster still (I'm assuming it has something to do with arrays being allocated on the heap instead of the stack?). Some timings:
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.309 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.021 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64);
julia> y = rand(UInt64);
julia> @btime $x ⊻ $y;
0.019 ns (0 allocations: 0 bytes)
julia> x = rand(UInt128);
julia> y = rand(UInt128);
julia> @btime $x ⊻ $y
0.019 ns (0 allocations: 0 bytes)
EDIT: for the record, @inbounds made the timings worse for the BitVector and Vector{UInt64} cases on my computer as well.
EDIT2: The real question I have is why broadcast!(xor, z, x, y) is slower than z .= xor.(x, y):
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime broadcast!(⊻, $z, $x, $y);
18.687 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime broadcast!(⊻, $z, $x, $y);
16.280 ns (0 allocations: 0 bytes)
Okay, but generally you have more samples, or rather an unknown number of them, I would think.
As for the sub-nanosecond timings, those are not real; when that happens, the compiler is eliding the entire computation.
Right, of course:
julia> x = Ref(rand(UInt64));
julia> y = Ref(rand(UInt64));
julia> @btime $x[] ⊻ $y[];
1.750 ns (0 allocations: 0 bytes)
julia> x = Ref(rand(UInt128));
julia> y = Ref(rand(UInt128));
julia> @btime x[] ⊻ $y[];
31.367 ns (2 allocations: 64 bytes)
@lesshaste the important bit here for you is that using UInt128 is slower than BitVector.
Another trick to see the real timings is to use the setup feature of BenchmarkTools:
julia> @benchmark a ⊻ b setup=((a,b)=rand(UInt,2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.699 ns (0.00% GC)
median time: 1.800 ns (0.00% GC)
mean time: 1.769 ns (0.00% GC)
maximum time: 11.200 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
If you don’t plan on having more than 64 bits, then it’s certainly faster to work with UInt64 values directly. You can easily make a BitVector64 type that acts like a 64-element boolean vector but is represented as a single UInt64 value. It may or may not be worth it to have something that behaves in a vector-like fashion; BitVectors have more overhead since they are arbitrary-length.
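For illustration, a minimal sketch of such a wrapper could look like the following (BitVector64 and hamming64 are hypothetical names, not an existing type):

# Hypothetical sketch: a 64-element boolean vector stored in a single
# UInt64, so it is immutable and can stay in registers / on the stack.
struct BitVector64 <: AbstractVector{Bool}
    bits::UInt64
end

Base.size(::BitVector64) = (64,)
Base.getindex(v::BitVector64, i::Int) = (v.bits >> (i - 1)) % Bool

# Bitwise operations reduce to single machine instructions:
Base.xor(a::BitVector64, b::BitVector64) = BitVector64(xor(a.bits, b.bits))
hamming64(a::BitVector64, b::BitVector64) = count_ones(xor(a.bits, b.bits))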
You forgot a crucial part:
julia> @btime x[] ⊻ $y[];
55.262 ns (3 allocations: 96 bytes)
julia> @btime $x[] ⊻ $y[];
1.792 ns (0 allocations: 0 bytes)
Perhaps a StaticBitVector type that contains a tuple of Int64 values would be the best of both worlds. If someone has the time and is working on this anyway, implementing it would be a great addition.
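A minimal sketch of that idea (hypothetical names; I use UInt64 chunks, as BitVector does, rather than Int64):

# Hypothetical sketch: a fixed-size bit vector stored as a tuple of
# chunks, so it is stack-friendly like a StaticArray.
struct StaticBitVector{N} <: AbstractVector{Bool}
    chunks::NTuple{N,UInt64}
end

Base.size(::StaticBitVector{N}) where {N} = (64 * N,)

function Base.getindex(v::StaticBitVector, i::Int)
    chunk, bit = divrem(i - 1, 64)
    (v.chunks[chunk + 1] >> bit) % Bool
end

# map over a tuple is unrolled by the compiler and allocation-free:
Base.xor(a::StaticBitVector{N}, b::StaticBitVector{N}) where {N} =
    StaticBitVector(map(xor, a.chunks, b.chunks))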
An alternative implementation idea is to use the Bits package, which already has a BitVector1 type that wraps a single integer and is supposed to have an interface similar to BitVector. It’s on my radar to develop this type a bit and make it more useful. Then it could be used with integer types of any bit size to emulate a kind of “static bit-vector”.
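For reference, basic usage looks something like this (going from my reading of the package README; the LSB-first indexing convention is an assumption to verify):

using Bits  # https://github.com/rfourquet/Bits.jl

x = 0b1011
v = bits(x)   # BitVector1: an immutable AbstractVector{Bool} view of x's bits
v[1], v[4]    # (true, true), assuming index 1 is the least-significant bit
count(v)      # 3 set bits, via the generic AbstractVector interface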
A bit late to the party, but I just had a similar use case today. The fastest method I could find was working on UInt8 instead of BitArrays, using count_ones and LoopVectorization:
using LoopVectorization

# Hamming distance: popcount of the elementwise XOR.
function hamming_distance(h1, h2)
    s = 0
    @avx for i = 1:length(h1)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
using Random  # randstring

# hash_length is the hash size in bytes, as defined for the use case
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))
julia> @btime hamming_distance($h1, $h2)
11.813 ns (0 allocations: 0 bytes)
The method above (based on BitVectors of the same data) is slower for me:
# Copy the raw bytes of v into the chunks of a fresh BitVector,
# so each byte becomes 8 bits of the result.
function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)  # 8 bits per byte
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end
h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)
julia> @btime hamming($h1b, $h2b, $buf)
16.933 ns (0 allocations: 0 bytes)
Is bitwise Hamming distance on the GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster than (or even as fast as) the CPU.
Edit: BTW, I was able to beat the performance of a ~150 LOC Cython implementation with fewer than 30 lines of Julia code.
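For reference, the most direct CUDA.jl formulation I'm aware of is a fused map-reduce over the data (an untested sketch; with hashes this small, kernel launch overhead alone would likely dwarf the ~12 ns CPU time):

using CUDA

# Untested sketch: elementwise xor + popcount fused into a sum reduction.
hamming_gpu(h1::CuArray, h2::CuArray) =
    mapreduce((a, b) -> count_ones(xor(a, b)), +, h1, h2)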
Your approach looks good to me.
Quite late to the party, but I find that @lungben’s approach is 2x faster with

h1_64 = reinterpret(Int, h1)  # view the bytes as 64-bit integers; length must be a multiple of 8 bytes
h2_64 = reinterpret(Int, h2)
@btime hamming_distance($h1_64, $h2_64)
For a total of a 20x speedup over the OP, replacing @lesshaste’s hamming function with:
using LoopVectorization

function hamming(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @avx for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
I get a crossover, where @tturbo performs better than @turbo (@avx), when bits1 and bits2 are each ~1792 chunks long (64 * (1024+512+256) bits). This is likely a cache line effect: div(2*1792, 512) == 7.
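A sketch of the threaded variant, for anyone who wants to reproduce the crossover (the same chunk-wise kernel as above, only the macro changes):

using LoopVectorization

function hamming_tturbo(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @tturbo for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end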