If you compute the Hamming distance between two 64-bit vectors represented as two UInt64 values, that is one XOR and one popcount assembly instruction, most likely less than a nanosecond in total.
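As a minimal sketch of that claim (the function name is mine), the entire computation for the 64-bit case is:

```julia
# Hamming distance of two 64-bit vectors packed into UInt64 values:
# one xor plus one count_ones, which lowers to a single popcnt instruction.
hamming64(a::UInt64, b::UInt64) = count_ones(a ⊻ b)

hamming64(0b1011 % UInt64, 0b0001 % UInt64)  # bits 2 and 4 differ, so 2
```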
OK, that makes sense. But that is, AFAIK, exactly what bitvectors do.
If you look at the assembly you get from the function `hamming`, you can see it is doing something much more expensive.
That is odd, because:
julia> a = bitrand(64*3);
julia> a.chunks
3-element Array{UInt64,1}:
0x223aff951d3eabc4
0x3f59f423778c8580
0xdcf4b11767dcb777
BitVectors are just `UInt64`s under the hood. That is what’s supposed to make them fast.
To quote someone else:
"If you use a `UInt64` instead of a `BitVector`, a few things happen:
- UInts are stored directly in the arrays instead of pointers being stored
- No iteration is needed in `hamming`. The `hamming` boils down to only an `xor` and a `popcnt`.
- `find_min` can vectorize, running `hamming` in parallel."
I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of `UInt64`. Iteration is needed if you have an array of `UInt64`s, surely. I don’t see why they are different. Something to do with broadcasting?
Maybe differs on the CPU if the AVX is worth it. Was quite a lot faster for me.
Yes, you would need to iterate if you have more than 128 bits. Luckily there is also `UInt128` for n < 129.
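For example, a sketch of the ≤128-bit case (the name is hypothetical) looks identical to the 64-bit one; `count_ones` on a `UInt128` still needs no loop over chunks:

```julia
# Up to 128 bits fit in a single UInt128, so the distance is still
# just one xor followed by count_ones (two popcnt instructions).
hamming128(a::UInt128, b::UInt128) = count_ones(a ⊻ b)

hamming128(UInt128(0b111), UInt128(0))  # three differing bits
```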
It is (basically) the same as long as you use the broadcasted `.=` as well. But doing a single `xor` on a pair of `UInt64` is faster still (I’m assuming something to do with arrays being allocated on the heap instead of the stack?). Some timings:
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.309 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime $z .= $x .⊻ $y;
9.021 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64);
julia> y = rand(UInt64);
julia> @btime $x ⊻ $y;
0.019 ns (0 allocations: 0 bytes)
julia> x = rand(UInt128);
julia> y = rand(UInt128);
julia> @btime $x ⊻ $y
0.019 ns (0 allocations: 0 bytes)
EDIT: for the record, `@inbounds` made the timings worse for the `BitVector` and `Vector{UInt64}` cases on my computer as well.
EDIT2: The real question I have is why using `broadcast!(xor, z, x, y)` is slower than `z .= xor.(x, y)`:
julia> x = bitrand(64);
julia> y = bitrand(64);
julia> z = similar(x);
julia> @btime broadcast!(⊻, $z, $x, $y);
18.687 ns (0 allocations: 0 bytes)
julia> x = rand(UInt64, 1);
julia> y = rand(UInt64, 1);
julia> z = similar(x);
julia> @btime broadcast!($⊻, $z, $x, $y);
16.280 ns (0 allocations: 0 bytes)
Okay, but generally you have more, or rather, an unknown number of samples, I would think.
As for the sub-nanosecond timings, those are not real, when that happens, it’s the compiler eliding the entire computation.
Right of course:
julia> x = Ref(rand(UInt64));
julia> y = Ref(rand(UInt64));
julia> @btime $x[] ⊻ $y[];
1.750 ns (0 allocations: 0 bytes)
julia> x = Ref(rand(UInt128));
julia> y = Ref(rand(UInt128));
julia> @btime x[] ⊻ $y[];
31.367 ns (2 allocations: 64 bytes)
@lesshaste the important bit here for you is that using `UInt128` is slower than `BitVector`.
Another trick to see the real timings is to use the `setup` feature of BenchmarkTools:
julia> @benchmark a ⊻ b setup=((a,b)=rand(UInt,2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.699 ns (0.00% GC)
median time: 1.800 ns (0.00% GC)
mean time: 1.769 ns (0.00% GC)
maximum time: 11.200 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
If you don’t plan on having more than 64 bits, then it’s certainly faster to work with UInt64 values directly. You can easily make a BitVector64 type that acts like a 64-element boolean vector but is represented as a single UInt64 value. It may or may not be worth it to have something that behaves in a vectorlike fashion. BitVectors have more overhead since they are arbitrary length.
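A minimal sketch of such a type (hypothetical; only a couple of methods are implemented here, assuming indexing and Hamming distance are what's needed):

```julia
# Hypothetical BitVector64: a 64-element boolean vector backed by one UInt64.
struct BitVector64 <: AbstractVector{Bool}
    bits::UInt64
end

Base.size(::BitVector64) = (64,)
# Element i (1-based) is bit i-1 of the underlying word.
Base.getindex(v::BitVector64, i::Int) = ((v.bits >> (i - 1)) & 1) == 1

# Hamming distance is again a single xor and popcount, no iteration.
hamming(a::BitVector64, b::BitVector64) = count_ones(a.bits ⊻ b.bits)
```

Since the struct holds a plain `UInt64`, a `Vector{BitVector64}` stores the words inline rather than as pointers, which is the layout advantage quoted earlier in the thread.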
You forgot a crucial part:
julia> @btime x[] ⊻ $y[];
55.262 ns (3 allocations: 96 bytes)
julia> @btime $x[] ⊻ $y[];
1.792 ns (0 allocations: 0 bytes)
Perhaps a `StaticBitVector` type that contains a tuple of `Int64` would be the best of both worlds. If someone has the time and is working on this anyway, implementing it would be a great addition.
An alternative implementation idea is to use the Bits package, which already has a `BitVector1` type, which wraps a single integer and is supposed to have a similar interface to `BitVector`. It’s on my radar to develop this type a bit and make it more useful. Then this can be used with integer types of any bit-size to emulate a kind of “static bit-vector”.
A bit late to the party, but I just had a similar use case today.
The fastest method I could find was working on `UInt8` instead of BitArrays, using `count_ones` and LoopVectorization:
using LoopVectorization

function hamming_distance(h1, h2)
    s = 0
    @avx for i = 1:length(h1)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
h1 = Vector{UInt8}(randstring(hash_length))
h2 = Vector{UInt8}(randstring(hash_length))
julia> @btime hamming_distance($h1, $h2)
11.813 ns (0 allocations: 0 bytes)
The method above (based on the same BitVectors) is slower for me:
function make_bitvector(v::Vector{UInt8})
    siz = sizeof(v)
    bv = falses(siz << 3)
    unsafe_copyto!(reinterpret(Ptr{UInt8}, pointer(bv.chunks)), pointer(v), siz)
    bv
end
h1b = make_bitvector(h1)
h2b = make_bitvector(h2)
buf = similar(h1b)
julia> @btime hamming($h1b, $h2b, $buf)
16.933 ns (0 allocations: 0 bytes)
Is bitwise Hamming Distance on GPU an option to get even more speed? With my very limited GPU knowledge I could not find a way to make it faster (or even the same speed) as on CPU.
Edit: btw, I was able to beat the performance of a ~150 LOC Cython implementation with less than 30 lines of Julia code.
Your approach looks good to me.
Quite late to the party, but I find that @lungben’s approach is 2x faster with
h1_64 = reinterpret(Int, h1)
h2_64 = reinterpret(Int, h2)
@btime hamming_distance($h1_64, $h2_64)
For a total of 20x speedup in the OP by replacing @lesshaste’s `hamming` function with
using LoopVectorization

function hamming(bits1, bits2)
    h1, h2 = bits1.chunks, bits2.chunks
    s = 0
    @avx for i = eachindex(h1, h2)
        s += count_ones(xor(h1[i], h2[i]))
    end
    s
end
I get a crossover with `bits1`, `bits2` each ~1792 bits (64 * (1024+512+256)), beyond which `@tturbo` performs better than `@turbo` (`@avx`).
This is likely a cache-line effect: `div(2*1792, 512) == 7`.