# Optimization: How to make sure XOR is performed in chunks

I have written my first complete piece of Julia code as follows:

```julia
using Random: bitrand

function hamming(bits1, bits2)
    return count(bits1 .⊻ bits2)
end

function random_strings(n, N)
    return [bitrand(n) for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j])
            if dist < minsofar
                minsofar = dist
                # print("Running min ", minsofar, "\n")
            end
        end
    end
    return minsofar
end

function ave_min(n, N)
    ITER = 100
    total = 0
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N)
        print("New min ", new_min, "\n")
        total += new_min  # Fixed following comment
    end
    return total / ITER
end

N = 2^13
n = 150

print("Overall average ", ave_min(n, N), "\n")
```

Unfortunately, my efforts to write a super fast Hamming distance function have failed. I am told this is because the XOR in the `hamming` function is being performed bit by bit (and hence slowly), instead of in 64-bit chunks.

I read the Xor performance thread, but it's currently beyond me to work out exactly what I need to do to speed the code up.

You should primarily worry about the general Performance Tips before worrying about the speed of `xor`. Your code creates a lot of intermediate arrays in a bunch of different places (notably `random_strings` and `hamming`), and you're also doing some redundant work (why call `find_min` twice, and thus recalculate the `new_min` value, instead of adding the existing variable to the total?).

Not everything from the general performance tips is going to be applicable, but avoiding allocations is sure to speed up your code (although I have not benchmarked it yet).
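To illustrate the allocation point, here is a sketch comparing the allocating broadcast against an in-place `map!` into a preallocated buffer, measured with `@allocated` (the names `alloc_count` and `inplace_count` are made up for this demo):

```julia
using Random: bitrand

# Allocating version: `x .⊻ y` builds a temporary BitVector on every call.
alloc_count(x, y) = count(x .⊻ y)

# Non-allocating version: reuse `buf` via `map!`.
inplace_count(x, y, buf) = count(map!(⊻, buf, x, y))

a, b = bitrand(150), bitrand(150)
buf = similar(a)

alloc_count(a, b); inplace_count(a, b, buf)  # run once to compile
@show @allocated alloc_count(a, b)           # nonzero: a temporary is allocated
@show @allocated inplace_count(a, b, buf)    # 0 bytes
```

Both versions return the same distance; only the allocation behavior differs.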


The duplication of find_min was just a mistake/typo. How can I avoid making the intermediate arrays?

The code is generally fine, and the allocations in `random_strings` here don’t really matter. The issue is indeed with `hamming`.

What you can do is pass a buffer to hamming in which to store the results of the `xor`:

```julia
using Random: bitrand

hamming(bits1, bits2, x) = count(map!(⊻, x, bits1, bits2))

random_strings(n, N) = [bitrand(n) for i in 1:N]

function find_min(strings, n, N, x)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j], x)
            if dist < minsofar
                minsofar = dist
            end
        end
    end
    return minsofar
end

function ave_min(n, N, ITER=100)
    total = 0
    x = bitrand(n)  # buffer reused across all hamming calls
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N, x)
        println("Iteration $i min: $new_min")
        total += new_min
    end
    return total / ITER
end
```

I passed `ITER` as a variable because 100 was too much. With 10 iterations:

```julia
@time println("Overall average ", ave_min(n, N, 10))
```

I get the following for the original version:

```
 22.074000 seconds (671.26 M allocations: 45.010 GiB, 18.84% gc time)
```

For the in-place version:

```
  5.385643 seconds (164.01 k allocations: 11.882 MiB)
```

Some benchmarks of yours versus the original `hamming` on my machine:

```julia
julia> @benchmark hamming(x, y, z) setup=(x=bitrand(150); y=bitrand(150); z=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.553 ns (0.00% GC)
  median time:      32.998 ns (0.00% GC)
  mean time:        33.385 ns (0.00% GC)
  maximum time:     167.304 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark Hamming.hamming(x, y) setup=(x=bitrand(150); y=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  2
  --------------
  minimum time:     59.120 ns (0.00% GC)
  median time:      109.120 ns (0.00% GC)
  mean time:        132.532 ns (7.03% GC)
  maximum time:     7.330 μs (95.79% GC)
  --------------
  samples:          10000
  evals/sample:     932
```

You can get rid of the buffer by implementing a custom function too if you want:

```julia
function hamming(A::BitArray, B::BitArray)
    #size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac, Bc = A.chunks, B.chunks
    W = 0
    for i in 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    # parenthesize the xor: `&` binds tighter than `⊻` in Julia
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))
    return W
end
```

This gets you another small speedup:

```julia
julia> @benchmark hamming(x, y, z) setup=(x=bitrand(n); y=bitrand(n); z=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.141 ns (0.00% GC)
  median time:      28.844 ns (0.00% GC)
  mean time:        29.295 ns (0.00% GC)
  maximum time:     109.950 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark hamming(x, y) setup=(x=bitrand(n); y=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.463 ns (0.00% GC)
  median time:      21.565 ns (0.00% GC)
  mean time:        21.930 ns (0.00% GC)
  maximum time:     240.221 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
```

It is customary to name mutating functions with a `!`, and normally also to put the mutated object first: so `hamming!(x, b1, b2)`.

You can also make it faster by using broadcasting instead of `map!`:

```julia
hamming!(x, b1, b2) = (x .= b1 .⊻ b2; return count(x))
```

This is roughly as fast as the implementation that accesses the `.chunks` field.

Pros: simpler implementation that does not rely on internal fields and unexported undocumented functions.
Cons: needs input buffer.

That speeds up the code (for ITER = 10) from 31 to 5 seconds, which is great. Thank you.

I am now wondering if doing this in chunks somehow prevents llvm from vectorizing the code (e.g. using SIMD)?

Adding an `@inbounds` (in front of the `for` loop in `hamming`) probably gives you SIMD.
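Applied to the chunk-based version from earlier in the thread, that suggestion would look something like the following sketch (`hamming_inbounds` is a made-up name, `Base._msk_end` is an internal helper, and whether LLVM actually emits SIMD for this loop depends on the hardware):

```julia
using Random: bitrand

function hamming_inbounds(A::BitArray, B::BitArray)
    Ac, Bc = A.chunks, B.chunks
    W = 0
    @inbounds for i in 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    # mask off the unused trailing bits of the final chunk (internal helper)
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))
    return W
end

a, b = bitrand(150), bitrand(150)
hamming_inbounds(a, b)
```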

Here, the mutated part is only used as a buffer and is not a result of the function. In those cases I usually don’t put an exclamation mark as I don’t care what goes in or what comes out.

I did try adding `@inbounds` but it didn’t help, in fact, it made it worse by ~10%. I just rechecked and got the same thing. Any idea what could be going on?

The function name signals that it is safe to pass in that vector. I would definitely prefer that `!` to be there.

I got an explanation that the problem is the use of bitvectors. It seems you can make it hugely faster by using UInt64 or UInt128 instead (assuming n < 129). I haven’t had a chance to test it yet though.

That seems very wrong to me. Bitvectors should be vastly faster than `UInt64`, since you basically do 64 operations simultaneously.

If you compute the Hamming distance between two 64 bit vectors represented as two UInt64 values that is one XOR and one popcount assembly instruction. That is most likely less than a nanosecond in total.
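Concretely, that computation is a one-liner (`hamming_u64` is a made-up name for this sketch):

```julia
# Hamming distance of two 64-bit strings packed into UInt64s:
# a single xor plus a hardware popcount (`count_ones` compiles to popcnt).
hamming_u64(a::UInt64, b::UInt64) = count_ones(a ⊻ b)

hamming_u64(UInt64(0b1011), UInt64(0b0001))  # == 2
```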

Ok, that makes sense. But that is, AFAIK, exactly what bitvectors do.

If you look at the assembly you get from the function `hamming`, you can see it is doing something much more expensive.

That is odd, because:

```julia
julia> a = bitrand(64*3);

julia> a.chunks
3-element Array{UInt64,1}:
 0x223aff951d3eabc4
 0x3f59f423778c8580
 0xdcf4b11767dcb777
```

Bitvectors are just `UInt64`s under the hood. That is what’s supposed to make them fast.
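One way to check this directly: compute the distance chunk-wise over the raw `UInt64` storage and compare it with the broadcast version. (This relies on the internal `chunks` field and on `BitArray` keeping unused trailing bits zeroed, so it's an implementation-detail demo, not stable API.)

```julia
using Random: bitrand

a, b = bitrand(150), bitrand(150)

# Chunk-wise Hamming distance over the raw UInt64 storage.
chunkwise = sum(count_ones(ac ⊻ bc) for (ac, bc) in zip(a.chunks, b.chunks))

chunkwise == count(a .⊻ b)  # true
```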

To quote someone else:

> If you use a `UInt64` instead of a BitVector, a few things happen:
>
> - UInts are stored directly in the array instead of pointers being stored
> - No iteration is needed in `hamming`; it boils down to only an `xor` and a `popcnt`.
> - `find_min` can vectorize, running `hamming` in parallel.

I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of `UInt64`. Iteration is needed if you have an array of `UInt64`s, surely.

I don’t see why they are different. Something to do with broadcasting?
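For reference, a minimal sketch of the `UInt64` representation being discussed, assuming n ≤ 64 (the thread's n = 150 would need, say, a tuple of three `UInt64`s per string); `random_words` and `find_min_words` are made-up names:

```julia
using Random

# Each "bit string" is a single machine word; one xor + popcount per pair,
# and the words are stored inline in the Vector, with no pointer chasing.
random_words(N) = rand(UInt64, N)

function find_min_words(words)
    minsofar = 64
    @inbounds for i in eachindex(words), j in i+1:length(words)
        d = count_ones(words[i] ⊻ words[j])
        d < minsofar && (minsofar = d)
    end
    return minsofar
end

find_min_words(random_words(2^13))
```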