Optimization: How to make sure XOR is performed in chunks

You can also get rid of the buffer entirely by writing a custom function that works on the underlying 64-bit chunks directly:

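function hamming(A::BitArray, B::BitArray)
    #size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac, Bc = A.chunks, B.chunks
    W = 0
    # XOR and popcount one 64-bit chunk at a time
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    # last chunk: mask off the unused trailing bits (& binds tighter than ⊻, hence the parentheses)
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))
    return W
end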

This gets you another small speedup over the buffered three-argument version (minimum time drops from about 28 ns to about 21 ns):

julia> @benchmark hamming(x,y,z) setup=(x=bitrand(n);y=bitrand(n);z=bitrand(n))
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     28.141 ns (0.00% GC)
  median time:      28.844 ns (0.00% GC)
  mean time:        29.295 ns (0.00% GC)
  maximum time:     109.950 ns (0.00% GC)
  samples:          10000
  evals/sample:     995

julia> @benchmark hamming(x,y) setup=(x=bitrand(n);y=bitrand(n))
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     21.463 ns (0.00% GC)
  median time:      21.565 ns (0.00% GC)
  mean time:        21.930 ns (0.00% GC)
  maximum time:     240.221 ns (0.00% GC)
  samples:          10000
  evals/sample:     997
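
As a sanity check, the chunked count can be compared against a plain element-wise XOR followed by `count`. The following is only a sketch: `hamming_chunked` and `hamming_naive` are names made up for this example, the function body is a compact restatement of the one above with the size check enabled and an empty-array guard added, and `A.chunks` and `Base._msk_end` are Base internals that may change between Julia versions.

```julia
using Random  # provides bitrand

# Compact restatement of the chunked Hamming distance (hypothetical name),
# so that this check runs on its own.
function hamming_chunked(A::BitArray, B::BitArray)
    size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    isempty(A) && return 0           # no chunks to index in an empty BitArray
    Ac, Bc = A.chunks, B.chunks      # internal field: Vector{UInt64}
    W = 0
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    # Mask the unused trailing bits of the last chunk.
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))
    return W
end

# Reference version: element-wise XOR, then count the set bits.
# This allocates a temporary BitArray, which the chunked version avoids.
hamming_naive(A::BitArray, B::BitArray) = count(A .⊻ B)

# Compare on lengths around the 64-bit chunk boundary.
for n in (1, 63, 64, 65, 1000)
    x, y = bitrand(n), bitrand(n)
    @assert hamming_chunked(x, y) == hamming_naive(x, y)
end
```

Lengths of 63, 64 and 65 are included because they exercise the last-chunk masking on either side of a chunk boundary.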