Optimization: How to make sure XOR is performed in chunks

I have written my first complete piece of Julia code as follows:

using Random: bitrand

function hamming(bits1, bits2)
    return count(bits1 .⊻ bits2)
end

function random_strings(n, N)
    return [bitrand(n) for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j])
            if dist < minsofar
                minsofar = dist
                # print("Running min ", minsofar, "\n")
            end
        end
    end
    return minsofar
end

function ave_min(n, N)
    ITER = 100
    total = 0
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N)
        print("New min ", new_min, "\n")
        total += new_min  # Fixed following comment
    end
    return total/ITER
end

N = 2^13
n = 150

print("Overall average ", ave_min(n, N), "\n")

Unfortunately, my efforts to write a super fast Hamming distance function have failed. I am told this is because the XOR in the hamming function is being performed bit by bit (and hence slowly), instead of in 64-bit chunks.

I read the Xor performance thread, but it’s currently beyond me to work out exactly what I need to do to speed the code up.

Could anyone please help me speed up this code?

You should primarily worry about the general Performance Tips before worrying about the speed of xor 🙂 Your code creates a lot of intermediate arrays in several places (notably random_strings and hamming), and you’re also doing some redundant work (why call find_min twice and thus recalculate new_min, instead of adding the existing variable to the total?).

Not everything from the general performance tips is going to be applicable, but avoiding allocations is sure to speed up your code (although I have not benchmarked it yet).
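To make the allocation point concrete, here is a small sketch of my own (using the standard @allocated macro) showing that the broadcast in hamming allocates a temporary BitVector on every call:

```julia
using Random: bitrand

hamming(bits1, bits2) = count(bits1 .⊻ bits2)  # original: bits1 .⊻ bits2 allocates each call

a, b = bitrand(150), bitrand(150)
hamming(a, b)                      # warm up (compile) first
alloc = @allocated hamming(a, b)   # bytes allocated by a single call
println(alloc)                     # nonzero: each call creates a temporary BitVector
```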


The duplication of find_min was just a mistake/typo. How can I avoid making the intermediate arrays?

The code is generally fine, and the allocations in random_strings here don’t really matter. The issue is indeed with hamming.

What you can do is pass a buffer to hamming in which to store the results of the xor:

using Random: bitrand

hamming(bits1, bits2, x) = count(map!(⊻,x,bits1,bits2))

random_strings(n, N) = [bitrand(n) for i in 1:N]

function find_min(strings, n, N, x)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j], x)
            if dist < minsofar
                minsofar = dist
            end
        end
    end
    return minsofar
end

function ave_min(n, N, ITER=100)
    total = 0
    x = bitrand(n) # buffer
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N, x)
        println("Iteration $i min: $new_min")
        total += new_min
    end
    return total/ITER
end

I made ITER a parameter because 100 iterations took too long. With 10 iterations:

@time println("Overall average ", ave_min(n, N, 10))

I get the following for the original version:

 22.074000 seconds (671.26 M allocations: 45.010 GiB, 18.84% gc time)

For the in-place version:

  5.385643 seconds (164.01 k allocations: 11.882 MiB)
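As a quick sanity check (my own sketch, not part of the timings above), the buffered version agrees with the original:

```julia
using Random: bitrand

hamming(bits1, bits2) = count(bits1 .⊻ bits2)               # original, allocating
hamming(bits1, bits2, x) = count(map!(⊻, x, bits1, bits2))  # buffered, in-place

a, b = bitrand(150), bitrand(150)
x = similar(a)  # reusable buffer
@assert hamming(a, b, x) == hamming(a, b)
```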

Some benchmarks of yours versus the original hamming on my machine:

julia> @benchmark hamming(x, y, z) setup=(x=bitrand(150);y=bitrand(150);z=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.553 ns (0.00% GC)
  median time:      32.998 ns (0.00% GC)
  mean time:        33.385 ns (0.00% GC)
  maximum time:     167.304 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark Hamming.hamming(x, y) setup=(x=bitrand(150);y=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  2
  --------------
  minimum time:     59.120 ns (0.00% GC)
  median time:      109.120 ns (0.00% GC)
  mean time:        132.532 ns (7.03% GC)
  maximum time:     7.330 μs (95.79% GC)
  --------------
  samples:          10000
  evals/sample:     932

You can get rid of the buffer by implementing a custom function too if you want:

function hamming(A::BitArray, B::BitArray)
    #size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac,Bc = A.chunks, B.chunks
    W = 0
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))  # parenthesized: & binds tighter than ⊻
    return W
end

This gets you another small speedup:

julia> @benchmark hamming(x,y,z) setup=(x=bitrand(n);y=bitrand(n);z=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.141 ns (0.00% GC)
  median time:      28.844 ns (0.00% GC)
  mean time:        29.295 ns (0.00% GC)
  maximum time:     109.950 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark hamming(x,y) setup=(x=bitrand(n);y=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.463 ns (0.00% GC)
  median time:      21.565 ns (0.00% GC)
  mean time:        21.930 ns (0.00% GC)
  maximum time:     240.221 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

It is customary to name mutating functions with a !, and normally also to put the mutated object first: so hamming!(x, b1, b2).

You can also make it faster by using broadcasting instead of map!:

hamming!(x, b1, b2) = (x .= b1 .⊻ b2; return count(x))

This is roughly as fast as the implementation that accesses the .chunks field.

Pros: simpler implementation that does not rely on internal fields and unexported undocumented functions.
Cons: needs input buffer.
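For completeness, a quick check (my sketch) that this broadcasting version matches the naive one:

```julia
using Random: bitrand

hamming!(x, b1, b2) = (x .= b1 .⊻ b2; count(x))  # buffer x is mutated, hence the !

b1, b2 = bitrand(150), bitrand(150)
x = similar(b1)
@assert hamming!(x, b1, b2) == count(b1 .⊻ b2)
```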

That speeds up the code (for ITER = 10) from 31 to 5 seconds, which is great. Thank you.

I am now wondering if doing this in chunks somehow prevents llvm from vectorizing the code (e.g. using SIMD)?

Adding an @inbounds (in front of the for loop in hamming) probably gives you SIMD.
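Concretely, that suggestion amounts to something like the following sketch (the same chunk-based function as above, with @inbounds on the loop; whether it actually enables SIMD depends on your machine):

```julia
using Random: bitrand

function hamming_chunks(A::BitArray, B::BitArray)
    Ac, Bc = A.chunks, B.chunks
    W = 0
    @inbounds for i = 1:(length(Ac)-1)  # @inbounds drops bounds checks in the hot loop
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))  # mask off unused trailing bits
    return W
end

a, b = bitrand(150), bitrand(150)
@assert hamming_chunks(a, b) == count(a .⊻ b)
```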

Here, the mutated part is only used as a buffer and is not a result of the function. In those cases I usually don’t put an exclamation mark as I don’t care what goes in or what comes out.

I did try adding @inbounds but it didn’t help, in fact, it made it worse by ~10%. I just rechecked and got the same thing. Any idea what could be going on?

The function name signals that it is safe to pass in that vector. I would definitely prefer that ! to be there.

I got an explanation that the problem is the use of bitvectors. It seems you can make it hugely faster by using UInt64 or UInt128 instead (assuming n < 129). I haven’t had a chance to test it yet though.

That seems very wrong to me. Bitvectors should be vastly faster than UInt64, since you basically do 64 operations simultaneously.

If you compute the Hamming distance between two 64-bit strings represented as UInt64 values, that is one XOR and one popcount assembly instruction, most likely less than a nanosecond in total.
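In code, that single-word case looks like this (my sketch, not from the thread):

```julia
# Hamming distance between two 64-bit strings packed into UInt64s:
hamming64(a::UInt64, b::UInt64) = count_ones(a ⊻ b)  # one xor + one popcount

hamming64(UInt64(0b1010), UInt64(0b0110))  # → 2 (two differing bits)
```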

Ok, that makes sense. But that is, AFAIK, exactly what bitvectors do.

If you look at the assembly you get from the function hamming you can see it is doing something much more expensive.

That is odd, because:

julia> a = bitrand(64*3);

julia> a.chunks
3-element Array{UInt64,1}:
 0x223aff951d3eabc4
 0x3f59f423778c8580
 0xdcf4b11767dcb777

Bitvectors are just UInt64s under the hood. That is what’s supposed to make them fast.

To quote someone else:

If you use a UInt64 instead of a BitVector, a few things happen:

  • UInt64s are stored directly in the array instead of pointers being stored
  • No iteration is needed in hamming. The hamming boils down to only an xor and a popcnt.
  • find_min can vectorize, running hamming in parallel.
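To make that quoted suggestion concrete, here is a hypothetical sketch storing each string in a single UInt128 (this only works for n ≤ 128, so it would not cover the original n = 150 without splitting each string across two words):

```julia
using Random

hamming(a::UInt128, b::UInt128) = count_ones(a ⊻ b)  # one xor + one popcount, no loop

function find_min(strings::Vector{UInt128})
    minsofar = typemax(Int)
    for i in eachindex(strings), j in i+1:length(strings)
        d = hamming(strings[i], strings[j])
        d < minsofar && (minsofar = d)
    end
    return minsofar
end

strings = rand(UInt128, 2^10)  # smaller N than the original, for a quick demo
println(find_min(strings))
```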

I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of UInt64. Iteration is needed if you have an array of UInt64s, surely.

I don’t see why they are different. Something to do with broadcasting?