Optimization: How to make sure XOR is performed in chunks

I have written my first complete piece of Julia code as follows:

using Random: bitrand

function hamming(bits1, bits2)
    return count(bits1 .⊻ bits2)
end

function random_strings(n, N)
    return [bitrand(n) for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j])
            if dist < minsofar
                minsofar = dist
                # print("Running min ", minsofar, "\n")
            end
        end
    end
    return minsofar
end

function ave_min(n, N)
    ITER = 100
    total = 0
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N)
        print("New min ", new_min, "\n")
        total += new_min  # Fixed following comment
    end
    return total/ITER
end

N = 2^13
n = 150

print("Overall average ", ave_min(n, N), "\n")

Unfortunately, my efforts to write a super fast Hamming distance function have failed. I am told this is because the XOR in the hamming function is being performed bit by bit (and hence slowly), instead of in 64-bit chunks.

I read the Xor performance thread, but it’s currently beyond me to work out exactly what I need to do to speed the code up.

Could anyone please help me speed up this code?

You should primarily worry about the general Performance Tips before worrying about the speed of xor 🙂 Your code creates a lot of intermediate arrays in several places (notably random_strings and hamming), and you’re also doing some redundant work (why call find_min twice and thus recalculate new_min, instead of adding the existing variable to the total?).

Not everything from the general performance tips is going to be applicable, but avoiding allocations is sure to speed up your code (although I have not benchmarked it yet).
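To make the allocation point concrete, here is a small sketch of my own (using the standard @allocated macro) showing that the broadcast in hamming allocates a temporary BitVector on every call:

```julia
using Random: bitrand

hamming(bits1, bits2) = count(bits1 .⊻ bits2)  # original: bits1 .⊻ bits2 allocates each call

a, b = bitrand(150), bitrand(150)
hamming(a, b)                      # warm up (compile) first
alloc = @allocated hamming(a, b)   # bytes allocated by a single call
println(alloc)                     # nonzero: each call creates a temporary BitVector
```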


The duplication of find_min was just a mistake/typo. How can I avoid making the intermediate arrays?

The code is generally fine, and the allocations in random_strings here don’t really matter. The issue is indeed with hamming.

What you can do is pass a buffer to hamming in which to store the results of the xor:

using Random: bitrand

hamming(bits1, bits2, x) = count(map!(⊻,x,bits1,bits2))

random_strings(n, N) = [bitrand(n) for i in 1:N]

function find_min(strings, n, N, x)
    minsofar = n
    for i in 1:N
        for j in i+1:N
            dist = hamming(strings[i], strings[j], x)
            if dist < minsofar
                minsofar = dist
            end
        end
    end
    return minsofar
end

function ave_min(n, N, ITER=100)
    total = 0
    x = bitrand(n) # buffer
    for i in 1:ITER
        strings = random_strings(n, N)
        new_min = find_min(strings, n, N, x)
        println("Iteration $i min: $new_min")
        total += new_min
    end
    return total/ITER
end

I made ITER a parameter because 100 iterations took too long. With 10 iterations:

@time println("Overall average ", ave_min(n, N, 10))

I get the following for the original version:

 22.074000 seconds (671.26 M allocations: 45.010 GiB, 18.84% gc time)

For the in-place version:

  5.385643 seconds (164.01 k allocations: 11.882 MiB)
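As a quick sanity check (my own sketch, not part of the timings above), the buffered version agrees with the original:

```julia
using Random: bitrand

hamming(bits1, bits2) = count(bits1 .⊻ bits2)               # original, allocating
hamming(bits1, bits2, x) = count(map!(⊻, x, bits1, bits2))  # buffered, in-place

a, b = bitrand(150), bitrand(150)
x = similar(a)  # reusable buffer
@assert hamming(a, b, x) == hamming(a, b)
```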

Some benchmarks of yours versus the original hamming on my machine:

julia> @benchmark hamming(x, y, z) setup=(x=bitrand(150);y=bitrand(150);z=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.553 ns (0.00% GC)
  median time:      32.998 ns (0.00% GC)
  mean time:        33.385 ns (0.00% GC)
  maximum time:     167.304 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark Hamming.hamming(x, y) setup=(x=bitrand(150);y=bitrand(150))
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  2
  --------------
  minimum time:     59.120 ns (0.00% GC)
  median time:      109.120 ns (0.00% GC)
  mean time:        132.532 ns (7.03% GC)
  maximum time:     7.330 μs (95.79% GC)
  --------------
  samples:          10000
  evals/sample:     932

You can get rid of the buffer by implementing a custom function too if you want:

function hamming(A::BitArray, B::BitArray)
    #size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac,Bc = A.chunks, B.chunks
    W = 0
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))  # parenthesized: & binds tighter than ⊻
    return W
end

This gets you another small speedup:

julia> @benchmark hamming(x,y,z) setup=(x=bitrand(n);y=bitrand(n);z=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.141 ns (0.00% GC)
  median time:      28.844 ns (0.00% GC)
  mean time:        29.295 ns (0.00% GC)
  maximum time:     109.950 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark hamming(x,y) setup=(x=bitrand(n);y=bitrand(n))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.463 ns (0.00% GC)
  median time:      21.565 ns (0.00% GC)
  mean time:        21.930 ns (0.00% GC)
  maximum time:     240.221 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

It is customary to name mutating functions with a !, and normally also to put the mutated object first: so hamming!(x, b1, b2).

You can also make it faster by using broadcasting instead of map!:

hamming!(x, b1, b2) = (x .= b1 .⊻ b2; return count(x))

This is roughly as fast as the implementation that accesses the .chunks field.

Pros: simpler implementation that does not rely on internal fields and unexported undocumented functions.
Cons: needs input buffer.
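For completeness, a quick check (my sketch) that this broadcasting version matches the naive one:

```julia
using Random: bitrand

hamming!(x, b1, b2) = (x .= b1 .⊻ b2; count(x))  # buffer x is mutated, hence the !

b1, b2 = bitrand(150), bitrand(150)
x = similar(b1)
@assert hamming!(x, b1, b2) == count(b1 .⊻ b2)
```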

That speeds up the code (for ITER = 10) from 31 to 5 seconds, which is great. Thank you.

I am now wondering if doing this in chunks somehow prevents llvm from vectorizing the code (e.g. using SIMD)?

Adding an @inbounds (in front of the for loop in hamming) probably gives you SIMD.
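Concretely, that suggestion amounts to something like the following sketch (the same chunk-based function as above, with @inbounds on the loop; whether it actually enables SIMD depends on your machine):

```julia
using Random: bitrand

function hamming_chunks(A::BitArray, B::BitArray)
    Ac, Bc = A.chunks, B.chunks
    W = 0
    @inbounds for i = 1:(length(Ac)-1)  # @inbounds drops bounds checks in the hot loop
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))  # mask off unused trailing bits
    return W
end

a, b = bitrand(150), bitrand(150)
@assert hamming_chunks(a, b) == count(a .⊻ b)
```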

Here, the mutated part is only used as a buffer and is not a result of the function. In those cases I usually don’t put an exclamation mark as I don’t care what goes in or what comes out.

I did try adding @inbounds but it didn’t help, in fact, it made it worse by ~10%. I just rechecked and got the same thing. Any idea what could be going on?

The function name signals that it is safe to pass in that vector. I would definitely prefer that ! to be there.

I got an explanation that the problem is the use of bitvectors. It seems you can make it hugely faster by using UInt64 or UInt128 instead (assuming n < 129). I haven’t had a chance to test it yet though.

That seems very wrong to me. Bitvectors should be vastly faster than UInt64, since you basically do 64 operations simultaneously.

If you compute the Hamming distance between two 64-bit strings represented as UInt64 values, that is one XOR and one popcount assembly instruction, most likely less than a nanosecond in total.
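In code, that single-word case looks like this (my sketch, not from the thread):

```julia
# Hamming distance between two 64-bit strings packed into UInt64s:
hamming64(a::UInt64, b::UInt64) = count_ones(a ⊻ b)  # one xor + one popcount

hamming64(UInt64(0b1010), UInt64(0b0110))  # → 2 (two differing bits)
```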

Ok, that makes sense. But that is, AFAIK, exactly what bitvectors do.

If you look at the assembly you get from the function hamming you can see it is doing something much more expensive.

That is odd, because:

julia> a = bitrand(64*3);

julia> a.chunks
3-element Array{UInt64,1}:
 0x223aff951d3eabc4
 0x3f59f423778c8580
 0xdcf4b11767dcb777

Bitvectors are just UInt64s under the hood. That is what’s supposed to make them fast.

To quote someone else:

If you use a UInt64 instead of a BitVector, a few things happen:

  • UInt64s are stored directly in the array instead of pointers being stored
  • No iteration is needed in hamming. The hamming boils down to only an xor and a popcnt.
  • find_min can vectorize, running hamming in parallel.
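To make that quoted suggestion concrete, here is a hypothetical sketch storing each string in a single UInt128 (this only works for n ≤ 128, so it would not cover the original n = 150 without splitting each string across two words):

```julia
using Random

hamming(a::UInt128, b::UInt128) = count_ones(a ⊻ b)  # one xor + one popcount, no loop

function find_min(strings::Vector{UInt128})
    minsofar = typemax(Int)
    for i in eachindex(strings), j in i+1:length(strings)
        d = hamming(strings[i], strings[j])
        d < minsofar && (minsofar = d)
    end
    return minsofar
end

strings = rand(UInt128, 2^10)  # smaller N than the original, for a quick demo
println(find_min(strings))
```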

I don’t understand why bitvectors don’t do the same thing. They too are just wrapped arrays of UInt64. Iteration is needed if you have an array of UInt64s, surely.

I don’t see why they are different. Something to do with broadcasting?