How to do SIMD code with wide-register accumulators (@simd vs LoopVectorization.jl vs SIMD.jl)

I have a loop that would benefit from SIMD operations, but it has one “accumulator” variable that confuses most SIMD tools available. Here is the pseudo code, and below is the actual code with benchmarks. My questions is is there a SIMD-ify decorator like @simd or @turbo that can vectorize accumulation operations automatically? I was able to make it work by manually implementing the loop with SIMD.jl, but it would be nice to just have macros like @turbo figure it out (maybe with a bit of extra annotation). My second question is if you look at the actual code below, and its benchmarks, do you have suggestions how to make it even faster?

Pseudocode:

accumulator = 0x0
for i in 1:length(array_1)
    array_1[i] ⊻= array_2[i]
    accumulator ⊻= some_bit_fiddling_formula(array_1[i], array_2[i])
end

Attempting to use @turbo results in ERROR: LoadError: LoadError: Don't know how to handle expression. cnt2 ⊻= (((cnt1 ⊻ newx1) ⊻ newz1) ⊻ x1z2) & anti_comm where cnt1 and cnt2 are accumulator variables defined outside the loop.

Here are the benchmark results for performing the loop on arrays of size 800k:

  • base julia, no annotations: 1.5ms (the function simple below)
  • @inbounds @simd: 1.0ms (the function inboundssimd below)
  • @turbo does not work (the function lv_turbo below)
  • SIMD.jl manual implementation: 0.7ms (the function simd_vec4 below)

actual code:

using BenchmarkTools
import SIMD
import LoopVectorization
import VectorizationBase

function simple(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s&3 # just the 2 bits matter
end

function inboundssimd(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    @inbounds @simd for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s&3 # just the 2 bits matter
end

# Does not work due to assignment to accumulators cnt1 and cnt2
function lv_turbo(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    LoopVectorization.@turbo for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        newx1 = x1 ⊻ x2
        r[i] = newx1
        newz1 = z1 ⊻ z2
        r[i+len] = newz1
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s
end

function simd_vec4(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(SIMD.Vec{4,UInt64})
    cnt2 = zero(SIMD.Vec{4,UInt64})
    lane = SIMD.VecRange{4}(0)
    len = length(l)>>1
    @inbounds for i in 1:4:len
        x1, x2, z1, z2 = l[i+lane], r[i+lane], l[i+len+lane], r[i+len+lane]
        r[i+lane] = newx1 = x1 ⊻ x2
        r[i+len+lane] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s1 = 0
    for i in 1:length(cnt1)
        s1+=count_ones(cnt1[i])
    end
    s2 = 0
    for i in 1:length(cnt2)
        s2+=count_ones(cnt2[i])
    end
    (s1 ⊻ (s2<<1))&3 # just the 2 bits matter
end

##

a = rand(UInt64, 8*100_000);
b = rand(UInt64, 8*100_000);
@btime simple($a,$b)
@btime inboundssimd($a,$b)
#@btime lv_turbo($a,$b)
@btime simd_vec4($a,$b)

Note the answers don’t match here:

julia> simd_vec4(a,b)
3

julia> inboundssimd(a,b)
1

I don’t think it’s trivial to SIMD this correctly; it doesn’t look like loop iterations can be reordered freely without changing the result

        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm

The arrays are modified in-place. It is not weird that the results are different given that a is a different array now. However, the modification should be cyclical, so running the function twice should undo any changes.

Analytically, it can be proven that reordering does not matter.

EDIT: there was a small typo in the last function though. Fixed now.

julia> using StableRNGs

julia> a = rand(StableRNG(1), UInt64, 8*100_000);

julia> b = rand(StableRNG(2), UInt64, 8*100_000);

julia> a2 = similar(a); b2 = similar(b);

julia> simd_vec4(copyto!(a2,a),copyto!(b2,b)) == inboundssimd(copyto!(a2,a),copyto!(b2,b))
false
1 Like

yup, fixed now: the last function used cnt1 instead of cnt2 in its last line.

1 Like

Okay, great. Can confirm it matches.

I’d have to look more closely to find out how to prove associativity.
Having the vectorizer understand that is probably necessary for it to find out that += count_ones(cnt[i]) is the correct way to reduce the expanded vector.

Another alternative is to add support for a user telling it how to do so / passing the functions to use, and then have it just trust that.

Slightly slower than the SIMD.jl version for me:

function count_ones_test(r::Vector{UInt64}, l::Vector{UInt64})
  s1 = cnt1 = zero(UInt64) # function was returning `UInt` even when `s1 = s2 = 0`
  s2 = cnt2 = zero(UInt64)
  len = length(l)>>1
  @turbo for i in 1:len
    x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
    newx1 = x1 ⊻ x2
    r[i] = newx1
    newz1 = z1 ⊻ z2
    r[i+len] = newz1
    x1z2 = x1 & z2
    anti_comm = (x2 & z1) ⊻ x1z2
    s2 += count_ones(cnt2 ⊻ (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm)
    s1 += count_ones(cnt1 ⊻ anti_comm)
  end
  (s1 ⊻ (s2<<1))&3 # just the 2 bits matter
end

Benchmarks:

julia> @btime count_ones_test($a2,$b2) # original size
  396.156 μs (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a2,$b2)
  387.904 μs (0 allocations: 0 bytes)
3

julia> a = rand(UInt64, 8*1_000);

julia> b = rand(UInt64, 8*1_000);

julia> @btime count_ones_test($a,$b)
  1.964 μs (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a,$b)
  1.510 μs (0 allocations: 0 bytes)
3

julia> a = rand(UInt64, 8*100);

julia> b = rand(UInt64, 8*100);

julia> @btime count_ones_test($a,$b)
  198.594 ns (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a,$b)
  151.630 ns (0 allocations: 0 bytes)
3

Different computer:

julia> @btime count_ones_test($a,$b)
  1.898 ms (0 allocations: 0 bytes)
0x0000000000000001

julia> @btime simd_vec4($a,$b)
  1.959 ms (0 allocations: 0 bytes)
1

julia> a = rand(UInt64, 8*1_000);

julia> b = rand(UInt64, 8*1_000);

julia> @btime count_ones_test($a,$b)
  6.332 μs (0 allocations: 0 bytes)
0x0000000000000001

julia> @btime simd_vec4($a,$b)
  5.290 μs (0 allocations: 0 bytes)
1

julia> a = rand(UInt64, 8*100);

julia> b = rand(UInt64, 8*100);

julia> @btime count_ones_test($a,$b)
  629.300 ns (0 allocations: 0 bytes)
0x0000000000000002

julia> @btime simd_vec4($a,$b)
  485.497 ns (0 allocations: 0 bytes)
2

Your solution is faster. But LoopVectorization handles + correctly.

1 Like

So the trick is to rewrite loops so that in each iteration one does a reduction over the SIMD registers instead of doing the reduction outside of the loop (which requires explicit declaration of a SIMD register outside the loop)? This is really neat!

Could you elaborate, what do you mean by “But LoopVectorization handles + correctly”? Is there a bug in my code that I missed?

I also do confirm the benchmark results.

julia> @btime simple($a,$b)
  1.411 ms (0 allocations: 0 bytes)

julia> @btime inboundssimd($a,$b)
  985.651 μs (0 allocations: 0 bytes)

julia> @btime simd_vec4($a,$b)
  679.541 μs (0 allocations: 0 bytes)

julia> @btime lv_count_ones_test($a,$b)
  726.223 μs (0 allocations: 0 bytes)

And I assume the fact that xor and count_ones commute is something not currently used in the optimizations done by LoopVectorization?

Yes. It doesn’t know anything about count_ones or xor, actually. That’s what I meant about handling + – LoopVectorization doesn’t know what to do with xor or count_ones.
Also, note that from code like this:

    @turbo for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end

LoopVectorization can’t know that count_ones is involved at all. It’d have to try and produce cnt1 and cnt2 scalars equal to the scalars you get from the serial loop.
Which is why:

From there, a potential future optimization is to have it move the count_ones outside of the loop and produce the optimized version.

2 Likes

Thank you, this has been extremely instructive and helpful! Is it on LoopVectorization’s roadmap to permit some form of user provided assumptions? E.g. “assume this function commutes with this associative operator”. That would enable moving count_ones outside of the loop.

There isn’t an official road map, but I think something like that can be added in the future. Feel free to file an issue for tracking it.

1 Like

feature request submitted here manually provide reduction isomorphisms so that expensive operations can be factored out of the loop · Issue #291 · JuliaSIMD/LoopVectorization.jl · GitHub

1 Like