# How to do SIMD code with wide-register accumulators (@simd vs LoopVectorization.jl vs SIMD.jl)

I have a loop that would benefit from SIMD operations, but it has one "accumulator" variable that confuses most of the SIMD tools available. Here is the pseudocode, and below is the actual code with benchmarks. My first question: is there a SIMD-ifying decorator like `@simd` or `@turbo` that can vectorize accumulation operations like this automatically? I was able to make it work by manually implementing the loop with SIMD.jl, but it would be nice to just have macros like `@turbo` figure it out (maybe with a bit of extra annotation). My second question: looking at the actual code below and its benchmarks, do you have suggestions for making it even faster?

Pseudocode:

```
accumulator = 0x0
for i in 1:length(array_1)
    array_1[i] ⊻= array_2[i]
    accumulator ⊻= some_bit_fiddling_formula(array_1[i], array_2[i])
end
```

Attempting to use `@turbo` results in `ERROR: LoadError: LoadError: Don't know how to handle expression. cnt2 ⊻= (((cnt1 ⊻ newx1) ⊻ newz1) ⊻ x1z2) & anti_comm` where `cnt1` and `cnt2` are accumulator variables defined outside the loop.
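For contrast, `@turbo` has no trouble with the standard `+=` reduction pattern — a sketch like the one below (the name `popcount_sum` is made up for illustration, it is not part of the code above) compiles and runs; swapping the `+=` for a `⊻=` accumulator appears to be what triggers this kind of error:

```julia
using LoopVectorization

# Hedged sketch: a plain += reduction over count_ones, which
# @turbo vectorizes without complaint.
function popcount_sum(a::Vector{UInt64})
    s = 0
    @turbo for i in eachindex(a)
        s += count_ones(a[i])  # supported reduction
    end
    s
end
```

For reference, `popcount_sum(a)` should equal `sum(count_ones, a)`.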

Here are the benchmark results for performing the loop on arrays of size 800k:

• base Julia, no annotations: 1.5 ms (the function `simple` below)
• `@inbounds @simd`: 1.0 ms (the function `inboundssimd` below)
• `@turbo`: does not work (the function `lv_turbo` below)
• SIMD.jl manual implementation: 0.7 ms (the function `simd_vec4` below)

actual code:

```julia
using BenchmarkTools
import SIMD
import LoopVectorization
import VectorizationBase

function simple(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s&3 # just the low 2 bits matter
end

function inboundssimd(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    @inbounds @simd for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s&3 # just the low 2 bits matter
end

# Does not work due to the assignments to the accumulators cnt1 and cnt2
function lv_turbo(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(eltype(r))
    cnt2 = zero(eltype(r))
    len = length(l)>>1
    LoopVectorization.@turbo for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        newx1 = x1 ⊻ x2
        r[i] = newx1
        newz1 = z1 ⊻ z2
        r[i+len] = newz1
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s&3 # just the low 2 bits matter
end

function simd_vec4(r::Vector{UInt64}, l::Vector{UInt64})
    cnt1 = zero(SIMD.Vec{4,UInt64})
    cnt2 = zero(SIMD.Vec{4,UInt64})
    lane = SIMD.VecRange{4}(0)
    len = length(l)>>1 # assumed to be a multiple of 4
    @inbounds for i in 1:4:len
        x1, x2, z1, z2 = l[i+lane], r[i+lane], l[i+len+lane], r[i+len+lane]
        r[i+lane] = newx1 = x1 ⊻ x2
        r[i+len+lane] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    s1 = 0
    for i in 1:length(cnt1)
        s1 += count_ones(cnt1[i])
    end
    s2 = 0
    for i in 1:length(cnt2)
        s2 += count_ones(cnt2[i])
    end
    (s1 ⊻ (s2<<1))&3 # just the low 2 bits matter
end

##

a = rand(UInt64, 8*100_000);
b = rand(UInt64, 8*100_000);
@btime simple($a,$b)
@btime inboundssimd($a,$b)
#@btime lv_turbo($a,$b) # errors
@btime simd_vec4($a,$b)
```

Note the answers don’t match here:

```julia
julia> simd_vec4(a,b)
3

julia> inboundssimd(a,b)
1
```

I don’t think it’s trivial to SIMD this correctly; it doesn’t look like the loop iterations can be reordered freely without changing the result:

```julia
cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
cnt1 ⊻= anti_comm
```

The arrays are modified in place. It is not surprising that the results differ, given that `a` is a different array by the time of the second call. However, the modification is self-inverse, so running the function twice should undo any changes.
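A minimal self-contained check of that claim (the loop's in-place effect on the first argument is just an xor with the second, which is its own inverse):

```julia
# Sketch: the update r[i] = r[i] ⊻ l[i] is an involution,
# so applying it twice restores the original array.
r = rand(UInt64, 8); l = rand(UInt64, 8)
r0 = copy(r)
r .⊻= l   # what the loop does to r, in effect
r .⊻= l   # second application undoes it
@assert r == r0
```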

Analytically, it can be proven that reordering does not matter.

EDIT: there was a small typo in the last function though. Fixed now.

```julia
julia> using StableRNGs

julia> a = rand(StableRNG(1), UInt64, 8*100_000);

julia> b = rand(StableRNG(2), UInt64, 8*100_000);

julia> a2 = similar(a); b2 = similar(b);

julia> simd_vec4(copyto!(a2,a),copyto!(b2,b)) == inboundssimd(copyto!(a2,a),copyto!(b2,b))
false
```

yup, fixed now: the last function used `cnt1` instead of `cnt2` in its last line.


Okay, great. Can confirm it matches.

I’d have to look more closely to figure out how to prove associativity.
Having the vectorizer understand that is probably necessary for it to work out that `+= count_ones(cnt[i])` is the correct way to reduce the expanded vector.

Another alternative is to add support for a user telling it how to do so / passing the functions to use, and then have it just trust that.

Slightly slower than the `SIMD.jl` version for me:

```julia
using LoopVectorization: @turbo

function count_ones_test(r::Vector{UInt64}, l::Vector{UInt64})
    s1 = cnt1 = zero(UInt64) # function was returning `UInt` even when `s1 = s2 = 0`
    s2 = cnt2 = zero(UInt64)
    len = length(l)>>1
    @turbo for i in 1:len
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        newx1 = x1 ⊻ x2
        r[i] = newx1
        newz1 = z1 ⊻ z2
        r[i+len] = newz1
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        s2 += count_ones(cnt2 ⊻ (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm)
        s1 += count_ones(cnt1 ⊻ anti_comm)
    end
    (s1 ⊻ (s2<<1))&3 # just the low 2 bits matter
end
```

Benchmarks:

```julia
julia> @btime count_ones_test($a2,$b2) # original size
  396.156 μs (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a2,$b2)
  387.904 μs (0 allocations: 0 bytes)
3

julia> a = rand(UInt64, 8*1_000);

julia> b = rand(UInt64, 8*1_000);

julia> @btime count_ones_test($a,$b)
  1.964 μs (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a,$b)
  1.510 μs (0 allocations: 0 bytes)
3

julia> a = rand(UInt64, 8*100);

julia> b = rand(UInt64, 8*100);

julia> @btime count_ones_test($a,$b)
  198.594 ns (0 allocations: 0 bytes)
0x0000000000000003

julia> @btime simd_vec4($a,$b)
  151.630 ns (0 allocations: 0 bytes)
3
```

Different computer:

```julia
julia> @btime count_ones_test($a,$b)
  1.898 ms (0 allocations: 0 bytes)
0x0000000000000001

julia> @btime simd_vec4($a,$b)
  1.959 ms (0 allocations: 0 bytes)
1

julia> a = rand(UInt64, 8*1_000);

julia> b = rand(UInt64, 8*1_000);

julia> @btime count_ones_test($a,$b)
  6.332 μs (0 allocations: 0 bytes)
0x0000000000000001

julia> @btime simd_vec4($a,$b)
  5.290 μs (0 allocations: 0 bytes)
1

julia> a = rand(UInt64, 8*100);

julia> b = rand(UInt64, 8*100);

julia> @btime count_ones_test($a,$b)
  629.300 ns (0 allocations: 0 bytes)
0x0000000000000002

julia> @btime simd_vec4($a,$b)
  485.497 ns (0 allocations: 0 bytes)
2
```

Your solution is faster. But LoopVectorization handles `+` correctly.


So the trick is to rewrite the loop so that each iteration performs the reduction over the SIMD registers inside the loop, instead of doing the reduction after the loop (which requires explicitly declaring a SIMD register outside the loop)? This is really neat!

Could you elaborate, what do you mean by “But LoopVectorization handles + correctly”? Is there a bug in my code that I missed?

I can also confirm the benchmark results:

```julia
julia> @btime simple($a,$b)
  1.411 ms (0 allocations: 0 bytes)

julia> @btime inboundssimd($a,$b)
  985.651 μs (0 allocations: 0 bytes)

julia> @btime simd_vec4($a,$b)
  679.541 μs (0 allocations: 0 bytes)

julia> @btime lv_count_ones_test($a,$b)
  726.223 μs (0 allocations: 0 bytes)
```

And I assume the fact that xor and count_ones commute is something not currently used in the optimizations done by LoopVectorization?

Yes. It doesn’t know anything about `count_ones` or `xor`, actually. That’s what I meant about handling `+` – LoopVectorization doesn’t know what to do with `xor` or `count_ones`.
Also, note that from code like this:

```julia
@turbo for i in 1:len
    x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
    r[i] = newx1 = x1 ⊻ x2
    r[i+len] = newz1 = z1 ⊻ z2
    x1z2 = x1 & z2
    anti_comm = (x2 & z1) ⊻ x1z2
    cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
    cnt1 ⊻= anti_comm
end
```

LoopVectorization can’t know that `count_ones` is involved at all. It’d have to try and produce `cnt1` and `cnt2` scalars equal to the scalars you get from the serial loop — which is why I moved the `count_ones` into the loop body in `count_ones_test`.
From there, a potential future optimization would be to have it hoist the `count_ones` back out of the loop and produce the optimized version.
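As an aside (my own addition, not something LoopVectorization currently models), the algebra such a transformation would lean on is that `count_ones` does not distribute over `⊻`, but it satisfies an inclusion–exclusion identity whose parity does:

```julia
a, b = rand(UInt64), rand(UInt64)
# Inclusion–exclusion: bits set in both operands cancel under xor.
@assert count_ones(a ⊻ b) == count_ones(a) + count_ones(b) - 2 * count_ones(a & b)
# Hence popcount parity is additive over xor — the kind of fact a
# proof that the reordering is harmless would use.
@assert isodd(count_ones(a ⊻ b)) == (isodd(count_ones(a)) ⊻ isodd(count_ones(b)))
```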


Thank you, this has been extremely instructive and helpful! Is it on LoopVectorization’s roadmap to permit some form of user-provided assumptions? E.g. “assume this function commutes with this associative operator”. That would enable moving `count_ones` outside of the loop.

There isn’t an official road map, but I think something like that can be added in the future. Feel free to file an issue for tracking it.
