# Writing a SIMD friendly iterate function

Hi everyone,

For a project I’m working on I have a data structure with an `AbstractArray` interface that internally stores its data in a `Vector` of `Vector`s (more precisely, it uses fixed-size arrays, but that doesn’t matter for the problem at hand). Now I would like to efficiently iterate over these elements as if it were a single array. Of course I could do this:

```julia
# For example, compute a sum
function sumiterate_pv(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        for x in part
            s += x
        end
    end
    s
end
```

But this restricts the function to only accept `PartitionedVector` types and I’d like something more general, such as:

```julia
function sumiterate(A::AbstractVector)
    s = zero(eltype(A))
    for x in A
        s += x
    end
    s
end
```

Now I know about `Iterators.flatten`, but that performs a lot worse than the manually written double loop, so I decided to write my own `iterate` function, in which I try to mimic the double loop behavior. This is already performing a lot better than `Iterators.flatten`, but computing the sum using this `iterate` function is still not as performant as the `sumiterate_pv` example. By looking at the LLVM that is being produced for both versions I can see that `sumiterate_pv` is compiled into SIMD code, which makes sense for computing a sum. However, this does not happen for loops using my custom `iterate` function. I think it has something to do with the fact that the compiler doesn’t really see it as a double loop, but as a single loop with a branch in the body, which makes it hard to vectorize. Does anyone have any idea if I can somehow rewrite my `iterate` function in a way that would compile down to SIMD code? Below is a (simplified) snippet of my code.

```julia
mutable struct PVIterateState{T}
    num_partitions::Int
    outer_idx::Int
    cur_part::Vector{T}
    cur_part_length::Int
    inner_idx::Int
end

function Base.iterate(pv::PartitionedVector)
    isempty(pv) && return nothing
    cur_part = pv.partitions[1]
    state = PVIterateState(length(pv.partitions), 1, cur_part, length(cur_part), 2)
    (cur_part[1], state)
end

function Base.iterate(pv::PartitionedVector, s::PVIterateState)
    if s.inner_idx <= s.cur_part_length # inner loop
        @inbounds res = s.cur_part[s.inner_idx]
        s.inner_idx += 1
        (res, s)
    elseif s.outer_idx < s.num_partitions # outer loop
        s.outer_idx += 1
        @inbounds s.cur_part = pv.partitions[s.outer_idx]
        s.cur_part_length = length(s.cur_part)
        @inbounds res = s.cur_part[1]
        s.inner_idx = 2
        (res, s)
    else
        nothing
    end
end
```
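
For comparison, here is a minimal, self-contained sketch of the `Iterators.flatten` approach mentioned above (`MiniPV` and `sum_flatten` are hypothetical names, a stand-in for the real `PartitionedVector`):

```julia
# Minimal stand-in for PartitionedVector, just for this comparison.
struct MiniPV{T}
    partitions::Vector{Vector{T}}
end

# The Iterators.flatten approach: iterate all partitions as one flat sequence.
sum_flatten(pv::MiniPV) = sum(Iterators.flatten(pv.partitions))

sum_flatten(MiniPV([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # 21.0
```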

The following should SIMD if it is applicable to your type and the caller uses `@inbounds`. It needs some work to become compatible with offset arrays.

```julia
Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1,1))
    i, j = state
    while true
        i <= length(pv.partitions) || return nothing
        part = pv.partitions[i]
        j <= length(part) || (i += 1; j = 1; continue)
        return (part[j], (i, j + 1))
    end
end
```

Thanks for the reply, foobar. Unfortunately, it doesn’t seem to work. I inspected the generated LLVM and see no SIMD being generated for `sumiterate` using your `iterate` implementation. This is also reflected by benchmarking `sumiterate` and `sumiterate_pv`:

```julia
@btime sumiterate_pv(pv)
#  3.021 ms (1 allocation: 16 bytes)

@btime sumiterate(pv)
#  10.683 ms (1 allocation: 16 bytes)
```

In fact, that timing is even a bit worse than the solution I had, because your solution basically performs the outer-loop bounds check on every inner-loop iteration, which is suboptimal. If I update your code like this, I get similar timings to my original implementation:

```julia
Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1,1,pv.partitions[1]))
    i, j, part = state
    while true
        j <= length(part) && return @inbounds part[j], (i, j + 1, part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end

@btime sumiterate(pv)
#  6.717 ms (1 allocation: 16 bytes)
```

This does assume that a `PartitionedVector` always has at least one partition, but in my use case that’s something I can get away with. Anyway, unfortunately this is still twice as slow as the “handwritten” double-loop.
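
If that assumption ever becomes a problem, one possible sketch (not benchmarked; the `PartitionedVector` definition is repeated here so the snippet stands alone) is to let the default state fall back to an empty partition:

```julia
# Definition repeated from this thread so the snippet is self-contained.
struct PartitionedVector{T}
    partitions::Vector{Vector{T}}
end

# Same tuple-state iterate as above, but the initial `part` falls back to an
# empty vector when there are no partitions, so the first length check fails
# and iteration stops immediately instead of erroring on pv.partitions[1].
Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector,
        state = (1, 1, isempty(pv.partitions) ? eltype(pv.partitions)() : pv.partitions[1]))
    i, j, part = state
    while true
        j <= length(part) && return @inbounds part[j], (i, j + 1, part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end
```

With this default, iterating a `PartitionedVector` with zero partitions simply yields nothing, while the non-empty case behaves as before.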

You could try the package I wrote; it was basically made for use cases like this. Among other functionality, it defines a `flatview` function and a type `VectorOfVectors <: AbstractVector{<:AbstractVector}` that stores a vector of vectors in a flat vector and so represents a partition of it. `flatview` on a `VectorOfVectors` is an allocation-free O(1) operation.


As @foobar_lv2 mentioned, you need `@inbounds` in front of `for x in A`. With this change, `sumiterate` and `sumiterate_pv` are equivalent in my benchmark (full code below):

```julia
pv = PartitionedVector([randn(100) for _ in 1:1000])
@btime sumiterate($pv)
#  100.343 μs (0 allocations: 0 bytes)
@btime sumiterate_pv($pv)
#  100.288 μs (0 allocations: 0 bytes)
```

However, `sumiterate_pv` can be optimized even further by simply putting `@simd` in front of `for x in part`:

```julia
@btime sumiterate_pv_simd($pv)
#  16.979 μs (0 allocations: 0 bytes)
```

I think it is impossible with only the bare-minimum iterator protocol, because the `@simd` macro needs to access the container via an index.
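
As a rough illustration of that point, here is a sketch (`sum_simd` is a hypothetical name) where the loop runs over `eachindex`, an indexable range, rather than through the `iterate` protocol:

```julia
# @simd wants a loop over an indexable range; indexing via eachindex
# gives the compiler the regular access pattern it needs to vectorize.
function sum_simd(v::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    s
end

sum_simd(collect(1.0:10.0))  # 55.0
```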

However, with transducers (disclaimer: I wrote the package) you can just use `Cat` (equivalent of `Iterators.flatten`) to easily get “native” loop speed (i.e., comparable to `sumiterate_pv`):

```julia
using Transducers
@btime mapfoldl(Cat(), +, $pv.partitions)
#  100.513 μs (2 allocations: 32 bytes)
```

Note that I’m just passing `pv.partitions`, which is a plain vector-of-vectors. I don’t even need to write a complicated `iterate` function.

Also, Transducers.jl will have a `simd` option in the next release (it’s already on the current master). By just passing `simd=true`, you get SIMD’ed summation:

```julia
@btime mapfoldl(Cat(), +, $pv.partitions; simd=true)
#  18.670 μs (2 allocations: 32 bytes)
```
Full code:
```julia
struct PartitionedVector{T}
    partitions::Vector{Vector{T}}
end

Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1,1,pv.partitions[1]))
    i, j, part = state
    while true
        j <= length(part) && return @inbounds part[j], (i, j + 1, part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end

Base.length(pv::PartitionedVector) = sum(length, pv.partitions)
Base.eltype(::Type{PartitionedVector{T}}) where T = T

function sumiterate(A)
    s = zero(eltype(A))
    @inbounds for x in A
        s += x
    end
    s
end

function sumiterate_pv(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        for x in part
            s += x
        end
    end
    s
end

function sumiterate_pv_simd(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        @simd for x in part
            s += x
        end
    end
    s
end

pv = PartitionedVector([randn(100) for _ in 1:1000])

using BenchmarkTools

@btime sumiterate($pv)
@btime sumiterate_pv($pv)
@btime sumiterate_pv_simd($pv)

using Transducers
@btime mapfoldl(Cat(), +, $pv.partitions)
@btime mapfoldl(Cat(), +, $pv.partitions; simd=true)
```

Interesting, I’m actually getting different results. Putting `@inbounds` in front of `for x in A` doesn’t change anything. I have to say I also don’t see why it should, as the `iterate` function already has `@inbounds` annotations wherever appropriate. Also, on my machine `sumiterate_pv` and `sumiterate_pv_simd` actually perform the same; apparently Julia/LLVM already recognizes that the inner loop in `sumiterate_pv` can be executed using SIMD instructions.

I like the Transducers package though; I will definitely have a look to see if it’s appropriate for my setting (which is a bit more complicated than the `PartitionedVector` example I have posted here).

Ah, yes, you are right. My bad. I first checked that with @foobar_lv2’s method (where I remember adding `@inbounds` made a difference) and then switched to your method without checking the version without `@inbounds`.