Writing a SIMD-friendly iterate function

Hi everyone,

For a project I’m working on I have a data structure with an AbstractArray interface that internally stores its data in a Vector of Vectors (more precisely, it uses fixed-size arrays, but that doesn’t matter for the problem at hand). Now I would like to efficiently iterate over these elements as if they formed a single array. Of course I could do this:

# For example, compute a sum
function sumiterate_pv(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        for x in part
            s += x
        end
    end
    s
end
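
(For reference, the minimal shape of the type these snippets assume looks roughly like this; the full, simplified definition appears further down in the thread.)

# Minimal sketch: the data lives in a vector of per-partition vectors, accessed
# as pv.partitions in the code above and below.
struct PartitionedVector{T}
    partitions::Vector{Vector{T}}
end

Base.eltype(::Type{PartitionedVector{T}}) where {T} = T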

But this restricts the function to accepting only PartitionedVector types, and I’d like something more general, such as:

function sumiterate(A::AbstractVector)
    s = zero(eltype(A))
    for x in A
        s += x
    end
    s
end

Now, I know about Iterators.flatten, but that performs a lot worse than the manually written double loop, so I decided to write my own iterate function in which I try to mimic the double-loop behavior. That already performs a lot better than Iterators.flatten, but computing the sum through this iterate function is still not as fast as the sumiterate_pv example. Looking at the LLVM IR produced for both versions, I can see that sumiterate_pv is compiled into SIMD code, which makes sense for computing a sum, but this does not happen for loops using my custom iterate function. I suspect the reason is that the compiler doesn’t see it as a double loop but as a single loop with a branch in the body, which makes it hard to vectorize. Does anyone have an idea how I could rewrite my iterate function so that it compiles down to SIMD code? Below is a (simplified) snippet of my code.

mutable struct PVIterateState{T}
    num_partitions::Int
    outer_idx::Int
    cur_part::Vector{T}
    cur_part_length::Int
    inner_idx::Int
end

function Base.iterate(pv::PartitionedVector)
    isempty(pv) && return nothing
    cur_part = pv.partitions[1]
    state = PVIterateState(length(pv.partitions), 1, cur_part, length(cur_part), 2)
    (cur_part[1], state)
end

function Base.iterate(pv::PartitionedVector, s::PVIterateState)
    if s.inner_idx <= s.cur_part_length # inner loop
        @inbounds res = s.cur_part[s.inner_idx]
        s.inner_idx += 1
        (res, s)
    elseif s.outer_idx < s.num_partitions # outer loop
        s.outer_idx += 1
        @inbounds s.cur_part = pv.partitions[s.outer_idx]
        s.cur_part_length = length(s.cur_part)
        @inbounds res = s.cur_part[1]
        s.inner_idx = 2
        (res, s)
    else
        nothing
    end
end
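
For reference, the Iterators.flatten version I mentioned looks roughly like this (the name sumiterate_flatten is just for illustration); it is the slow baseline I’m comparing against:

# Flatten-based version for comparison; in my benchmarks this is noticeably
# slower than the hand-written double loop above.
function sumiterate_flatten(pv::PartitionedVector)
    s = zero(eltype(pv))
    for x in Iterators.flatten(pv.partitions)
        s += x
    end
    s
end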

The following should SIMD if it is applicable to your type and the loop that consumes it has @inbounds. It needs some work to become compatible with offset arrays.

Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1, 1))
    i, j = state          # i: partition index, j: index within the current partition
    while true
        i <= length(pv.partitions) || return nothing
        part = pv.partitions[i]
        j <= length(part) || (i += 1; j = 1; continue)  # done with this partition, move on
        return (part[j], (i, j + 1))
    end
end
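
The @inbounds I mean goes on the consuming loop; @propagate_inbounds only removes the checks inside iterate when the caller opts in. A sketch of such a call site:

# Sketch of a caller: the @inbounds on this loop is what propagates into
# iterate() via @propagate_inbounds and elides the bounds checks there.
function sumiterate(A::AbstractVector)
    s = zero(eltype(A))
    @inbounds for x in A
        s += x
    end
    s
end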

Thanks for the reply, foobar :slight_smile: Unfortunately, it doesn’t seem to work. I inspected the generated LLVM IR and see no SIMD instructions for sumiterate with your iterate implementation. This is also reflected by benchmarking sumiterate and sumiterate_pv:

@btime sumiterate_pv(pv)
  3.021 ms (1 allocation: 16 bytes)

@btime sumiterate(pv)
  10.683 ms (1 allocation: 16 bytes)

In fact, that timing is even a bit worse than the solution I had, because your version performs the outer-loop bounds check on every inner-loop iteration, which is suboptimal. If I update your code like this, I get timings similar to my original implementation:

Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1,1,pv.partitions[1]))
    i,j,part = state
    while true
        j <= length(part) && return @inbounds part[j], (i,j+1,part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end

@btime sumiterate(pv)
  6.717 ms (1 allocation: 16 bytes)

This does assume that a PartitionedVector always has at least one partition, but in my use case that’s something I can get away with. Anyway, unfortunately this is still twice as slow as the “handwritten” double-loop.
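
(If that at-least-one-partition assumption ever needs to go, a variant along these lines, with one extra branch on the first call, would also handle an empty pv.partitions; just a sketch:)

# Start "before" the first partition with an empty placeholder vector, so the
# default state never indexes into pv.partitions.
Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector,
                                               state=(0, 1, eltype(pv.partitions)()))
    i, j, part = state
    while true
        j <= length(part) && return @inbounds part[j], (i, j + 1, part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end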

You could try ArraysOfArrays.jl.

I basically wrote it for use cases like this. Among other functionality, the package defines a flatview function and a type VectorOfVectors <: AbstractVector{<:AbstractVector} that stores a vector of vectors in a single flat vector and represents the partitioning on top of it. flatview on a VectorOfVectors is an allocation-free O(1) operation.


As @foobar_lv2 mentioned, you need @inbounds in front of for x in A. With this change, sumiterate and sumiterate_pv are equivalent in my benchmark (full code below):

pv = PartitionedVector([randn(100) for _ in 1:1000])
@btime sumiterate($pv)
#  100.343 μs (0 allocations: 0 bytes)
@btime sumiterate_pv($pv)
#  100.288 μs (0 allocations: 0 bytes)

However, sumiterate_pv can be optimized even further by simply putting @simd in front of for x in part:

@btime sumiterate_pv_simd($pv)
  16.979 μs (0 allocations: 0 bytes)

I think this is impossible with only the bare-minimum iterator protocol, because the @simd macro needs to access the container by index.
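
To illustrate what I mean (a minimal sketch, not part of the benchmarks below): roughly speaking, @simd rewrites the loop into an indexed form, which works for arrays but not for a container that only provides iterate.

# @simd accepts this loop because `part` is a Vector: the macro can rewrite it
# into an indexed form internally, even though it is written element-wise.
function sumpart_simd(part::Vector{Float64})
    s = 0.0
    @simd for x in part
        s += x
    end
    s
end

# Something like `@simd for x in pv`, where pv only implements iterate(), gives
# the macro no indexed access to work with, so this trick doesn't carry over.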

However, with transducers (disclaimer: I wrote the package) you can just use Cat (equivalent of Iterators.flatten) to easily get “native” loop speed (i.e., comparable to sumiterate_pv):

using Transducers
@btime mapfoldl(Cat(), +, $pv.partitions)
#  100.513 μs (2 allocations: 32 bytes)

Note that I’m just passing pv.partitions, which is a plain vector of vectors; I don’t even need to write a complicated iterate function.

Also, Transducers.jl will have a simd option in the next release (it’s already on the current master). By just passing simd=true, you get a SIMD’ed summation:

@btime mapfoldl(Cat(), +, $pv.partitions; simd=true)
#  18.670 μs (2 allocations: 32 bytes)
Full code
struct PartitionedVector{T}
    partitions::Vector{Vector{T}}
end

Base.@propagate_inbounds function Base.iterate(pv::PartitionedVector, state=(1,1,pv.partitions[1]))
    i,j,part = state
    while true
        j <= length(part) && return @inbounds part[j], (i,j+1,part)
        i += 1; j = 1
        i <= length(pv.partitions) || return nothing
        @inbounds part = pv.partitions[i]
    end
end

Base.length(pv::PartitionedVector) = sum(length, pv.partitions)
Base.eltype(::Type{PartitionedVector{T}}) where T = T

function sumiterate(A)
    s = zero(eltype(A))
    @inbounds for x in A
        s += x
    end
    s
end

function sumiterate_pv(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        for x in part
            s += x
        end
    end
    s
end

function sumiterate_pv_simd(pv::PartitionedVector)
    s = zero(eltype(pv))
    for part in pv.partitions
        @simd for x in part
            s += x
        end
    end
    s
end

pv = PartitionedVector([randn(100) for _ in 1:1000])

using BenchmarkTools

@btime sumiterate($pv)
@btime sumiterate_pv($pv)
@btime sumiterate_pv_simd($pv)

using Transducers
@btime mapfoldl(Cat(), +, $pv.partitions)
@btime mapfoldl(Cat(), +, $pv.partitions; simd=true)

Interesting, I’m actually getting different results. Putting @inbounds in front of for x in A doesn’t change anything, and I have to say I also don’t see why it should, as the iterate function already has @inbounds wherever appropriate. Also, on my machine sumiterate_pv and sumiterate_pv_simd perform the same; apparently Julia/LLVM already recognizes that the inner loop in sumiterate_pv can be executed with SIMD instructions.
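
For anyone who wants to reproduce the check (assuming the definitions from the full code above), this is roughly how the generated code can be inspected; a vectorized loop shows up as a vector.body block operating on <N x double> values:

using InteractiveUtils  # provides @code_llvm outside the REPL

@code_llvm sumiterate_pv(pv)  # inner loop vectorizes here on my machine
@code_llvm sumiterate(pv)     # compare: the iterate-based version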

I like the Transducers package, though. I will definitely have a look to see whether it fits my setting (which is a bit more complicated than the PartitionedVector example I posted here).

Ah, yes, you are right, my bad. I first checked with @foobar_lv2’s method (where I remember @inbounds making a difference) and then switched to your method without re-checking the version without @inbounds.