"Functional State" Iterators & Why isn't this loop as fast as entitlement?

TLDR
High level iterators (values(dict), etc.) are potentially >3x slower than they could be… Why and how could they be improved?

Let me lead with the real near term question:
Why is the “ideal” iterator >3x slower than the “entitlement” iterator:

using BenchmarkTools
function functional_value_iterate(s::Int,L::Int,slots,vals)
    @assert L < typemax(Int) 
    if L != typemax(Int)
        for i=s:L
            if @inbounds slots[i] == 0x1
                 return (@inbounds vals[i], i+1)
            end
        end
    end
    nothing
end
function ideal_iterator_sum(iter::Base.ValueIterator{<:Dict})
        ret=0
        # Start
        d=iter.dict
        slots=d.slots
        vals=d.vals
        L=length(slots)

        state=d.idxfloor
        next=nothing
        @goto iterate
        while true
            (v, state) = next
            ret+=v
            @label iterate
            next = functional_value_iterate(state,L,slots,vals)
            if next === nothing
                break
            end
        end
    return ret
end

function entitlement_iterator_sum(iter::Base.ValueIterator{<:Dict})
    ret=0
    
    d=iter.dict
    slots=d.slots
    vals=d.vals
    L=length(slots)
    
    v=valtype(d)(0)
    for state =  d.idxfloor:L
        if @inbounds !(slots[state]==0x1)
            continue
        end
        v=@inbounds vals[state]
        ret += v
    end
    return ret
end

Results:

julia> @benchmark ideal_iterator_sum(values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.699 μs (0.00% GC)
  median time:      2.916 μs (0.00% GC)
  mean time:        3.469 μs (0.00% GC)
  maximum time:     59.679 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark entitlement_iterator_sum(values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     794.000 ns (0.00% GC)
  median time:      860.106 ns (0.00% GC)
  mean time:        968.261 ns (1.71% GC)
  maximum time:     166.960 μs (99.02% GC)
  --------------
  samples:          10000
  evals/sample:     90

julia> 

Background idea:
So I have had a strange obsession recently with the idea that many higher level iterators could be on the order of >6x slower than what their entitlement loops could be.

One thought was could there be an iterator paradigm that with a little magic be easier for the JIT to optimize. Mainly by simplifying the iterate method of higher order iterators to remove any prework that they would have to do.

A solution I thought up is what I am calling a “Functional State” Iterator, with the idea being that any prework that needs to be done in the iterator is done by the initial iterate call and an anonymous function f(state) is created to tell the compiler that these values don’t change. This can work with the existing two argument iterate paradigm by being totally encapsulated in the iterator state as a tuple (f,state). With the reuse of the same anonymous function signaling to the compiler, certain optimizations could happen.

So with inlining and magic:
next = iterate(iter) would become
boxed_vars=.... ; next = f(initial_state,boxed_vars...)
and next = iterate(iter,state ) becoming next = f(state,boxed_vars...)

This simplifies the following transformation:

next=iterate(iter)
while next !== nothing
    v,state = next
    ....
    next = iterate(iter,state)
end

can be transformed to:

boxed_vars = ...
state = initial_state
@goto iterate
while true
    v,state = next
    ....
    @label iterate
    next=f(state,boxed_vars...)
    if next === nothing
          break
    end
end

The following is the basic implementation:

#---------------
struct FunctionalState{I,F<:Function} <: Function
    f::F
end

FunctionalState(iter::I, f::F) where {I, F <: Function} = FunctionalState{I,F}(f)
(fs::FunctionalState)(state) = fs.f(state)


# This is the iterate function for every iterator the produces a Functional State.
function Base.iterate(::I,next::Tuple{FunctionalState{I,F},S}) where { I,S, F<:Function}
    fs, state = next
    state === nothing && return nothing
    next = fs.f(state)
    next === nothing && return nothing
    (val,state) = next
    return ( val,( fs, state))
end

struct FunctionalStateValueIterator{T<:AbstractDict}
    dict::T
end

Base.length(v::FunctionalStateValueIterator) = length(v.dict)
Base.eltype(v::FunctionalStateValueIterator) = eltype(v.dict)

function functional_state_values(d::Dict)
    FunctionalStateValueIterator(d)
end

function functional_value_iterate(s::Int,L::Int,slots,vals)
    @assert L < typemax(Int) 
    if L != typemax(Int)
        for i=s:L
            if @inbounds slots[i] == 0x1
                 return (@inbounds vals[i], i+1)
            end
        end
    end
    nothing
end

function Base.iterate(iter::FunctionalStateValueIterator{<:Dict})
    d=iter.dict
    slots=d.slots
    vals=d.vals
    L=length(slots)
    
    f(i::Int)=functional_value_iterate(i,L,slots,vals)
    fs=FunctionalState(iter,f)
    
    iterate(iter,(fs,d.idxfloor))
end

This works and is a little faster than the current iterator but I want it to get closer to entitlement.

3 Likes

My anecdotal evidence that some high level iterators are slow compared to entitlement:

using BenchmarkTools
d=Dict(v=>v for v in 1:1000)
function iterator_sum(f,iter)    
    ret=0
    next = iterate(iter)
    while next !== nothing
        (v, state) = next
        ret+=f(v)
        next = iterate(iter, state)
    end
    # for v in iter
    #     ret += f(v)
    # end
    return ret
end
function entitlement_iterator_sum(iter::Base.ValueIterator{<:Dict})
    ret=0
    
    d=iter.dict
    slots=d.slots
    vals=d.vals
    L=length(slots)
    
    v=valtype(d)(0)
    for state =  d.idxfloor:L
        if @inbounds !(slots[state]==0x1)
            continue
        end
        v=@inbounds vals[state]
        ret += v
    end
    return ret
end

@benchmark sum(values(d))
@benchmark iterator_sum(v->v,values(d))
@benchmark entitlement_iterator_sum(values(d))

Results:

julia> @benchmark sum(values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     5.988 μs (0.00% GC)
  median time:      6.617 μs (0.00% GC)
  mean time:        7.841 μs (0.00% GC)
  maximum time:     411.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark iterator_sum(v->v,values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     7.276 μs (0.00% GC)
  median time:      8.681 μs (0.00% GC)
  mean time:        9.739 μs (0.00% GC)
  maximum time:     398.177 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark entitlement_iterator_sum(values(d))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     821.163 ns (0.00% GC)
  median time:      875.300 ns (0.00% GC)
  mean time:        983.385 ns (0.00% GC)
  maximum time:     11.437 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     80

I hadn’t heard the name ‘entitlement’ before; where’s it from?

For comparison, here’s a copy of the relevant code in Base in Julia 1.1 (dict.jl):

@propagate_inbounds function iterate(v::Union{KeySet{<:Any, <:Dict}, ValueIterator{<:Dict}},
                                     i=v.dict.idxfloor)
    i = skip_deleted(v.dict, i)
    i > length(v.dict.vals) && return nothing
    (v isa KeySet ? v.dict.keys[i] : v.dict.vals[i], i+1)
end

function skip_deleted(h::Dict, i)
    L = length(h.slots)
    @inbounds while i<=L && !isslotfilled(h,i)
        i += 1
    end
    return i
end

@propagate_inbounds isslotfilled(h::Dict, i::Int) = h.slots[i] == 0x1

So compared to that, you’re manually hoisting the L = length(h.slots) outside of the main iteration loop and collapsing the skip_deleted while loop into the main iteration loop, correct?

I didn’t read all the code here, but the general issue of how to express efficient iteration kernels in an iterator interface brings to mind the transducer approach as a promising alternative. In particular I thought the following comment about foldl implementations was particularly interesting:

See also the documentation section in Transducers.jl regarding the structure of the lowered loops:
https://tkf.github.io/Transducers.jl/dev/#Difference-to-iterators-1

1 Like

@tkoolen in this context entitlement typically means " the amount to which a person has a right", which to say is the best performance one could reasonably expect to get.

You are also very correct in that is what I have done. There is actually a PR that I have #31452 which makes this particular benchmark even faster(2X) for the baseline. But in gerneral I am trying to figure out why the “ideal” example is not getting vectorized like the “entitlement” example.

In this example “entitlement” is just a natural transformation of the “ideal” function, they are both doing the same things.

But like the I said earlier, the general suspicion is there could be on the order of 4x performance for many other high level iterators.
https://github.com/JuliaLang/julia/pull/31452

1 Like

@vchuravy in #31416 you were able to figure out why something wasn’t getting a fast vectorization. Any thoughts on why one the “ideal” function above is slower it can’t be vectorized to be as fast as the entitlement version?