# "Functional State" Iterators & Why isn't this loop as fast as entitlement?

TL;DR
High-level iterators (`values(dict)`, etc.) are potentially >3x slower than they could be. Why, and how could they be improved?

Let me lead with the real near-term question: why is the “ideal” iterator >3x slower than the “entitlement” iterator?

``````julia
using BenchmarkTools

function functional_value_iterate(s::Int, L::Int, slots, vals)
    @assert L < typemax(Int)
    if L != typemax(Int)  # redundant with the assert, but helps the compiler prove i+1 cannot overflow
        for i = s:L
            if @inbounds slots[i] == 0x1
                return (@inbounds(vals[i]), i + 1)
            end
        end
    end
    nothing
end

function ideal_iterator_sum(iter::Base.ValueIterator{<:Dict})
    ret = 0
    # Start: hoist the dict fields out of the loop
    d = iter.dict
    slots = d.slots
    vals = d.vals
    L = length(slots)

    state = d.idxfloor
    next = nothing
    @goto iterate
    while true
        (v, state) = next
        ret += v
        @label iterate
        next = functional_value_iterate(state, L, slots, vals)
        if next === nothing
            break
        end
    end
    return ret
end

function entitlement_iterator_sum(iter::Base.ValueIterator{<:Dict})
    ret = 0

    d = iter.dict
    slots = d.slots
    vals = d.vals
    L = length(slots)

    v = valtype(d)(0)
    for state = d.idxfloor:L
        if @inbounds !(slots[state] == 0x1)
            continue
        end
        v = @inbounds vals[state]
        ret += v
    end
    return ret
end
``````

Results:

``````
julia> @benchmark ideal_iterator_sum(values(d))
BenchmarkTools.Trial:
memory estimate:  32 bytes
allocs estimate:  2
--------------
minimum time:     2.699 μs (0.00% GC)
median time:      2.916 μs (0.00% GC)
mean time:        3.469 μs (0.00% GC)
maximum time:     59.679 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

julia> @benchmark entitlement_iterator_sum(values(d))
BenchmarkTools.Trial:
memory estimate:  32 bytes
allocs estimate:  2
--------------
minimum time:     794.000 ns (0.00% GC)
median time:      860.106 ns (0.00% GC)
mean time:        968.261 ns (1.71% GC)
maximum time:     166.960 μs (99.02% GC)
--------------
samples:          10000
evals/sample:     90

``````

Background idea:
I have recently had a strange obsession with the idea that many higher-level iterators can be on the order of >6x slower than their entitlement loops.

One thought was: could there be an iterator paradigm that, with a little magic, would be easier for the JIT to optimize? Mainly by simplifying the `iterate` method of higher-order iterators to remove any prework that they would otherwise have to do on every call.

A solution I thought up is what I am calling a “Functional State” iterator. The idea is that any prework the iterator needs is done by the initial `iterate` call, which creates an anonymous function `f(state)` whose captured values tell the compiler that they do not change. This fits the existing two-argument `iterate` paradigm by being totally encapsulated in the iterator state as a tuple `(f, state)`. Reusing the same anonymous function signals to the compiler that certain optimizations are possible.
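To make the pattern concrete, here is a minimal self-contained sketch of the `(f, state)` idea on a plain `Vector` (the `FSVector` name is made up for illustration; this is not the `Dict` version discussed below):

```julia
# Minimal sketch of a "functional state" iterator over a Vector.
# The first iterate call does the prework (loading the array and its
# length once) and packs a closure into the state tuple.
struct FSVector{T}
    v::Vector{T}
end

function Base.iterate(it::FSVector)
    v = it.v                 # prework: load fields once
    n = length(v)
    f = i -> i > n ? nothing : (@inbounds(v[i]), i + 1)
    next = f(1)
    next === nothing && return nothing
    val, i = next
    return (val, (f, i))     # state is the (f, state) tuple
end

function Base.iterate(::FSVector, st)
    f, i = st                # no prework: just call the closure
    next = f(i)
    next === nothing && return nothing
    val, i = next
    return (val, (f, i))
end

sum(FSVector([1, 2, 3]))  # == 6
```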

So with inlining and magic:
`next = iterate(iter)` would become
`boxed_vars = ...; next = f(initial_state, boxed_vars...)`
and `next = iterate(iter, state)` would become `next = f(state, boxed_vars...)`.

This simplifies the following transformation:

``````julia
next = iterate(iter)
while next !== nothing
    v, state = next
    ....
    next = iterate(iter, state)
end
``````

can be transformed to:

``````julia
boxed_vars = ...
state = initial_state
@goto iterate
while true
    v, state = next
    ....
    @label iterate
    next = f(state, boxed_vars...)
    if next === nothing
        break
    end
end
``````
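A concrete, self-contained instance of that transformed shape (names here are illustrative), summing a `Vector` with a closure `f` that captures the hoisted length:

```julia
# The "boxed vars" (v and n) are captured by the closure once,
# and the loop body only ever calls f(state).
function sum_transformed(v::Vector{Int})
    n = length(v)                        # prework hoisted out of the loop
    f = i -> i > n ? nothing : (@inbounds(v[i]), i + 1)
    ret = 0
    state = 1
    next = nothing
    @goto iterate                        # skip the unpack on the first pass
    while true
        val, state = next
        ret += val
        @label iterate
        next = f(state)
        next === nothing && break
    end
    return ret
end

sum_transformed(collect(1:100))  # == sum(1:100)
```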

The following is the basic implementation:

``````julia
#---------------
struct FunctionalState{I,F<:Function} <: Function
    f::F
end

FunctionalState(iter::I, f::F) where {I, F<:Function} = FunctionalState{I,F}(f)
(fs::FunctionalState)(state) = fs.f(state)

# This is the iterate method for every iterator that produces a FunctionalState.
function Base.iterate(::I, next::Tuple{FunctionalState{I,F},S}) where {I, S, F<:Function}
    fs, state = next
    state === nothing && return nothing
    next = fs.f(state)
    next === nothing && return nothing
    (val, state) = next
    return (val, (fs, state))
end

struct FunctionalStateValueIterator{T<:AbstractDict}
    dict::T
end

Base.length(v::FunctionalStateValueIterator) = length(v.dict)
Base.eltype(v::FunctionalStateValueIterator) = eltype(v.dict)

function functional_state_values(d::Dict)
    FunctionalStateValueIterator(d)
end

function functional_value_iterate(s::Int, L::Int, slots, vals)
    @assert L < typemax(Int)
    if L != typemax(Int)  # redundant with the assert, but helps the compiler prove i+1 cannot overflow
        for i = s:L
            if @inbounds slots[i] == 0x1
                return (@inbounds(vals[i]), i + 1)
            end
        end
    end
    nothing
end

function Base.iterate(iter::FunctionalStateValueIterator{<:Dict})
    d = iter.dict
    slots = d.slots
    vals = d.vals
    L = length(slots)

    # Prework is done once here; the closure captures the hoisted fields.
    f(i::Int) = functional_value_iterate(i, L, slots, vals)
    fs = FunctionalState(iter, f)

    iterate(iter, (fs, d.idxfloor))
end
``````

This works and is a little faster than the current iterator, but I want to get closer to entitlement.


My anecdotal evidence that some high-level iterators are slow compared to entitlement:

``````julia
using BenchmarkTools

d = Dict(v => v for v in 1:1000)

function iterator_sum(f, iter)
    ret = 0
    next = iterate(iter)
    while next !== nothing
        (v, state) = next
        ret += f(v)
        next = iterate(iter, state)
    end
    # Equivalent to:
    # for v in iter
    #     ret += f(v)
    # end
    return ret
end

function entitlement_iterator_sum(iter::Base.ValueIterator{<:Dict})
    ret = 0

    d = iter.dict
    slots = d.slots
    vals = d.vals
    L = length(slots)

    v = valtype(d)(0)
    for state = d.idxfloor:L
        if @inbounds !(slots[state] == 0x1)
            continue
        end
        v = @inbounds vals[state]
        ret += v
    end
    return ret
end

@benchmark sum(values(d))
@benchmark iterator_sum(v -> v, values(d))
@benchmark entitlement_iterator_sum(values(d))
``````

Results:

``````
julia> @benchmark sum(values(d))
BenchmarkTools.Trial:
memory estimate:  32 bytes
allocs estimate:  2
--------------
minimum time:     5.988 μs (0.00% GC)
median time:      6.617 μs (0.00% GC)
mean time:        7.841 μs (0.00% GC)
maximum time:     411.286 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     6

julia> @benchmark iterator_sum(v->v,values(d))
BenchmarkTools.Trial:
memory estimate:  32 bytes
allocs estimate:  2
--------------
minimum time:     7.276 μs (0.00% GC)
median time:      8.681 μs (0.00% GC)
mean time:        9.739 μs (0.00% GC)
maximum time:     398.177 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     3

julia> @benchmark entitlement_iterator_sum(values(d))
BenchmarkTools.Trial:
memory estimate:  32 bytes
allocs estimate:  2
--------------
minimum time:     821.163 ns (0.00% GC)
median time:      875.300 ns (0.00% GC)
mean time:        983.385 ns (0.00% GC)
maximum time:     11.437 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     80
``````

I hadn’t heard the name ‘entitlement’ before; where’s it from?

For comparison, here’s a copy of the relevant code in Base in Julia 1.1 (`dict.jl`):

``````julia
@propagate_inbounds function iterate(v::Union{KeySet{<:Any, <:Dict}, ValueIterator{<:Dict}},
                                     i=v.dict.idxfloor)
    i = skip_deleted(v.dict, i)
    i > length(v.dict.vals) && return nothing
    (v isa KeySet ? v.dict.keys[i] : v.dict.vals[i], i+1)
end

function skip_deleted(h::Dict, i)
    L = length(h.slots)
    @inbounds while i <= L && !isslotfilled(h, i)
        i += 1
    end
    return i
end

@propagate_inbounds isslotfilled(h::Dict, i::Int) = h.slots[i] == 0x1
``````

So compared to that, you’re manually hoisting `L = length(h.slots)` out of the main iteration loop and collapsing the `skip_deleted` while loop into the main iteration loop, correct?

I didn’t read all the code here, but the general issue of how to express efficient iteration kernels in an iterator interface brings to mind the transducer approach as a promising alternative. In particular I thought the following comment about `foldl` implementations was particularly interesting:

See also the documentation section in Transducers.jl regarding the structure of the lowered loops:
https://tkf.github.io/Transducers.jl/dev/#Difference-to-iterators-1
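As a rough illustration of that point, the loop can be owned by the collection via a `foldl`-style function, so the hoisting discussed above happens naturally inside a single method. This is only a sketch (the `value_foldl` name is invented here, it pokes at `Dict` internals via the non-exported `Base.isslotfilled`, and it is not the actual Transducers.jl API):

```julia
# foldl-style reduction over Dict values: the collection owns the loop,
# so d.vals and the index range are loaded once, not per iterate call.
function value_foldl(op, acc, d::Dict)
    vals = d.vals
    for i in d.idxfloor:length(vals)
        Base.isslotfilled(d, i) || continue   # skip empty/deleted slots
        acc = op(acc, @inbounds(vals[i]))
    end
    return acc
end

d = Dict(v => v for v in 1:1000)
value_foldl(+, 0, d)  # == sum(values(d))
```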


@tkoolen, in this context “entitlement” typically means “the amount to which a person has a right”, which is to say the best performance one could reasonably expect to get.

You are also very correct: that is what I have done. There is actually a PR of mine, #31452, which makes the baseline for this particular benchmark even faster (2x). But in general I am trying to figure out why the “ideal” example is not getting vectorized like the “entitlement” example.

In this example, “entitlement” is just a natural transformation of the “ideal” function; they are both doing the same things.

But as I said earlier, my general suspicion is that there could be on the order of 4x performance available for many other high-level iterators.


@vchuravy, in #31416 you were able to figure out why something wasn’t getting a fast vectorization. Any thoughts on why the “ideal” function above is slower and can’t be vectorized to be as fast as the entitlement version?