I want to port code using the iteration protocol from 0.6 to 0.7, but I get a huge slowdown.
On Julia 0.6:
function f(iter)
    state = start(iter)
    i = 0
    while !(done(iter, state)) & (i < 1_000_000)
        i += 1
        v, state = next(iter, state)
    end
end
@time f(zeros(1_000_000))
# 0.008861 seconds (2.02 k allocations: 7.734 MiB)
On Julia 0.7:
function f(iter)
    x = iterate(iter)
    i = 0
    while (x !== nothing) & (i < 1_000_000)
        i += 1
        v, state = x
        x = iterate(iter, state)
    end
end
@time f(zeros(1_000_000))
# 0.239210 seconds (3.00 M allocations: 68.649 MiB, 1.95% gc time)
I don't understand why the i < 1_000_000 condition slows performance down that much.
As a user, I have to say that I am not a fan of this new iteration protocol. The previous iteration protocol was clearer and did not have these performance traps.
The compiler currently has some trouble tracking the extra conditional through the branch. I'd recommend either reversing the order of the conditions (and using && as suggested by @bkamins above) or, preferably, not using the raw iteration protocol directly (e.g. this particular case is equivalent to Iterators.take(iter, 1_000_000)).
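As a sketch of the first suggestion, reordering the conditions and switching from the bitwise & to the short-circuiting && leaves the !== nothing test feeding the branch directly, which is what inference needs (a return i is added so the loop isn't compiled away entirely):

```julia
function f_fixed(iter)
    x = iterate(iter)
    i = 0
    # Short-circuit &&, with x !== nothing evaluated last, keeps the
    # nothing-check as the value that directly controls the branch.
    while i < 1_000_000 && x !== nothing
        i += 1
        v, state = x
        x = iterate(iter, state)
    end
    return i
end

f_fixed(zeros(1_000_000))
```

This is the same transformation the benchmark further down the thread applies.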
I don't get it. Why is inference capable of understanding one variant but not the other? Why are there no phi-nodes in the lowered code? Or is phi-nodification of the lowering an ongoing, unfinished project? (Sorry for the naive question about the state of the Julia side of the compiler.)
julia> a = rand(1000);

julia> function g_good(iter)
           i = 0
           x = iterate(iter)
           @label loopentry
           if true && x !== nothing
               i += 1
               v, state = x
               x = iterate(iter, state)
               @goto loopentry
           end
           i
       end
julia> @code_lowered g_good(a)
CodeInfo(
2 1 ─       Core.NewvarNode(:(v))                                        │
  │         Core.NewvarNode(:(state))                                    │
  │         Core.NewvarNode(:(#temp#@_6))                                │
  │         i = 0                                                        │
3 └──       x = (Main.iterate)(Core.Compiler.Argument(2))                │
5 2 ─       goto 4 if not true                                           │
  3 ─       #temp#@_8 = Core.Compiler.Argument(7) !== Main.nothing       │
  └──       goto 5                                                       │
  4 ─       #temp#@_8 = false                                            │
  5 ─       goto 7 if not %%#temp#@_8                                    │
6 6 ─       i = Core.Compiler.Argument(3) + 1                            │
7 │   %12 = Base.indexed_iterate(%%x, 1)::Any                            │
  │         v = (Core.getfield)(Core.SSAValue(12), 1)                    │
  │         #temp#@_6 = (Core.getfield)(Core.SSAValue(12), 2)            │
  │   %15 = Base.indexed_iterate(%%x, 2, %%#temp#@_6)::Any               │
  │         state = (Core.getfield)(Core.SSAValue(15), 1)                │
  │         #temp#@_6 = (Core.getfield)(Core.SSAValue(15), 2)            │
8 │         x = (Main.iterate)(Core.Compiler.Argument(2), Core.Compiler.Argument(5)) │
9 └──       goto 2                                                       │
11 7 ─      return %%i                                                   │
)
julia> function g_bad(iter)
           i = 0
           x = iterate(iter)
           @label loopentry
           if x !== nothing && true
               i += 1
               v, state = x
               x = iterate(iter, state)
               @goto loopentry
           end
           i
       end
julia> @code_lowered g_bad(a)
CodeInfo(
2 1 ─       Core.NewvarNode(:(v))                                        │
  │         Core.NewvarNode(:(state))                                    │
  │         Core.NewvarNode(:(#temp#@_6))                                │
  │         i = 0                                                        │
3 └──       x = (Main.iterate)(Core.Compiler.Argument(2))                │
5 2 ─ %6  = Main.:!==(%%x, Main.nothing)::Any                            │
  └──       goto 4 if not %6                                             │
  3 ─       #temp#@_8 = true                                             │
  └──       goto 5                                                       │
  4 ─       #temp#@_8 = false                                            │
  5 ─       goto 7 if not %%#temp#@_8                                    │
6 6 ─       i = Core.Compiler.Argument(3) + 1                            │
7 │   %13 = Base.indexed_iterate(%%x, 1)::Any                            │
  │         v = (Core.getfield)(Core.SSAValue(13), 1)                    │
  │         #temp#@_6 = (Core.getfield)(Core.SSAValue(13), 2)            │
  │   %16 = Base.indexed_iterate(%%x, 2, %%#temp#@_6)::Any               │
  │         state = (Core.getfield)(Core.SSAValue(16), 1)                │
  │         #temp#@_6 = (Core.getfield)(Core.SSAValue(16), 2)            │
8 │         x = (Main.iterate)(Core.Compiler.Argument(2), Core.Compiler.Argument(5)) │
9 └──       goto 2                                                       │
11 7 ─      return %%i                                                   │
)
julia> function f(iter)
           state = start(iter)
           i = 0
           while (i < 1_000_000) && !(done(iter, state))
               i += 1
               v, state = next(iter, state)
           end
           return i
       end
f (generic function with 1 method)
julia> @btime f(x) setup = x = zeros(1_000_000)
812.663 μs (0 allocations: 0 bytes)
1000000
and on 0.7:
julia> function f(iter)
           x = iterate(iter)
           i = 0
           while (i < 1_000_000) && (x !== nothing)
               i += 1
               v, state = x
               x = iterate(iter, state)
           end
           return i
       end
f (generic function with 1 method)
julia> @btime f(x) setup = x = zeros(1_000_000)
541.775 μs (0 allocations: 0 bytes)
1000000
You need to return i because the whole iteration otherwise gets compiled out (on 0.7).
I do really like the protocol itself, it's very clean and simple for users, but I must say I am mildly terrified of performance traps. Issues like the one demonstrated here should be prominently documented. It wouldn't be even close to obvious to me why @matthieu's code is so incredibly slow on 0.7.
Out of curiosity, is there any hope of improvement on this particular issue before 1.0?
We currently do a very limited form of backwards type inference at branch points, i.e. something like:
cond = x === nothing
if cond
    # x known to be nothing
end
This currently only works with === and only if that === is fairly directly passed into a branch, not obscured by further processing. There are ways to do more general backwards inference, but they are not implemented yet.
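As a minimal illustration of what "directly passed into a branch" means (the function names here are hypothetical, for demonstration only):

```julia
# The === result feeds the branch directly, so inference can
# narrow x to Int inside the if-body.
function direct(x::Union{Int, Nothing})
    if x !== nothing
        return x + 1   # x known to be Int here
    end
    return 0
end

# The === result is combined with another value via the bitwise &,
# so the branch condition is the result of &, not of === itself;
# x stays Union{Int, Nothing} inside the if-body.
function obscured(x::Union{Int, Nothing}, flag::Bool)
    if (x !== nothing) & flag
        return x + 1   # x not narrowed; may dispatch dynamically
    end
    return 0
end
```

Both functions return the same values at runtime; the difference only shows up in inferred types (compare `@code_warntype direct(1)` and `@code_warntype obscured(1, true)`).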
SSA conversion currently happens after type inference, but before optimizations. In the future it may happen earlier, as part of lowering, but not at the moment.
We needed to get the iteration protocol in because it's a breaking change. There are various compiler improvements still to be done, but none of them are breaking, so they're not priorities for 1.0. Until then, we need to continue being a bit careful with union types in performance-sensitive code (just as we need to be careful about type stability). The general guideline here is that Union{T, Nothing} should be compared using === (or !==), and that the result should be used directly as a branch condition. In the future the compiler will likely be better at this, but for now you'll need to follow these guidelines for performance. Also, I continue to recommend using the higher-level iterator abstractions. I know using the raw iteration protocol is tempting, but there are pitfalls (as there were with the old protocol).
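Following those guidelines, a loop over the raw protocol would look like the sketch below (essentially the pattern a for loop lowers to on 0.7):

```julia
function sum_iter(iter)
    s = 0.0
    x = iterate(iter)
    while x !== nothing        # !== result used directly as the condition
        v, state = x
        s += v
        x = iterate(iter, state)
    end
    return s
end

# Or simply let the for loop handle the protocol:
function sum_for(iter)
    s = 0.0
    for v in iter
        s += v
    end
    return s
end
```

Both compile to tight, allocation-free loops over an array because the nothing-check is never obscured.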
It is fairly typical that new features initially have various optimization pitfalls, which get filled in over time as people report problematic cases and the compiler is improved to handle them. After some iteration the feature becomes reliably fast. This process takes a little time, but it happens faster than you might think, and soon you'll forget that the performance was ever unreliable. Bottom line: give it some time. The new iteration protocol is already better than the old one was initially (which, of course, hardly anyone remembers).