New iteration protocol much slower


#1

I want to convert an iteration protocol from 0.6 to 0.7 but I obtain a huge slowdown
On julia 0.6:

function f(iter)
   state = start(iter)
   i = 0
   while !(done(iter, state)) & (i < 1_000_000)
      i += 1
      v, state = next(iter, state)
   end
end
@time f(zeros(1_000_000))
#   0.008861 seconds (2.02 k allocations: 7.734 MiB)

On Julia 0.7

function f(iter)
   x = iterate(iter)
   i = 0
   while (x !== nothing) & (i < 1_000_000)
     i += 1
      v, state = x
      x = iterate(iter, state)
   end
end
@time f(zeros(1_000_000))
#   0.239210 seconds (3.00 M allocations: 68.649 MiB, 1.95% gc time)

I don’t understand why the i < 1_000_000 slows down performances that much.
As a user, I have to say that I am not a fan of this new iteration protocol. The previous iteration protocol was clearer and did not have these performance traps.


#2

The MWE is the following I would say:

function g(iter)
   x = iterate(iter)
   while (x !== nothing) && true
       v, state = x
       x = iterate(iter, state)
   end
end
@btime g(zeros(1_000_000))

and it has the same performance penalty, while already this is fast:

function h(iter)
   x = iterate(iter)
   while true && (x !== nothing)
       v, state = x
       x = iterate(iter, state)
   end
end
@btime h(zeros(1_000_000))

#3

You’re mostly measuring compilation time this way, but there is an actual runtime performance issue.

Try

using BenchmarkTools
@btime f(x) setup = x = zeros(1_000_000)

Result on latest nightly: 297.076 ms (2998980 allocations: 61.02 MiB)
Result on 0.6.3: 812.711 ΞΌs (0 allocations: 0 bytes)

Not sure yet what the issue is.

Without the & (i < 1_000_000) almost the whole function is optimized away on 0.7, it seems.


#4

The compiler currently has some trouble tracking the extra conditional through the branch. I’d recommend either reversing the order of the conditions (and using && as suggested by @bkamins above) or preferably not using the raw iteration protocol directly (e.g. this particular case is equivalent to Iterators.take(x, iter) .


#5

I don’t get it. Why is inference capable of understanding one variant, but not the other? Why are there no phi-nodes in the lowered code? Or is phi-nodification of the lowering an ongoing unfinished project? (sorry for the naive question on the state of the julia-side of the compiler)

julia> a=rand(1000); 
julia> function g_good(iter)
              i=0
              x=iterate(iter)
              @label loopentry
              if true && x!==nothing
              i += 1 
              v,state = x
              x=iterate(iter,state)
              @goto loopentry
              end
              i
              end

julia> @code_lowered g_good(a)
CodeInfo(
2  1 ─       Core.NewvarNode(:(v))                                                                                                  β”‚
   β”‚         Core.NewvarNode(:(state))                                                                                              β”‚
   β”‚         Core.NewvarNode(:(#temp#@_6))                                                                                          β”‚
   β”‚         i = 0                                                                                                                  β”‚
3  └──       x = (Main.iterate)(Core.Compiler.Argument(2))                                                                          β”‚
5  2 β”„       goto 4 if not true                                                                                                     β”‚
   3 ─       #temp#@_8 = Core.Compiler.Argument(7) !== Main.nothing                                                                 β”‚
   └──       goto 5                                                                                                                 β”‚
   4 ─       #temp#@_8 = false                                                                                                      β”‚
   5 β”„       goto 7 if not %%#temp#@_8                                                                                              β”‚
6  6 ─       i = Core.Compiler.Argument(3) + 1                                                                                      β”‚
7  β”‚   %12 = Base.indexed_iterate(%%x, 1)::Any                                                                                      β”‚
   β”‚         v = (Core.getfield)(Core.SSAValue(12), 1)                                                                              β”‚
   β”‚         #temp#@_6 = (Core.getfield)(Core.SSAValue(12), 2)                                                                      β”‚
   β”‚   %15 = Base.indexed_iterate(%%x, 2, %%#temp#@_6)::Any                                                                         β”‚
   β”‚         state = (Core.getfield)(Core.SSAValue(15), 1)                                                                          β”‚
   β”‚         #temp#@_6 = (Core.getfield)(Core.SSAValue(15), 2)                                                                      β”‚
8  β”‚         x = (Main.iterate)(Core.Compiler.Argument(2), Core.Compiler.Argument(5))                                               β”‚
9  └──       goto 2                                                                                                                 β”‚
11 7 ─       return %%i                                                                                                             β”‚
)

julia> function g_bad(iter)
              i=0
              x=iterate(iter)
              @label loopentry
              if  x!==nothing && true
              i += 1 
              v,state = x
              x=iterate(iter,state)
              @goto loopentry
              end
              i
              end
julia> @code_lowered g_bad(a)
CodeInfo(
2  1 ─       Core.NewvarNode(:(v))                                                                                                  β”‚
   β”‚         Core.NewvarNode(:(state))                                                                                              β”‚
   β”‚         Core.NewvarNode(:(#temp#@_6))                                                                                          β”‚
   β”‚         i = 0                                                                                                                  β”‚
3  └──       x = (Main.iterate)(Core.Compiler.Argument(2))                                                                          β”‚
5  2 β”„ %6  = Main.:!==(%%x, Main.nothing)::Any                                                                                      β”‚
   └──       goto 4 if not %6                                                                                                       β”‚
   3 ─       #temp#@_8 = true                                                                                                       β”‚
   └──       goto 5                                                                                                                 β”‚
   4 ─       #temp#@_8 = false                                                                                                      β”‚
   5 β”„       goto 7 if not %%#temp#@_8                                                                                              β”‚
6  6 ─       i = Core.Compiler.Argument(3) + 1                                                                                      β”‚
7  β”‚   %13 = Base.indexed_iterate(%%x, 1)::Any                                                                                      β”‚
   β”‚         v = (Core.getfield)(Core.SSAValue(13), 1)                                                                              β”‚
   β”‚         #temp#@_6 = (Core.getfield)(Core.SSAValue(13), 2)                                                                      β”‚
   β”‚   %16 = Base.indexed_iterate(%%x, 2, %%#temp#@_6)::Any                                                                         β”‚
   β”‚         state = (Core.getfield)(Core.SSAValue(16), 1)                                                                          β”‚
   β”‚         #temp#@_6 = (Core.getfield)(Core.SSAValue(16), 2)                                                                      β”‚
8  β”‚         x = (Main.iterate)(Core.Compiler.Argument(2), Core.Compiler.Argument(5))                                               β”‚
9  └──       goto 2                                                                                                                 β”‚
11 7 ─       return %%i                                                                                                             β”‚
)

#6

If I switch the conditions and use &&, I get,

on v0.6:

julia> function f(iter)                                
          state = start(iter)                          
          i = 0                                        
          while (i < 1_000_000) && !(done(iter, state))
             i += 1                                    
             v, state = next(iter, state)              
          end; return i                                
       end                                             
f (generic function with 1 method)                     
                                                       
julia> @btime f(x) setup = x = zeros(1_000_000)        
  812.663 ΞΌs (0 allocations: 0 bytes)                  
1000000                                                

and on 0.7

julia> function f(iter)
          x = iterate(iter)
          i = 0
          while (i < 1_000_000) && (x !== nothing)
            i += 1
             v, state = x
             x = iterate(iter, state)
          end; return i
       end
f (generic function with 1 method)

julia> @btime f(x) setup = x = zeros(1_000_000)
  541.775 ΞΌs (0 allocations: 0 bytes)
1000000

You need to return i because the whole iteration otherwise gets compiled out (on 0.7).


#7

I do really like the protocol itself, it’s very clean and simple for users, but I must say I am mildly terrified of performance traps. Issues like the one demonstrated here should be prominently documented. It wouldn’t be even close to obvious to me why @2LxtkNuVTZof’s code is so incredibly slow on 0.7.

Out of curiosity, is there any hope of improvement on this particular issue before 1.0?


#8

We currently do a very limited form for backwards type inference at branch points, i.e. something like:

cond = x === nothing
if cond
   # x known to be nothing
end

This currently only works with === and only if that === is fairly directly passed into a branch and not obscured by further processing. There’s ways to do more general backwards inference, but they are not implemented yet.

SSA conversion currently happens after type inference, but before optimizations. In the future it may happen earlier, as part of lowering, but not at the moment.


#9

We needed to get the iteration protocol in because it’s a breaking change. There are various compiler improvements to be done still, but none of them are breaking, so they’re not priorities for 1.0. Until, then we need to be continue being a bit careful with union types in performance sensitive code (just as we need to be careful about type stability). The general guidelines here is that Union{T, Nothing} should be compared using === (or !===), and that the result should be directly used as a condition. In the future the compiler will likely better at this, but for now you’ll need to follow the guidelines for performance. Also, I continue to recommend using some of the higher level iterator abstractions. I know using the raw iteration protocol is tempting, but there are pitfalls (as there were with the old protocol).


#10

It is fairly typical that new features have various optimization pitfalls initially which get filled in over time as people report problematic cases and the compiler is improved to support such cases. After some iteration on this the feature becomes reliably fast. This process takes a little time but happens faster than you might think and soon you’ll forget that the performance was ever unreliable. Bottom line: give it some timeβ€”the new iteration protocol is already better than the old one was initially (which of course, hardly anyone remembers).


#11

Please file an issue about this, and hopefully the performance discrepancy can be fixed in a 1.0.x release or before.