Tail-call optimization and function-barrier -based accumulation in loops

tkf · June 29, 2019, 4:20am

Tail-call optimization/elimination was brought up in several threads like this one in discourse and this one in GitHub. The main interest seems to be purely based on the nice-to-have performance benefit and the core devs seem to be open to the change while assigning very low priority because writing for/while-loop is more idiomatic in Julia.

But I think there is another point of view related to function-barrier -based accumulation which makes me wonder if the tail-call optimization actually makes sense in Julia. By “function-barrier -based accumulation” I mean the pattern in Julia to invoke recursion in the middle of the loop when the type of an accumulation variable is changed or required to change. This pattern is used, e.g., in the implementation of collect

github.com

JuliaLang/julia/blob/cef655a68a4e824cbe7cca582256d61aa72eb725/base/array.jl#L646-L663


      
          function collect_to!(dest::AbstractArray{T}, itr, offs, st) where T
              # collect to dest array, checking the type of each result. if a result does not
              # match, widen the result type and re-dispatch.
              i = offs
              while true
                  y = iterate(itr, st)
                  y === nothing && break
                  el, st = y
                  if el isa T || typeof(el) === T
                      @inbounds dest[i] = el::T
                      i += 1
                  else
                      new = setindex_widen_up_to(dest, el, i)
                      return collect_to!(new, itr, i+1, st)
                  end
              end
              return dest
          end

As you can see from the quoted example, this kind of recursion happens at the tail-call position because the whole purpose of this is to resume the loop while introducing the function-barrier.

Generalizing this idea, I was tempted to define a foldl implementing the function-barrier -based type widening/refinement:

function foldlfb(op, init, iterator)
    ret = iterate(iterator)
    ret === nothing && return init
    x, state = ret
    return foldlfb_impl(op, op(init, x), iterator, state)
end

function foldlfb_impl(op, acc::T, iterator, state) where T
    while (ret = iterate(iterator, state)) !== nothing
        x, state = ret
        y = op(acc, x)
        y isa T || return foldlfb_impl(op, y, iterator, state)
        acc = y
    end
    return acc
end

This works well for the case like collect

julia> foldlfb(vcat, Union{}[], [1.0, missing, nothing])
3-element Array{Union{Missing, Nothing, Float64},1}:
 1.0
  missing
  nothing

(Of course, vcat is not practically useful because it copies the array all the time. One would need something like store!_or_widen by @Tamas_Papp to get a good performance.)

However, this foldlfb works nicely only if the reducing function op “monotonically” (in a loose sense) changes the output type. For example, vcat always increase the eltype and that’s why I described that it “monotonically” changes the type. However, not all functions have this property. A simple example would be:

julia> typechanging(::Any, x) = Val(x % 3)
typechanging (generic function with 1 method)

julia> foldlfb(typechanging, Val(0), 1:10)
Val{1}()

julia> foldlfb(typechanging, Val(0), 1:10^4)
ERROR: StackOverflowError:

So, this naive implementation of foldlfb is not robust enough for practice usage.

Notice that tail-call optimization could have eliminated the StackOverflowError. It would likely generate a non-type stable code which can be fine in many cases. If a user is passing type changing op it’s likely that what s/he is expecting.

This led me to the idea that:

Tail-call optimization actually makes sense in Julia because the function-barrier is a common idiom to drop into a highly optimized computing kernel in the middle of dynamic code. So, why not making easy to drop into highly optimized loop during initially non-type stable loop?

Of course, this idea would be irrelevant if Julia can automatically optimize a loop

for x in iterator
    acc = op(acc, x)
end

to implement the strategy that is manually done using function-barrier. If there are optimizations that can be done with tail-calls then maybe these are already doable in loops? At the same time, I wonder if the tail calls can be used as a way to communicate with the compiler that the programmer is writing a loop that may be type-stabilized at some point.

A very mechanical and naive transformation you can do to foldlfb_impl is to add a counter to limit the tail calls

function foldlfb_impl(op, acc::T, iterator, state, limit::Val{n}=Val(10)) where {T, n}
    while (ret = iterate(iterator, state)) !== nothing
        x, state = ret
        y = op(acc, x)
        n === 0 || y isa T ||
            return foldlfb_impl(op, y, iterator, state, Val(n-1))
        acc = y
    end
    return acc
end

This lets you execute foldlfb(typechanging, Val(0), 1:10^4) now. As the above transformation is very mechanical, I’d imagine that foldlfb_impl defined using a tail call like the following can be transformed into the above version using the loop with limited recursion.

function foldlfb_impl(op, acc, iterator, state)
    ret = iterate(iterator, state)
    ret === nothing && return acc
    x, state = ret
    return foldlfb_impl(op, op(acc, x), iterator, state)
end

Does such optimization make sense? Would it be useful? Is it implementable in more general cases that the compiler has to handle? Are there any better strategies than simply limiting the recursions?

Tamas_Papp · June 29, 2019, 6:12am

I think you bring up a very important issue. I find that avoiding or minimizing calculations on types is the key to writing performant and idiomatic Julia code, since type calculations is something that is best left to the compiler.

In the store_or_widen! example you linked, this is done by widening types on demand, otherwise accumulating into a container with a fixed type. I find that the compiler handles this very well, since the types are always stable within a function, as widening always calls another function.

I agree that TCO could make code like this more elegant, but I think that TCO is merely a special case of a more general pattern that can be implemented using existing constructs: all that this pattern requires is that type changes always break out into another function, regardless of their position. I would suggest that we explore existing options instead of insisting on the compiler formally handling TCO, as I imagine this could preclude further optimizations (ie the compiler giving up in nasty cases).

Incidentally, 1.2 improved a lot on how this is handled, among other things.

tkf · June 29, 2019, 7:37am

The observation I wanted to emphasize was that this does not work (i.e., overflows the stack) if you apply this to the function that has to break out in the middle of the loop and the way that the loop body changes the type is non-monotonic.

Then, I think the question will be that if the pattern for writing loops and their abstractions is worth a special treatment in the compiler. IIUC, the current iteration protocol is developed hand in hand with the small union optimization in the compiler. So maybe supporting another looping construct makes sense as well? OK, small union is important for other contexts so the analogy is not very direct. Even so, type stable efficient loop sounds like enough motivation for compiler optimization if there is no easy way to do it.

Tamas_Papp · June 29, 2019, 7:50am

Ideally the number of times you have to do this is limited, and 0 for type stable code, 1-2 for small Unions, etc. With extending types, you should quickly reach Any.

Contrived examples aside, I don’t think that stack overflows should be a practical concern. If they are, one should build in a limit like your code above. Designing for the arbitrary depth case may be not be a good trade-off between compilation and runtime.

tkf · June 29, 2019, 8:41am

I get that the example using Val was contrived. I wanted to give a very simple demonstration. Here is less contrived example that still throws a StackOverflowError

badthreshold(::Any, x) =
    if x > 1
        missing
    elseif x > 0
        x
    else
        0
    end
xs = [-1, 0.5, 2][mod1.(1:10^5, 3)]
foldlfb(badthreshold, nothing, xs)

My concern is coming from Transducers.jl which implements various foldl and transformations to compose op. I don’t think it is crazy to assume that acc may change type every few steps with some non type stable functions provided by users. Of course, “wait for practical problems” is the best way for situation like this and that’s what I’ll do anyway. I just thought it also makes sense to start discussion on the potential problem and opportunity, especially because it seemed that this point was missing from the previous discussions.

Tamas_Papp · June 29, 2019, 8:49am

I don’t think it is crazy either. But neither I am convinced that Julia should strive to support programming styles like this, given the cost (in terms of developer and compile time).

My understanding is that it is possible to come up with examples that feature a combinatoric explosion of types in many contexts. I think in most cases, the right solution is to just bail to dynamic dispatch and emit “suboptimal” code, and make the user redesign the approach to suit Julia better.

tkf · June 29, 2019, 7:37pm

Usually, you have to opt-out suboptimal code with many Anys to make your code faster. The problem here is unique because you have to opt-in suboptimal code.

I think automatically degrading to dynamic code is the basic property of Julia. That’s how you mock up things fast and then improve the hot-spot later. This property is broken when you combine loop and function-barrier but TCO can rescue it.

I am also not sure if this is worth the effort. As I said in the OP, if this kind of optimization can happen in a naive loop without function-barrier, that would be much better. Having said that, I am curious to know if I managed to increase the priority of TCO (even slightly).

tkf · September 15, 2019, 1:47am

I just find that Guy Steele is discussing something very similar in the first half of this talk:

ninjaaron · September 25, 2019, 5:44pm

Just want to chime in that, though I can’t contribute much to this thread, I have enjoyed reading it very much. TCO is relevant to my interests. Very interesting conversation!

Olof_Salberger · October 10, 2019, 5:56am

Just also chiming in that if TCO is enabled, in a language with macros you will immediately start seeing abstractions that rely on translating to CPS and making use of it (not necessarily a bad thing and the fact that we have green threads in the language already makes it less likely to be “abused” for things like being able to do even basic async programming in the old python versions where this was a hot issue).

Also, there’s the old python objection about stack frames & stack traces. That’s fixable by introducing a specific keyword (like @tailcall) for any desired tail recursion instead of having it be default, if you view it as a problem.

Although, if you are using a keyword, then the tail call optimization functionality can be implemented with a trampoline macro.

tkf · October 10, 2019, 6:32am

That’s actually the next thing I wanted to look at. When combined with Union splitting, I was thinking that it may compile to the code as good as function-barrier (or something that is good enough). But this probably still is sub-optimal compared to the finite state machine the compiler can in principle construct (as explained in Guy Steele’s talk). I don’t know how much it matters, though.

anon92994695 · January 15, 2021, 11:57pm

Has there been any lift in this area? I mean, it maybe is a lot of work, but - lots of algorithms are so easy to express recursively.

Oscar_Smith · January 15, 2021, 11:58pm

I don’t think there has. That said, it wouldn’t be too hard to write a macro that re-writes tail calls as a loop.

tkf · January 16, 2021, 3:20am

I’d say using TCE for writing a simple loop is not really a point of my opening post. The point (which became clear to me after watching Guy Steele’s talk) is rather to use it to build FSM based on Julia’s method dispatch. This is not doable with macros and I think there’s some chance that it is attractive enough for convincing Julia devs.

Topic		Replies	Views
Tail-call recursion Performance	41	3929	December 15, 2022
[RFC/ANN] FLoops.jl: fast generic for loops (foldl for humans™) Package Announcements	42	8508	April 29, 2020
Recursive call vs while loop Performance recursion	21	12265	April 12, 2022
Performance comparison, Julia is a slower than Mojo, am I doing something wrong? Performance question , mojo	44	8036	September 28, 2023
Yet another TCO thread Performance recursion	5	2399	May 25, 2022

Tail-call optimization and function-barrier -based accumulation in loops

Related topics