[RFC/ANN] FLoops.jl: fast generic for loops (foldl for humans™)

Yeah, foldl naturally introduces a function barrier in places where you want things to be fast. Though I think this is kind of by accident. It is easy to do this for normal for loops, too:

# A generic function barrier: `f` gets compiled for the concrete types of `args`.
barrier(f, args...) = f(args...)

barrier(Tables.namedtupleiterator(df)) do itr
    # `itr` has a concrete type inside this closure, so the loop specializes.
    for row in itr
        ...
    end
end

It’s easy to do for normal for loops but the syntax isn’t very natural and there’s no way this would be obvious to new users. It sure would be neat if the naive syntax just worked and was fast!

Is it just a happy coincidence or is there something more interesting going on? I’m sure I don’t know the answer :slight_smile: It’s definitely interesting that a foldl-based lowering would remove the requirement that the eltype be inferrable from the iterator’s type for any iterable which people want to be fast.

BTW, just to clarify the details (as people reading this out of context may expect too much from foldl-based loops), @floop for row in eachrow(df) would not be specialized for the row type; one has to use @floop for row in namedtupleiterator(df) to get a specialized loop. But yes, it’s nice that you don’t need an extra function barrier there.
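
For concreteness, here is a minimal sketch of the difference (assuming DataFrames, Tables, and FLoops are loaded; the data frame and column names are made up):

using DataFrames, FLoops, Tables

df = DataFrame(a = 1:1000, b = rand(1000))

# Not specialized: `row` is a DataFrameRow, so field access is dynamic.
@floop for row in eachrow(df)
    @reduce(s1 += row.a)
end

# Specialized: the rows are concretely typed NamedTuples.
@floop for row in Tables.namedtupleiterator(df)
    @reduce(s2 += row.a)
end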

I guess I just wanted to reemphasize that “zero cost” means “already paid” :slight_smile:


Sometimes it means paid once by the compiler instead of a million times at runtime, though. That can be quite nice.

Right, definitely, I don’t mean to overstate things as they are now! Just thinking aloud about what efficient behaviors could motivate having this as real for loop syntax. For example, iterating table rows for $SomeTable could use a heuristic like “if the number of columns is small, iterate named tuples; otherwise fall back to iterating DataFrameRows.” Maybe that’s a bad idea, but there are certainly some interesting possibilities when the table gets to really control the iteration.


Yes, that’s definitely a happy case for the foldl-based approach (and it’d be the main target). I just tried to bring up the compile-latency problem, as it has to be addressed if I were to shoot for making this the default lowering target (e.g., maybe by hooking into Base.Experimental.@optlevel or something?).

This is great and I appreciate your comments!

Yeah, this sounds like a practical and effective approach. If anyone wants to give this a shot, it can be done by defining asfoldable(::DataFrameRows) here :slight_smile:
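
As a hedged, untested sketch of what that definition might look like (asfoldable is the FLoops hook linked above; parent is assumed here to recover the underlying data frame from the row iterator):

using DataFrames, FLoops, Tables

# Sketch: make @floop iterate concretely typed named tuples automatically.
FLoops.asfoldable(rows::DataFrames.DataFrameRows) =
    Tables.namedtupleiterator(parent(rows))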


Regarding the function barrier “technique”

While I was exploring alternative implementations for Julia, I came across the AnyDSL project, which advocates the use of CPS (continuation-passing style) as a layer before LLVM.

In CPS, the problem of the function barrier goes away, or at least becomes easily solvable by the compiler itself, since dynamic dispatch can be made to occur only at the source of type uncertainty.

I strongly believe that a function barrier could be introduced automatically by the compiler and that the user should not have to care about such considerations.
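
To illustrate the idea in plain Julia, here is a hand-written approximation of what such a compiler could do automatically (unstable and with_value are made-up names):

# The type uncertainty originates here: the return type is Union{Int,Float64}.
unstable() = rand(Bool) ? 1 : 1.0

# Passing the value straight to a continuation puts the dynamic dispatch at
# the source of the uncertainty; the continuation is compiled for the
# concrete type of x, so the loop inside it is type-stable.
with_value(k) = k(unstable())

result = with_value() do x
    s = zero(x)
    for _ in 1:1000
        s += x
    end
    s
end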


Hi, thanks for the link! Very interesting. Indeed, from a quick look, their higher-order-function-based approach looks very close to mine. Some of the things mentioned in the talk were exactly what I had in mind to try at some point (e.g., customizable loop scheduling as a higher-order function).

I agree that the function-barrier technique is not a very user-friendly interface for communicating with the compiler. It’d be nice to have something with a lower syntactic cost than creating a separate function. That said, I’m not sure if we want to do this for every loop automatically. Some loops are OK to be slow, and trying to optimize everything will make the time-to-first-plot problem worse. I think we need some compiler hint to say something like “if some type instability occurs here, I want the loop over there to be outlined and type-stabilized” (or maybe just a way to opt out of this behavior is enough, like @optlevel 0?).


I think you know my views regarding “time to first plot”, what I called “Context Dispatch” in previous posts.
Julia can be made to cache all of its compiler output for easy reuse, but the rules governing the construction of dispatch tables must be changed… this idea encountered all kinds of objections:
“it can’t be done”
“it would be hard to reason about code”
“too late to change now”

I will not go into it now, but it can be done… and so, by some version of Murphy’s law, it will be done sometime in the future.
So in that version of Julia, compilation can take any amount of time, since it happens only once and not in each Julia session.

Roughly, the idea with CPS and type stability is that in CPS form a program is a series of function calls that each do one atomic operation and then call a continuation with the result.

Some atomic operations, like loading a value from global scope or evaluating something with eval, return a value of unknown type.
The compiler can handle this by inserting a dispatch table instead of a typed continuation.

If this were implemented somehow (it is not trivial), it would also handle cases where the type changes once inside the loop and then stays the same until completion.

I think it is equivalent to the rule “a function which may return different types must be called in tail position”, but instead of forcing the programmer to follow it, the compiler enforces it at compile time.

But LLVM uses an SSA-form IR, which can be rewritten as CPS-style code in a functional language; in fact, there is a deep connection between the two (reference: the SSA book).
So does it really matter whether we use CPS style or not?


It doesn’t matter, but you do need some layer of representation that is in CPS form to do the said optimizations, before lowering to LLVM’s SSA form.

In the link I shared they do exactly that.

Even if we have a super smart compiler, I think, in the end, only the programmer knows if a given loop should be specialized or not (before executing the program [*1]). It is conceivable to have a function that does something like

function f(xs)
    s = 0
    for x in xs
        v = Val(x)        # value-dependent type: each distinct x yields a new Val type
        for y in g(v)     # g is some user function whose return type depends on Val(x)
            s += y
        end
    end
    return s
end

I don’t think it’s always beneficial to specialize the for y loop. I also don’t think it’s possible to figure out whether specialization is desired just from the type information.

Additionally, I think there is a subtle problem when introducing function boundaries automatically, as it makes it very easy to hit a stack overflow and makes it unpredictable when that happens. This can probably be avoided by introducing some kind of local counter to track how much auto-specialization has happened within a call stack. However, that makes the performance of a function unpredictable, because how much auto-specialization it can have depends on who calls it.

[*1] So maybe tracing JIT would be a better approach for this?


I just don’t understand. I have checked the project and scanned through their papers. But the examples they use are a little simple, so I don’t know why we should prefer CPS over SSA.
The project you mentioned presents a system, AnyDSL, which uses a CPS-style IR called Thorin. It seems that AnyDSL can do a good job at partial evaluation, but how can it be used to solve type-instability problems? After all, Thorin is functional and statically typed, so it doesn’t need to worry about such things.

@anon56330260

See the following code:


using BenchmarkTools

# Baseline: `eval(0)` hides the type of S from inference, so every
# iteration of the loop dispatches dynamically.
f(n) = begin
    S = eval(0)
    for i = 1:n
        S += 1.0
    end
    S
end

# CPS form: the loop body is a continuation. Dynamic dispatch happens once,
# at the first call to iter_n; after that, the mutually recursive calls are
# specialized for the concrete type of S.
g(n) = begin
    S = eval(0)
    iter_n(S, n, add_S)
end

iter_n(S, n, continuation) = begin
    if n > 0
        continuation(S, n - 1, iter_n)
    else
        return S
    end
end

add_S(S, n, continuation) = begin
    S += 1.0
    continuation(S, n, add_S)
end

@benchmark f(100) # 2.4 us
@benchmark g(100) # 420 ns

g in CPS form is faster, although without some form of tail-call optimization it will produce a stack overflow for large values of n.

I think it would be best to continue any discussion in a separate thread.

Can you elaborate, and perhaps provide an example, of why eachline is not break- and exception-safe?

Are you referring to the fact that the file will be closed only when the EachLine object is garbage collected as documented in the manual?


Yes, that’s what I meant.


Yes, this is not “unsafe” in the sense of the unsafe_ family of functions; it’s actually quite safe compared to that.

The problem with returning a resource with eachline is that resource use and cleanup are not constrained within the call stack of the code which uses it. Instead we must rely on finalizers, which are called later, to clean up resources that the GC doesn’t know about. Non-deterministically finalized iterables used with for loops are only one example of the hard general problem of “not relying on finalizers”.
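
To make the hazard concrete, a small sketch (the filename is made up): breaking out of the loop, or an exception inside it, leaves the file open until the finalizer eventually runs:

for line in eachline("data.txt")    # eachline opens the file itself
    startswith(line, "#") && break  # the file handle stays open after this break
end
# ...the underlying stream is only closed when the GC finalizes the EachLine object.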

Not relying on finalizers while retaining the benefits of garbage collection is hard enough that other GC-based language runtimes don’t have a fully satisfying solution either (that I know of, at least). As far as I know, solutions are generally some form of context manager, like what normal Julia functions coupled with do blocks enable. E.g., C# has the IDisposable interface and the using statement (see “Using objects that implement IDisposable” on Microsoft Learn).


Thanks for elaborating the point. Maybe it’s better to tweak the manual to recommend

open(filename) do io
    for line in eachline(io)
        ...
    end
end

instead? Doing this is probably not super important for throw-away scripts, but it is if you open a lot of files, implement long-running services, etc.


Re: CPS I just found

I guess it’s possible to define iterate/foldl from a single “function” definition syntax (something like Python’s yield, syntax-wise) by somehow hooking into this. Defining both iterate and foldl at the same time would be nice, so that the “iterator” created by this syntax would work outside of the foldl API. It should also be possible to automate something like “Tail-call optimization and function-barrier-based accumulation in loops”.
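
For the sake of illustration, here is what defining both protocols by hand looks like today for a toy Countdown iterator (the point being that a yield-like syntax could generate both definitions from one body; the names here are made up):

struct Countdown
    n::Int
end

# The iterate protocol: works everywhere, but the state machine is explicit.
Base.iterate(c::Countdown, i = c.n) = i <= 0 ? nothing : (i, i - 1)
Base.length(c::Countdown) = c.n
Base.eltype(::Type{Countdown}) = Int

# The foldl protocol: the loop is owned by the collection, so it can use
# plain control flow (and, e.g., try/finally for resources).
function Base.foldl(op, c::Countdown; init)
    acc = init
    for i in c.n:-1:1
        acc = op(acc, i)
    end
    return acc
end

collect(Countdown(3))             # [3, 2, 1]
foldl(+, Countdown(3); init = 0)  # 6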


Also see this old issue on this topic. I still like my original idea there :slight_smile:


Doing this requires the compiler to eliminate the try block when it can prove dispose does nothing. On the other hand, if it is defined as a foldl, the author can explicitly choose whether or not to use the try block. Furthermore, it is much more composable, because foldl can use arbitrary language constructs like try-catch, do, GC.@preserve, etc.
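
For example, a minimal sketch of an eachline-like fold that owns its resource (foldl_lines is a made-up name):

# The collection's author writes the loop, so deterministic cleanup is just
# an ordinary try/finally around it; exceptions from the reducing function
# pass through the finally block too.
function foldl_lines(op, acc, filename::AbstractString)
    io = open(filename)
    try
        while !eof(io)
            acc = op(acc, readline(io))
        end
        return acc
    finally
        close(io)  # runs on normal exit and on throw
    end
end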

Besides, Query.jl is perhaps the framework where it is easiest to use foldl internally, by converting the iterator transforms to transducers (see the “adjoint” conversion I explained in Transducer as an optimization: map, filter and flatten by tkf · Pull Request #33526 · JuliaLang/julia · GitHub). You don’t need to wait for the iterator protocol to change to make resource management robust in Queryverse :slight_smile:
