Fixing the Piping/Chaining Issue

Well, I think the main advantage of using a unicode binary operator (or maybe hijacking ' in a package) over PR #24990 is that it can be implemented right now without changing the parser. But aside from that, it does basically the same thing as PR #24990 except with an additional noisy character. In order to get a □ >† 5 syntax to work I think we would need a parser change. But if we’re going to do a parser change, then we might as well just do PR #24990. :slight_smile:
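For illustration, a minimal sketch of the Unicode-operator route (the symbol ⊙ is my arbitrary pick here, not one proposed in the thread): Julia's parser already treats many Unicode symbols as infix binary operators, so a package can define one for piping today with no parser change.

```julia
# ⊙ (\odot) already parses as a binary operator, so this works in any
# current Julia -- no parser change required.
⊙(x, f) = f(x)

[1, 2, 3] ⊙ sum ⊙ sqrt  # same as sqrt(sum([1, 2, 3]))
```

Of course, this only pipes into single-argument callables; splicing extra arguments into the call (the □ >† 5 part) is exactly where a macro or parser change would still be needed.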

To clarify, the c() / ' implementation above only captures exactly one function call. Unfortunately, I don’t think it helps with things like _.x > 2.

2 Likes

Yeah, I think the ultimate objective should be to settle on a syntax which can be a proper part of the language, so that the effort to build out the other things that would depend on it can be justified.

Fun sidenote though: for the adjoint-hijacking script, you can use : instead of '. There's hardly an instance where you'd care to send : as an argument to a function anyway.

1 Like

This proposal is interesting so I implemented it in a JuliaSyntax draft PR. You can play with it in a live Julia session/REPL as described here:

I also implemented an (extremely hacky) version of underscore lowering to go along with that PR for comparison.

I admit I’m slightly exhausted by the underscores/piping syntax discussion at this point and I (ahem) have not yet read this whole thread :slight_smile: Though I’ve read enough to see several ideas reinvented here which have also been discussed in the past! For example, the adjoint trick has been done before IIRC.


I can see that the original proposal in this thread has been somewhat-abandoned. But as a demo of how the infrastructure in JuliaSyntax.jl might help with this kind of design discussion I think the PR might be somewhat interesting.

16 Likes

I admit I haven't read this thread fully, only skimmed it, but I didn't see actual examples implemented with the proposed syntax versus with already-existing general solutions in packages. Such side-by-side examples would make comparison much easier.

Personally, I now use (my) DataPipes.jl for piping. Below is an example of tabular data processing with it, showcasing its main features. It looks quite convenient and reads intuitively to me. Is there anything here that could be improved further with new syntax?

using DataPipes, FlexiGroups
using Dates  # Day, Date, dayofweek come from the Dates stdlib

tbl = [(id=rand(1:2), day=Day(rand(1:10)), value=rand()) for _ in 1:30]

@p let
    tbl
    group((;_.id, _.day))
    map() do __
        filter(dayofweek(Date(2022) + _.day) ∈ 1:5)
        @aside @info "foobar" length(__)
        sum(_.value; init=0.)
    end
    @aside total = sum()
    map((val=_ |> ceil |> Int, frac=_ / total))
end

From what I've seen, earlier "builtin currying" proposals only helped with very simple cases. If a package like DataPipes.jl remains needed for anything more complex anyway, it's no big deal to get used to a macro.

That's quite useful indeed, and DataPipes.jl neatly supports this use case (Underscores.jl does too, btw). It fits within builtin Julia pipes, so you can start typing an expression without foreseeing that advanced piping will be needed further down the line: some_long_table |> @f(filter(isodd(_.day)) |> map(_.value)) |> sum.

DataPipes.jl seems cool, but its implementation is distinct from Chain.jl in that it creates an anonymous function.

This highlights a theme in this thread. Some people want only a syntax transformation, like the way %>% works in R: x %>% f(y) is simply syntactic sugar for f(x, y). I don't want to worry about anonymous functions or scoping issues, and Chain.jl lets me do that.
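For concreteness, here is a minimal sketch of the pure syntax-transformation style (the macro name @sugar and its one-step form are hypothetical, not Chain.jl's actual implementation): the rewrite happens at macro-expansion time, so no anonymous function and no new scope is ever created.

```julia
# Rewrite @sugar(x, f(y...)) into f(x, y...), in the spirit of R's %>%.
macro sugar(x, call)
    call isa Expr && call.head == :call || error("expected a function call")
    esc(Expr(:call, call.args[1], x, call.args[2:end]...))
end

@sugar(4, +(3))               # expands to +(4, 3)
@sugar("hello", uppercase())  # expands to uppercase("hello")
```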

Of course, as the maintainer of DataFramesMeta.jl, I’m being hypocritical, since those macros all construct anonymous functions. But those should be avoided if possible.

1 Like

DataPipes only creates anonymous functions when the user explicitly requests them. For example, @p abc |> func(x) |> sum doesn't create any functions; it basically expands to

let
    A = abc
    B = func(x, A)
    C = sum(B)
end

Yes, anonymous functions in Julia can have performance overhead sometimes, and that’s one of the reasons why DataPipes doesn’t create them. Another reason is that it’s just more straightforward to transform to a sequence of operations than to create anonymous functions for each.

Also agreed: that's one of the things I like about data manipulation with generic functions. None of the functions like map/filter/group/join need macros or special syntax; there's only a single top-level macro @p that wraps the whole pipeline to make the syntax more convenient.

Can you expand on this? Do you have any examples in mind?

Potentially related to https://github.com/JuliaLang/julia/issues/15276

I can see why now! :stuck_out_tongue_closed_eyes:

I will look into your code, but I will be slow to get started, so please forgive me.

Now that you've implemented [some version of] it, did you develop any opinion on the competing proposals: /> and \> as in my OP, versus |> and _ as in #24990? I've begun to lean toward the latter, because it seemed friendlier to the parser and it turns out to be a more general solution to the problem of partial evaluation.

Ah, gotcha. It seems to be well-covered in the language manual. Specifically, for a function that defines an inner function that captures a variable r (which in the example is an Int),

“Captured” variables such as r that are shared by inner functions and their enclosing scope are also extracted into a heap-allocated “box” accessible to both inner and outer functions because the language specifies that r in the inner scope must be identical to r in the outer scope even after the outer scope (or another inner function) modifies r. … the parser emits code for box that holds an object with an abstract type such as Any, which requires run-time type dispatch for each occurrence of r.
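A minimal self-contained illustration of the boxing described there (my own example, not from the thread):

```julia
# `r` is reassigned inside the closure, so it must be shared via a
# heap-allocated box, and its type is no longer statically known.
function captured()
    r = 1
    inner() = (r += 1; r)  # reassignment of the captured `r` forces the box
    inner()
    return r               # the outer scope observes the mutation: returns 2
end
```

Running @code_warntype on captured() shows r held in a Core.Box.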

In the case of functor objects like Base.Fix1 and Base.Fix2, or my OP proposals for FixFirst and FixLast, or the newer proposal for Fix, I don’t see how it’s an issue. Someone with more gray hairs can perhaps tell me otherwise.
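For readers who skipped the OP, here is one plausible minimal definition of those functors (my sketch; the thread's actual definitions may differ), mirroring Base.Fix1/Fix2 but varargs-friendly. Unlike a closure over a reassigned variable, these are concrete immutable structs, so nothing gets boxed:

```julia
struct FixFirst{F,X}
    f::F
    x::X
end
(p::FixFirst)(args...) = p.f(p.x, args...)  # fix the first argument

struct FixLast{F,X}
    f::F
    x::X
end
(p::FixLast)(args...) = p.f(args..., p.x)   # fix the last argument

FixFirst(filter, isodd)([1, 2, 3])  # filter(isodd, [1, 2, 3])
FixLast(^, 2)(3)                    # 3 ^ 2
```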

I don’t think this is necessarily a drawback, when seen as a more generic language feature which can be inserted anywhere, called outside any macro, and is useful to inform autocomplete. Certainly functions and macros can be written which make behavior more convenient for specific use cases, but the purpose of the proposal is ultimately to solve problems outside of data pipelines too. (A big one for me is method name autocomplete, as I envy the OOP guys on name discovery when exploring a new API, or when waving my hands around trying to remember names of old as my brain turns to mush.)

Your example seems uniquely suited to making proposal #24990 look bad :laughing: It seems to be largely a result of using almost exclusively functions that take a function as their first argument (so you can pipe into the back with no extra characters), and anonymous functions that nest their argument multiple operations deep (restricting _ to take on only one value per line). In my estimation, the behavior doesn’t seem to generalize well.

I do see the appeal in scoping a block wherein a single object, and successive transformations of it, is the “star of the show” with a special keyword dedicated to it. As mentioned previously, the English language reserves the keyword “it” for such cases.

I think this pipeline isn’t too bad written in normal code though (and is more legible imo):


let it = tbl
    it = group(x->(; x.id, x.day), it)
    it = map(it) do it
        it = filter(x->dayofweek(Date(2022) + x.day) ∈ 1:5, it)
        @info "foobar" length(it)
        sum(x->x.value, it; init=0.)
    end
    total = sum(it)
    it = map(x->(; val=x |> ceil |> Int, frac = x / total), it)
end

There is also a bit of lag in interactive use when you construct a complicated function. It's not big, but it's enough to be a pain in DataFramesMeta.jl. See below: each @transform etc. call makes an anonymous function, and this gets compiled anew every time it's called in global scope.

The compilation time goes away when it's put in a function, of course. But that's not how people use DataFramesMeta.jl.

julia> @time @chain df begin 
           @rtransform :c = :a * :b + 1
           @rorderby :a + 100
           @transform :d = :c .- sum(:c)
       end;
  0.134528 seconds (204.70 k allocations: 10.776 MiB, 98.11% compilation time)

julia> @time @chain df begin 
           @rtransform :c = :a * :b + 1
           @rorderby :a + 100
           @transform :d = :c .- sum(:c)
       end;
  0.133628 seconds (204.70 k allocations: 10.777 MiB, 98.04% compilation time)

julia> function clean(df)
           @time @chain df begin 
               @rtransform :c = :a * :b + 1
               @rorderby :a + 100
               @transform :d = :c .- sum(:c)
           end;
       end;

julia> @time clean(df);
  0.110371 seconds (151.90 k allocations: 7.739 MiB, 98.72% compilation time)
  0.257646 seconds (716.01 k allocations: 37.929 MiB, 3.71% gc time, 99.41% compilation time)

julia> @time clean(df);
  0.000401 seconds (644 allocations: 51.656 KiB)
  0.022266 seconds (5.71 k allocations: 322.674 KiB, 97.48% compilation time)
1 Like

Do you think compilation time is a good design constraint to apply to a language feature such as this? I had previously thought not, but I could be wrong!

Part of my thinking too is that for partial applications of only one or two arguments, specializations could be used which might compile faster. Haven’t thought this one through entirely.

As I wrote above, transparent syntax transformations are much preferred, by me, to anything creating functions on the fly.

Yes well, faster is always better than slower. The question is, at what cost.

I’ve listed a few downsides to creating functions on the fly as a form of piping.

  1. Hard to reason about
  2. Could change behavior (definitely changes scoping)
  3. Might cause slower interactive use

You are free to continue advocating for Fix-style function creation. How you weigh the costs and benefits of each method is up to you.

Meanwhile, the downsides to not creating functions on the fly are

  1. Creating dedicated syntax that happens to be perfect for creating partial functions, but not using it to create partial functions.

If there’s enough benefit to it, it could be interesting to add functionality to, say, the do keyword to force a syntax transformation when chaining:


[1, 2, 3] do filter(isodd,_); map(_^2, _); sum; sqrt end

# Acts like

[1, 2, 3] |> filter(isodd,_) |> map(_^2,_) |> sum |> sqrt

# But through syntax transformation instead of function creation

so we could get partial functions and have it compile fast when we don’t want them.

There is no reason you can’t use the syntax to create partial functions too.

@foo a |> b(c,_,d)

could evaluate to a value, while

@foo b(c,_,d)

evaluates to a curried function
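A sketch of how that split could work (the macro name @foo comes from the comment above; everything else here is my hypothetical implementation, and the real #24990 scoping rules for _ are subtler than this): substitute _ textually when the expression is a |> chain, otherwise build a partial function.

```julia
# Recursively substitute `_` with `val` throughout an expression tree.
subst(ex, val) = ex === :_ ? val :
    ex isa Expr ? Expr(ex.head, (subst(a, val) for a in ex.args)...) : ex

macro foo(ex)
    if ex isa Expr && ex.head == :call && ex.args[1] == :(|>)
        # a |> b(c, _, d)  becomes  b(c, a, d): a pure syntax transformation
        esc(subst(ex.args[3], ex.args[2]))
    else
        # b(c, _, d)  becomes  x -> b(c, x, d): a curried function
        x = gensym(:x)
        esc(:( $x -> $(subst(ex, x)) ))
    end
end

@foo 10 |> -(100, _)   # expands to -(100, 10)
f = @foo -(100, _)     # a partial function; f(10) gives -(100, 10)
```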

1 Like

I apologize if I’m being obtuse, but is a couple hundred milliseconds unacceptable for interactive use?

And I don’t know the package, but from a quick glance it seems like the call to clean(df) is creating three anonymous functions. The time it takes to do such a thing doesn’t seem to dominate (assuming this is a legitimate way to determine compile time, I need validation):


julia> @btime eval(:(  θ -> cos(θ)+im*sin(θ)  ))
  442.700 μs (191 allocations: 10.39 KiB)
#113347 (generic function with 1 method)

julia> @btime eval(:(  (θ -> cos(θ)+im*sin(θ))(1.)  ))
  3.351 ms (7790 allocations: 435.34 KiB)
0.5403023058681398 + 0.8414709848078965im

Using this approach, it does appear that creating a Fix object requires extra time compared with simply calling the function it partially evaluates, but not as much as creating an anonymous function:

julia> @btime eval(:(  w(x, y) = x + y  ))
  177.800 μs (137 allocations: 7.39 KiB)
w (generic function with 1 method)

julia> @btime eval(:(  w(1, 2)  ))
  46.800 μs (35 allocations: 1.77 KiB)
3

julia> @btime eval(:(  Base.Fix1(w,1)(2)  ))
  68.900 μs (49 allocations: 2.42 KiB)
3

julia> @btime eval(:(  FixFirst(w,1)(2)  ))
  57.000 μs (42 allocations: 2.11 KiB)
3

julia> @btime eval(:(  Fix{(1,)}(w,(1,))(2)  ))
  86.900 μs (68 allocations: 3.27 KiB)
3

julia> @btime eval(:(  (y->w(1, y))(2)  ))
  2.135 ms (1464 allocations: 88.44 KiB)
3

Note that this is the same Fix functor defined above; it has not yet been optimized further.
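(For context, since the Fix definition lives earlier in the thread and isn't reproduced here: below is a hedged reconstruction consistent with the Fix{positions}(f, fixed_args) call syntax used in these benchmarks; the actual definition may differ in its details.)

```julia
# Fix{(2,)}(^, (2,)) fixes argument position 2 of ^ to the value 2,
# i.e. it behaves like x -> x ^ 2.
struct Fix{P,F,T}
    f::F
    xs::T
end
Fix{P}(f::F, xs::T) where {P,F,T} = Fix{P,F,T}(f, xs)

function (c::Fix{P})(args...) where {P}
    n = length(P) + length(args)
    full = Vector{Any}(undef, n)
    for (p, x) in zip(P, c.xs)  # drop the fixed values into their slots
        full[p] = x
    end
    j = 1
    for i in 1:n                # fill the remaining slots left to right
        isassigned(full, i) && continue
        full[i] = args[j]
        j += 1
    end
    c.f(full...)
end

Fix{(2,)}(^, (2,))(3)                   # ^(3, 2)
Fix{(1,)}(filter, (isodd,))([1, 2, 3])  # filter(isodd, [1, 2, 3])
```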

If I’m testing this incorrectly, please let me know, as I have not before tried to benchmark compile times.

Edit: there’s more nuance to compile time, see below.

1 Like

Sorry, I got to this late and can’t read all 158 messages to know if this is redundant, but FWIW, I can never remember by looking at it which is the back or forward slash. And my personal opinion is that \> visually looks like it is leaning toward the first argument and /> looks like it’s leaning toward the last.

I’d use this feature 1000% either way, but it would definitely be easier for me to remember if we went for visual instead of aural mnemonics!

3 Likes

A bit more compile-time benchmarking…

# State of the art, no piping:
julia> @btime eval(:(  map(Base.Fix2(^,2), filter(isodd, [1,2,3]))  )) 
  91.500 μs (65 allocations: 3.42 KiB)
2-element Vector{Int64}:
 1
 9

# OP proposal:
julia> @btime eval(:(  FixFirst(map, FixLast(^,2))(FixFirst(filter, isodd)([1,2,3]))  )) 
  93.600 μs (70 allocations: 3.70 KiB)
2-element Vector{Int64}:
 1
 9

# SOTA + piping
julia> @btime eval(:(  [1,2,3] |> Base.Fix1(filter, isodd) |> Base.Fix1(map, Base.Fix2(^,2))  )) 
  125.600 μs (91 allocations: 4.77 KiB)
2-element Vector{Int64}:
 1
 9

# OP proposal for partial functions, but using piping
julia> @btime eval(:(  [1,2,3] |> FixFirst(filter, isodd) |> FixFirst(map, FixLast(^,2))  )) 
  95.700 μs (70 allocations: 3.80 KiB)
2-element Vector{Int64}:
 1
 9

# New proposal for general-purpose functor w/ piping (à la #24990)
julia> @btime eval(:(  [1,2,3] |> Fix{(1,)}(filter,(isodd,)) |> Fix{(1,)}(map,(Fix{(2,)}(^,(2,)),))  )) 
  170.400 μs (146 allocations: 7.31 KiB)
2-element Vector{Int64}:
 1
 9

# piping into anonymous functions
julia> @btime eval(:(  [1,2,3] |> x->filter(isodd, x) |> x->map(Base.Fix2(^,2), x)  )) 
  3.439 ms (6780 allocations: 387.51 KiB)
2-element Vector{Int64}:
 1
 9

# one anonymous function, fed to `filter`, no pipes
julia> @btime eval(:(  map(Base.Fix2(^,2), filter(x->x%2==1, [1,2,3]))  ))
  8.452 ms (11000 allocations: 567.00 KiB)
2-element Vector{Int64}:
 1
 9

# one anonymous function, fed to `map`, no pipes
julia> @btime eval(:(  map(x->x^2, filter(isodd, [1,2,3]))  )) 
  19.296 ms (48195 allocations: 2.47 MiB)
2-element Vector{Int64}:
 1
 9

# `map` and `filter` on anonymous functions, no pipes
julia> @btime eval(:(  map(x->x^2, filter(x->x%2==1, [1,2,3]))  ))
  30.295 ms (59130 allocations: 3.02 MiB)
2-element Vector{Int64}:
 1
 9

# all anonymous functions w/ pipes
julia> @btime eval(:(  [1,2,3] |> x->filter(x->x%2==1, x) |> x->map(x->x^2, x)  ))
  29.688 ms (64845 allocations: 3.34 MiB)
2-element Vector{Int64}:
 1
 9

Look at those last four! :scream: Why’s it so slow!?

@dlakelan looks like I owe you an apology: it appears anonymous functions can sometimes take glacial amounts of time to compile (not sure when or why yet). That's huge motivation to use partial application functors (e.g. Fix) instead where possible.

Edit: there’s more nuance to compile time, see below.

1 Like

Any time you can get the effect by syntax rearrangement, it will have zero overhead at runtime, so I think it's definitely preferable.