[RFC] PipelessPipes.jl (now Chain.jl)

jules · November 19, 2020, 4:25pm

PipelessPipes.jl is not a package yet, as I would be interested in hearing some opinions first. Here are my thoughts about Pipe.jl and Lazy.jl, both of which I have tried previously:

Pipe.jl

pros

_ syntax is nice and non-magic

cons

I don’t like typing |>
The last entry of a pipe can’t have a trailing |>, making it harder to comment out parts
If there’s an error in one part, the whole expression errors and VSCode can’t highlight the specific part
The end of the pipeline just “dangles” in terms of indentation, which means the IDE always tries to indent the following lines as well
You can’t easily interject an arbitrary statement to check what’s going on inside the pipe

Lazy.jl’s `@>`:

pros

begin / end is nice for indentation after
no |> necessary

cons

Not every function takes the piped value as its first argument, e.g. filter (nor necessarily as its last, which is what the other option, @>>, assumes)
Looks a bit too magic for me when the first argument is not visible at all

PipelessPipes.jl

edit: As I’ve made some changes since the first post, and have registered those changes now, here’s the relevant excerpt from the readme

PipelessPipes defines the @_ macro. It takes a start value and a begin ... end block of expressions.

The result of each expression is fed into the next one using one of two rules:

There is at least one underscore in the expression

every _ is replaced with the result of the previous expression

There is no underscore

the result of the previous expression is used as the first argument in the current expression, as long as it is a function call or a symbol representing a function.

Lines that are prefaced with @! are executed, but their result is not fed into the next pipeline step.
This is very useful to inspect pipeline state during debugging, for example.

Example with a DataFrame:

using DataFrames, PipelessPipes

df = DataFrame(group = [1, 2, 1, 2], weight = [1, 3, 5, 7])

result = @_ df begin
    filter(r -> r.weight < 6, _)
    groupby(:group)
    @! println("There are $(length(_)) groups after step 2.")
    combine(:weight => sum => :total_weight)
end

The pipeless block is equivalent to this:

result = let
    var1 = filter(r -> r.weight < 6, df)
    var2 = groupby(var1, :group)
    println("There are $(length(var2)) groups after step 2.")
    var3 = combine(var2, :weight => sum => :total_weight)
end

For debugging, it’s often useful to look at values in the middle of a pipeline.
You can use the @! macro to mark expressions that should not pass on their result.
For these expressions there is no implicit first argument spliced in if there is no _, because that would be impractical for most purposes.
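To make the two rules and the @! behavior concrete outside of DataFrames, here is a minimal sketch of how such a macro could be written with Base Julia only. The name @mini_chain and the helper functions are made up for this illustration; this is not the package's actual source, and it ignores most edge cases.

```julia
# Hypothetical sketch of the rewriting rules; NOT the package's actual source.

# True if a bare `_` appears anywhere in the expression.
has_underscore(x) = x === :_
has_underscore(e::Expr) = any(has_underscore, e.args)

# Replace every `_` with the symbol that holds the previous result.
replace_underscore(x, prev) = x === :_ ? prev : x
replace_underscore(e::Expr, prev) =
    Expr(e.head, map(a -> replace_underscore(a, prev), e.args)...)

# Rewrite one pipeline step so that it consumes `prev`.
function rewrite_step(ex, prev)
    if has_underscore(ex)
        return replace_underscore(ex, prev)        # rule 1: explicit `_`
    elseif ex isa Symbol
        return :($ex($prev))                       # rule 2: bare function name
    elseif ex isa Expr && ex.head === :call
        args = ex.args
        # keep a leading `; kw = ...` parameters block ahead of the positional args
        has_params = length(args) >= 2 && args[2] isa Expr && args[2].head === :parameters
        i = has_params ? 3 : 2
        return Expr(:call, args[1:i-1]..., prev, args[i:end]...)  # rule 2: first argument
    else
        error("cannot insert the previous result into `$ex`")
    end
end

macro mini_chain(start, block)
    (block isa Expr && block.head === :block) || error("expected a begin ... end block")
    prev = gensym(:step)
    stmts = Any[:($prev = $start)]
    for ex in block.args
        if ex isa LineNumberNode
            push!(stmts, ex)                       # keep source locations for error messages
        elseif ex isa Expr && ex.head === :macrocall && ex.args[1] === Symbol("@!")
            # `@!` line: replace `_`, run it, but do not pass its result on
            push!(stmts, replace_underscore(ex.args[end], prev))
        else
            next = gensym(:step)
            push!(stmts, :($next = $(rewrite_step(ex, prev))))
            prev = next
        end
    end
    push!(stmts, prev)                             # value of the whole block
    esc(Expr(:let, Expr(:block), Expr(:block, stmts...)))
end

xs = [1, 2, 3, 4, 5]

result = @mini_chain xs begin
    filter(isodd, _)                               # rule 1
    map(x -> x^2, _)                               # rule 1
    @! println("squares of the odd entries: ", _)  # side effect only
    sum                                            # rule 2: sum(previous result)
end
# result == 35
```

Because every step becomes its own assignment and the line number nodes are kept, an error in one step points at that step's line rather than at the whole expression.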

For the error message point, look at the difference to Pipe.jl here from some random code of mine:

PipelessPipes.jl:

[screenshot not preserved]

Pipe.jl:

[screenshot not preserved]

Design Questions

I wonder what the best name for the macro is, because all the good ones are already taken by other packages. I like @_ because it’s really short, although it would conflict with Underscores.jl, if one wanted to use both.

The other question is if this ruleset has some obvious problems to you. I haven’t found serious issues so far, but maybe I’ve overlooked some circumstances under which this expression parsing would break.

pdeffebach · November 19, 2020, 4:29pm

It’s good to see more experimentation in the Piping space. Here are two thoughts

It would be great if you could optionally omit the _ and have that default to the first argument, but then override that if the user provides _ anywhere.
Having a trailing |> can be annoying, but if you are copying and pasting to the REPL it’s easier to use |> because otherwise you need to add an end which is annoying.

It would be great if it came in a “block” form and a “magrittr” form.

tbeason · November 19, 2020, 4:35pm

My first thought was very against this. I’ve reread it and now I kind of like it. I do use Underscores.jl more and more often, and for things that aren’t necessarily data pipelines, so I’m not a fan of the conflicting symbols unless you are going to provide a somewhat substantial replication of its behavior. How does your proposal fare outside of DataFrames usage?

oxinabox · November 19, 2020, 5:04pm

Added to the big list
https://github.com/JuliaLang/julia/issues/5571#issuecomment-205754539

Marc.Cox · November 19, 2020, 8:21pm

Julius @jules, thank you very much for this RFC on enhancing the earlier Linux/Unix-style pipe syntax, which is a very helpful foundation and a natural fit for distributed processing. In this iteration I hope we can elevate (or is that continue? see below **) the language syntax to a slightly higher abstraction layer using mathematical notation, while keeping the fast and efficient automatic vectorization machinery hidden/encapsulated. See: Automatic vectorization - Wikipedia

So I would also request syntax support for function composition (see Function composition (computer science)), which is described like this:

" The ability to easily compose functions encourages factoring (breaking apart) functions for maintainability and code reuse. More generally, big systems might be built by composing whole programs ."

And so here is another example in a similar proposal that seems to have come out of some previous RFC / design work :

It’s especially important ( to provide an alternative to current |> syntax ) when you are using Julia for data analysis, where you commonly have data transformation pipelines.

In particular, Pandas in Python is convenient to use because you can write things like df.groupby("something").aggregate(sum).std().reset_index(), which is a nightmare to write with the current |> syntax.

@@ Function chaining · Issue #5571 · JuliaLang/julia · GitHub

** (or continue? see this:) If what is described in “My mental load using Julia is much higher than, e.g., in Python. How to reduce it? - #5 by rdeits” is already implemented, as it appears to be, it would already address what I’m requesting here, per this statement:

I think what you’re talking about is “fluent interfaces”, and it’s true that we don’t really do that in Julia, at least not in the same way. You might find the |> operator useful, since you can do:


> x |> f |> g

as an alternative notation for g(f(x)) .
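A quick check of that equivalence in plain Julia, using two made-up functions f and g. It also shows Base's ∘ operator, which builds the composed function directly:

```julia
# f and g are hypothetical example functions.
f(x) = x + 1
g(x) = 2x

x = 3
piped    = x |> f |> g     # pipe: apply f, then g
nested   = g(f(x))         # ordinary nested call
composed = (g ∘ f)(x)      # function composition from Base

all_equal = piped == nested == composed   # true, each one is 8
```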

However, as @rdeits has previously noted, there is a “missing” functionality: notably, in a fluent interface, there is no way to chain a function that isn’t a method of the returned object. So if sort() is a generic function (and not a method of whatever drawInContext returns), then you can’t do:

	line.mirror.drawIncontext(ctx).sort()

So if all of the above has been said and done and the g(f(x)) notation is in place, then for this RFC I’d like the proposed PipelessPipes.jl package design to be mostly about Julia compiler / LLVM optimizations, similar to JuliaFolds · GitHub, that compile the new functional composition syntax and automatically work out whether and how the nested functions can be SIMD-vectorized **, parallelized, or threaded. As a starter for more background, consider searching for and reading about SIMD designs:

The hardware handles all alignment issues and “strip-mining” of loops. Machines with different vector sizes would be able to run the same code.[6] Clang/LLVM (LLVM is used by Julia) calls this vector type “vscale”.

-and-

https://discourse.julialang.org/search?q=SIMD

Re: existing vectorized (dot) syntax ** (it would be highly desirable to keep this compatible with the proposed new functional composition syntax in the package design):

https://docs.julialang.org/en/v1/manual/functions/#man-vectorized
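For reference, a small sketch of how the existing dot syntax already combines with piping in Base Julia; f and g are made-up example functions:

```julia
# f and g are hypothetical example functions.
f(x) = x + 1
g(x) = 2x

xs = [1, 2, 3]
dotted = g.(f.(xs))        # broadcast both calls element-wise
piped  = xs .|> f .|> g    # broadcasted pipe, same element-wise result

same = dotted == piped     # true, both are [4, 6, 8]
```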

HTH

PS: CCing some others for possible interest or additional insight (like, is this possible? LOL; and would it help them?):
@rdeits @MikeInnes @tkf

ffevotte · November 19, 2020, 8:28pm

It seems to me that these disadvantages only apply (as you noted) specifically to @> and @>>, but not really to the third threading macro introduced by Lazy.jl: the @as macro. For example:

julia> using Lazy: @as
julia> using DataFrames

julia> df = DataFrame(size=10*rand(10), weight=100*rand(10), color=rand([:red, :green], 10), shape=rand([:square, :circle], 10));

julia> @as _it_  df  begin
           filter(r->r.size>1, _it_)
           groupby(_it_, [:color, :shape])
           combine(_it_, :weight => sum => :total_weight)
       end
4×3 DataFrame
 Row │ color   shape   total_weight 
     │ Symbol  Symbol  Float64      
─────┼──────────────────────────────
   1 │ red     circle      191.114
   2 │ red     square        8.4176
   3 │ green   circle      154.489
   4 │ green   square      146.474

xiaodai · November 20, 2020, 3:23am

Looks like a good candidate. Release it as a package soon.

personally, for now I just use Lazy.jl’s @>

tk3369 · November 20, 2020, 4:45am

This is exactly what I’ve been waiting for. Thanks for working this out!

jules · November 20, 2020, 12:19pm

In my latest revision I’m doing that (an omitted _ defaults to the first argument), as I agree that it saves quite a bit of visual clutter in the most common cases.

I’ve also changed the rule with the double underscore, because I think it didn’t communicate well that there was a special line in the pipeline that shouldn’t affect the result.

Now there is a @! fake-macro that, when prefacing a line, disables the automatic first-argument insertion and also takes the line out of the pipeline. The _ still works. So one can write the earlier example like this; note the inserted display command. I find things like that very useful for debugging, but your mileage may vary.

result = @_ df begin
    filter(r -> r.trial_type === "looped-audio-image-form", _)
    select([:subject, :checkboxes])
    groupby([:subject, :checkboxes])
    combine(:checkboxes => ByRow(extract_checkboxes) => AsTable)
    @! display(first(_, 5)) # have a peek at the first 5 rows here for debugging
    select(Not(:checkboxes))
    flatten([:group, :image, :checked])
    groupby([:group, :image])
    combine(:checked => sum)
    sort([:group, order(:checked_sum, rev = true)])
    groupby(:group)
end

The nice side effect is that one is forced to @! mark every line where one is not using the replacement mechanism, either implicit first argument or explicit with _. That means there is less danger of sticking a line in there that influences the result unexpectedly.

piever · November 20, 2020, 12:59pm

I was sure I had seen something like this already, and actually Lazy.jl had the @_ macro. It looks like it was introduced here. I’m not completely sure why it was removed (maybe it had issues with the all-underscore identifier used as rvalue deprecation), but I’m glad it came back in PipelessPipes.

jules · November 20, 2020, 3:13pm

Registered it now, and updated the first post to reflect that

CameronBieganek · November 20, 2020, 3:47pm

Nice! With the addition of implicit-first-argument piping, I think this implements all the features that I want in a piping macro. I argued for something similar in DataFramesMeta.

This is slightly off-topic, but when I do use Pipe.jl or Hose.jl, I like to write the pipes like this:

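using DataFrames
using DataFramesMeta  # provides @transform, @where, @by, @orderby, @select
using Statistics      # provides mean
using Hose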

df = DataFrame(a = repeat(1:5, outer = 20),
               b = repeat(["a", "b", "c", "d"], inner = 25),
               x = repeat(1:20, inner = 5))

out = @hose (
    df
    |> @transform(y = 10 * :x)
    |> @where(:a .> 2)
    |> @by(:b, meanX = mean(:x), meanY = mean(:y))
    |> @orderby(:meanX)
    |> @select(:meanX, :meanY, var = :b)
)

There is less noise from the piping operators when they are all lined up.

tbeason · November 20, 2020, 4:03pm

Absurd how few lines of code are necessary to make a package like this! Was surprised when I looked at the source.

CameronBieganek · November 20, 2020, 4:30pm

I guess I’m a little late with this comment, since you’ve already registered the package, but @chain is available. If you named the package Chain.jl, then we would have Pipe.jl, Hose.jl, and Chain.jl.

jules · November 20, 2020, 5:19pm

Yeah no kidding! A better programmer than me would probably need even less, but I tend towards legible verbosity

jules · November 20, 2020, 5:19pm

That’s also a good idea, technically I still have 3 days to change it

CameronBieganek · November 21, 2020, 3:15pm

PipelessPipes is not a bad name, but I think Chain is more pithy. Also, @chain is more descriptive than @_. I vote for changing the name to Chain.jl.

jules · November 21, 2020, 4:06pm

Yeah I actually agree with that, I think @chain is not too bad to type either, I couldn’t think of something similarly short at the time. I’ll see if I can change it

xiaodai · November 23, 2020, 3:04am

Personally, I would just use Chain.jl once all the bugs are fixed.

jules · November 23, 2020, 4:10am

The presumptive bugs or all the bugs you found already?
