[RFC] PipelessPipes.jl (now Chain.jl)

PipelessPipes.jl is not a package yet, as I would like to hear some opinions first. Here are my thoughts on Pipe.jl and Lazy.jl, both of which I have tried previously:

Pipe.jl

pros

  • _ syntax is nice and non-magic

cons

  • I don’t like typing |>
  • The last entry of a pipe can’t have a trailing |>, making it harder to comment out parts
  • If there’s an error in one part, the whole expression errors and VSCode can’t highlight the specific part
  • The end of the pipeline just “dangles” in terms of indentation, so the IDE always tries to indent the following lines as well
  • You can’t easily interject an arbitrary statement to check what’s going on inside the pipe
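
To illustrate that last point with Base's |> alone (a sketch, not Pipe.jl's macro): peeking at an intermediate value needs a helper that prints its argument and then returns it unchanged. The name `tap` is made up for this example.

```julia
# Hypothetical helper: print the intermediate value, then pass it on unchanged.
tap(x) = (println("intermediate: ", x); x)

# Each step must be a one-argument function, so the peek has to be
# wrapped as a pipeline stage of its own.
total = [1, 2, 3, 4] |>
    (v -> filter(isodd, v)) |>
    tap |>
    sum
```

Removing the peek again means deleting a whole pipeline stage rather than a standalone statement.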

Lazy.jl’s @>:

pros

  • begin / end is nice for indentation after
  • no |> necessary

cons

  • Not every function takes the piped value as its first argument, e.g. filter (nor necessarily as its last, which is what the other option, @>>, covers)
  • Looks a bit too magic for me when the first argument is not visible at all

PipelessPipes.jl

edit: As I’ve made some changes since the first post, and have registered those changes now, here’s the relevant excerpt from the readme

PipelessPipes defines the @_ macro. It takes a start value and a begin ... end block of expressions.

The result of each expression is fed into the next one using one of two rules:

  1. There is at least one underscore in the expression
     • every _ is replaced with the result of the previous expression
  2. There is no underscore
     • the result of the previous expression is used as the first argument in the current expression, as long as it is a function call or a symbol representing a function.

Lines that are prefaced with @! are executed, but their result is not fed into the next pipeline step.
This is very useful to inspect pipeline state during debugging, for example.

Example with a DataFrame:

using DataFrames, PipelessPipes

df = DataFrame(group = [1, 2, 1, 2], weight = [1, 3, 5, 7])

result = @_ df begin
    filter(r -> r.weight < 6, _)
    groupby(:group)
    @! println("There are $(length(_)) groups after step 2.")
    combine(:weight => sum => :total_weight)
end

The pipeless block is equivalent to this:

result = let
    var1 = filter(r -> r.weight < 6, df)
    var2 = groupby(var1, :group)
    println("There are $(length(var2)) groups after step 2.")
    var3 = combine(var2, :weight => sum => :total_weight)
end

For debugging, it’s often useful to look at values in the middle of a pipeline.
You can use the @! macro to mark expressions that should not pass on their result.
For these expressions there is no implicit first argument spliced in if there is no _, because that would be impractical for most purposes.
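
As a rough illustration of the two rewrite rules (a sketch on plain expressions, not the package's actual source), something like the following would do. The helper names `has_underscore`, `replace_underscores`, `rewrite_step` and `run_steps` are made up for this example, and keyword arguments after a semicolon are not handled.

```julia
# True if the symbol `_` occurs anywhere in the expression tree.
has_underscore(ex) = false
has_underscore(s::Symbol) = s === :_
has_underscore(ex::Expr) = any(has_underscore, ex.args)

# Replace every `_` in the expression tree with `prev`.
replace_underscores(ex, prev) = ex
replace_underscores(s::Symbol, prev) = s === :_ ? prev : s
replace_underscores(ex::Expr, prev) =
    Expr(ex.head, map(a -> replace_underscores(a, prev), ex.args)...)

# Apply the two rules to a single pipeline step.
function rewrite_step(ex, prev)
    if has_underscore(ex)
        # Rule 1: explicit underscores are replaced.
        return replace_underscores(ex, prev)
    elseif ex isa Symbol
        # Rule 2, bare function name: `sum` becomes `sum(prev)`.
        return Expr(:call, ex, prev)
    elseif ex isa Expr && ex.head === :call
        # Rule 2, function call: `prev` becomes the first argument.
        return Expr(:call, ex.args[1], prev, ex.args[2:end]...)
    else
        error("cannot insert the previous result into: $ex")
    end
end

# Evaluate the steps one after another, feeding each result into the next.
# (A real macro would bind intermediate variables instead of calling eval.)
run_steps(start, steps) =
    foldl((prev, step) -> eval(rewrite_step(step, prev)), steps; init = start)

rewrite_step(:(filter(isodd, _)), :var1)  # :(filter(isodd, var1))
rewrite_step(:(push!(4)), :var1)          # :(push!(var1, 4))
rewrite_step(:sum, :var1)                 # :(sum(var1))
run_steps([1, 2, 3, 4, 5], [:(filter(isodd, _)), :sum])  # 9
```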

For the error message point, look at the difference to Pipe.jl here from some random code of mine:

PipelessPipes.jl:

[screenshot not preserved]

Pipe.jl:

[screenshot not preserved]

Design Questions

I wonder what the best name for the macro is, because all the good ones are already taken by other packages. I like @_ because it’s really short, although it would conflict with Underscores.jl, if one wanted to use both.

The other question is whether this ruleset has any problems that are obvious to you. I haven’t found serious issues so far, but maybe I’ve overlooked some circumstances under which this expression parsing would break.

It’s good to see more experimentation in the Piping space. Here are two thoughts

  1. It would be great if you could optionally omit the _ and have that default to the first argument, but override that if the user provides _ anywhere.
  2. Having a trailing |> can be annoying, but if you are copying and pasting to the REPL it’s easier to use |>, because otherwise you need to add an end, which is annoying.

It would be great if it came in a “block” form and a “magrittr” form.

My first thought was very much against this. I’ve reread it and now I kind of like it. I use Underscores.jl more and more often, and for things that aren’t necessarily data pipelines, so I’m not a fan of the conflicting symbols unless you are going to provide a fairly substantial replication of its behavior. How does your proposal fare outside of DataFrames usage?

Added to the big list

Julius (@jules), thank you very much for this RFC on enhancing the Linux/Unix-style pipe syntax, which is a very helpful foundation and a natural fit for distributed processing. In this iteration I hope we can elevate (or is that continue? See below **) the language syntax to a slightly higher abstraction layer using mathematical notation, while keeping the fast and efficient automatic vectorization machinery hidden and encapsulated: https://en.wikipedia.org/wiki/Automatic_vectorization

So I also request syntax support for function composition (computer science), described here:

"The ability to easily compose functions encourages factoring (breaking apart) functions for maintainability and code reuse. More generally, big systems might be built by composing whole programs."

And so here is another example in a similar proposal that seems to have come out of some previous RFC / design work :

It’s especially important (to provide an alternative to the current |> syntax) when you are using Julia for data analysis, where you commonly have data transformation pipelines.

In particular, Pandas in Python is convenient to use because you can write things like df.groupby("something").aggregate(sum).std().reset_index(), which is a nightmare to write with the current |> syntax.

@@ https://github.com/JuliaLang/julia/issues/5571#issuecomment-33437023

** (Or continue? See this:) If “My mental load using Julia is much higher than, e.g., in Python. How to reduce it?” is already implemented, as it appears to be, it would already address what I’m requesting here, per this statement:

I think what you’re talking about is “fluent interfaces”, and it’s true that we don’t really do that in Julia, at least not in the same way. You might find the |> operator useful, since you can do:


> x |> f |> g

as an alternative notation for g(f(x)) .
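
For reference, here is a small sketch of the notations Base Julia already offers for this. The functions f and g are placeholders made up for the example.

```julia
# Placeholder functions for illustration only.
f(x) = x + 1
g(x) = 2x

x = 3
nested   = g(f(x))      # ordinary nested call
piped    = x |> f |> g  # left-to-right piping
composed = (g ∘ f)(x)   # function composition, type ∘ as \circ<TAB>

# A composed function is an ordinary value that can be reused.
h = g ∘ f
mapped = map(h, [1, 2, 3])

# The broadcasting pipe applies each step elementwise.
elementwise = [1, 2, 3] .|> f .|> g
```

All three scalar forms give the same result here, and the last two lines produce the same vector.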

However, as @rdeits has previously noted, there is a “missing” piece of functionality: notably, in a fluent interface, there is no way to chain a function that isn’t a method of the returned object. So if sort() is a generic function (and not a method of whatever drawInContext returns), then you can’t do:

	line.mirror.drawIncontext(ctx).sort()

So if all of the above has been said and done and the g(f(x)) notation is in place, then for this RFC I’d like the proposed PipelessPipes.jl package design to be mostly about Julia compiler / LLVM optimizations, similar to https://github.com/JuliaFolds, that compile the new functional composition syntax and automatically evaluate whether and how the nested functions can be SIMD-vectorized **, parallelized, or threaded. As a starter for more background, consider reading about SIMD designs:

The hardware handles all alignment issues and “strip-mining” of loops. Machines with different vector sizes would be able to run the same code. Clang/LLVM (which Julia uses) calls this vector type “vscale”.

-and-

https://discourse.julialang.org/search?q=SIMD

Re: existing vectorize syntax ** (highly desirable to keep it compatible with the proposed new functional composition syntax)

HTH

PS: CCing some others for possible interest or additional insight (like, is this possible? LOL; and would it help them):
@rdeits @MikeInnes @tkf

It seems to me that these disadvantages only apply (as you noted) specifically to @> and @>>, but not really to the third threading macro introduced by Lazy.jl: the @as macro. For example:

julia> using Lazy: @as
julia> using DataFrames

julia> df = DataFrame(size=10*rand(10), weight=100*rand(10), color=rand([:red, :green], 10), shape=rand([:square, :circle], 10));

julia> @as _it_  df  begin
           filter(r->r.size>1, _it_)
           groupby(_it_, [:color, :shape])
           combine(_it_, :weight => sum => :total_weight)
       end
4×3 DataFrame
 Row │ color   shape   total_weight 
     │ Symbol  Symbol  Float64      
─────┼──────────────────────────────
   1 │ red     circle      191.114
   2 │ red     square        8.4176
   3 │ green   circle      154.489
   4 │ green   square      146.474

Looks like a good candidate. Release it as a package soon.

Personally, for now I just use Lazy.jl’s @>.

This is exactly what I’ve been waiting for. Thanks for working this out!

In my latest revision I’m doing that as I do agree that it saves quite some visual clutter for the most common cases.

I’ve also changed the rule with the double underscore, because I think it didn’t communicate well that this was a special line in the pipeline that shouldn’t affect the result.

Now there is a @! fake-macro that, when it prefaces a line, disables the auto-first-arg insertion and also takes the line out of the pipeline. The _ still works. So one can write the example above like this; note the inserted display command. I find things like that very useful for debugging, your mileage may vary :wink:

result = @_ df begin
    filter(r -> r.trial_type === "looped-audio-image-form", _)
    select([:subject, :checkboxes])
    groupby([:subject, :checkboxes])
    combine(:checkboxes => ByRow(extract_checkboxes) => AsTable)
    @! display(first(_, 5)) # have a peek at the first 5 rows here for debugging
    select(Not(:checkboxes))
    flatten([:group, :image, :checked])
    groupby([:group, :image])
    combine(:checked => sum)
    sort([:group, order(:checked_sum, rev = true)])
    groupby(:group)
end

The nice side effect is that one is forced to @! mark every line where one is not using the replacement mechanism, either implicit first argument or explicit with _. That means there is less danger of sticking a line in there that influences the result unexpectedly.

I was sure I had seen something like this already, and actually Lazy.jl had the @_ macro. It looks like it was introduced here. I’m not completely sure why it was removed (maybe it had issues with the all-underscore identifier used as rvalue deprecation), but I’m glad it came back in PipelessPipes.

Registered it now, and updated the first post to reflect that

Nice! With the addition of implicit-first-argument piping, I think this implements all the features that I want in a piping macro. I argued for something similar in DataFramesMeta.

This is slightly off-topic, but when I do use Pipe.jl or Hose.jl, I like to write the pipes like this:

using DataFrames, DataFramesMeta  # @transform, @where, @by, ... come from DataFramesMeta
using Hose
using Statistics: mean

df = DataFrame(a = repeat(1:5, outer = 20),
               b = repeat(["a", "b", "c", "d"], inner = 25),
               x = repeat(1:20, inner = 5))

out = @hose (
    df
    |> @transform(y = 10 * :x)
    |> @where(:a .> 2)
    |> @by(:b, meanX = mean(:x), meanY = mean(:y))
    |> @orderby(:meanX)
    |> @select(:meanX, :meanY, var = :b)
)

There is less noise from the piping operators when they are all lined up. :slightly_smiling_face:

Absurd how few lines of code are necessary to make a package like this! Was surprised when I looked at the source.

I guess I’m a little late with this comment, since you’ve already registered the package, but @chain is available. If you named the package Chain.jl, then we would have Pipe.jl, Hose.jl, and Chain.jl. :slight_smile:

Yeah no kidding! A better programmer than me would probably need even less, but I tend towards legible verbosity :wink:

That’s also a good idea, technically I still have 3 days to change it :man_shrugging:

PipelessPipes is not a bad name, but I think Chain is more pithy. Also, @chain is more descriptive than @_. I vote for changing the name to Chain.jl. :smiley:

Yeah I actually agree with that, I think @chain is not too bad to type either, I couldn’t think of something similarly short at the time. I’ll see if I can change it

Personally, I would just use Chain.jl once all the bugs are fixed.

The presumptive bugs or all the bugs you found already? :beetle::bug::ant:
