[ANN] DataPipes.jl

Yet another take on piping in Julia.

Even with multiple existing implementations of the piping concept, I didn’t find one that is convenient enough [for my use cases] while still remaining general. Additionally, I became curious about how metaprogramming works, and developing DataPipes helped me understand a lot of it.

The DataPipes interface is designed with common data processing functions (e.g., map and filter) in mind, but is not specifically tied to them and can be used for all kinds of pipelines. This package is extensively tested, and I almost always use it myself for data manipulation.

Unlike many (1, 2, 3; all?) other alternatives, DataPipes:

  • Gets rid of basically all the boilerplate for functions that follow the common Julia argument order
  • Can be plugged in as a step of a vanilla pipeline
  • Can define a function instead of immediately applying it
  • Can easily export the result of an intermediate step

If I missed another implementation that also ticks these points, please let me know.

DataPipes tries to minimally modify regular Julia syntax and aims to stay composable both with other instruments (e.g. vanilla pipelines) and with itself (nested pipes).

See usage examples in README.
DataPipes is submitted to the General registry, and meanwhile can be installed from https://gitlab.com/aplavin/DataPipes.jl.

18 Likes

This @export bit is cool. I wonder if this can be added to @chain.

I’m a bit wary of the non-standard evaluation of map. Seems like it would be clearer to the reader if the non-standard map was called @map.

1 Like

What exactly do you mean by non-standard evaluation of map? map function is not treated specially in any way.

Ah, I see. The 2nd argument is inserted instead of the first by default, which I guess makes sense when working with Base Julia rather than a data ecosystem.

This is an interesting package!

Well, it’s not just the 2nd argument: e.g. @p 1:5 |> sort(by=_ % 2) and other examples in README.
And I’m not sure what you refer to as “a data ecosystem”. README shows examples with the SplitApplyCombine package - this is [to my knowledge] the most popular one with functions like group and join that work for arbitrary data structures.

As I said, all functions with the common Julia argument order work just as conveniently as Base ones. Others can also be called, of course, but require manually specifying where to put the output of the previous step.
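For readers who have not seen this convention, here is a rough plain-Julia sketch of what the sort pipe above appears to correspond to, going only by the description in this thread. It does not use DataPipes itself, the variable names are made up for illustration, and the exact expansion performed by the macro is an assumption.

```julia
# Assumed plain-Julia reading of `@p 1:5 |> sort(by=_ % 2)`:
# `_ % 2` becomes a one-argument lambda, and the output of the previous
# step is passed as the positional data argument of `sort`.
by_parity = sort(collect(1:5); by = x -> x % 2)

# The same step in a vanilla pipeline needs the wrapper lambda spelled out.
by_parity_vanilla = 1:5 |> r -> sort(collect(r); by = x -> x % 2)
```

The boilerplate that the vanilla version needs (`r -> ...` and `x -> ...`) is what the placeholder syntax is meant to remove.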

In DataFrames, and DataFramesMeta, among others, the data frame is the first argument for select, subset, transform, rename etc. so it’s convenient for @chain, for example, to have the first argument be piped.

1 Like

Ah, the DataFrames ecosystem - now I see. Yes, that’s what I guessed from the Chain.jl readme. You are right, I didn’t focus on them here.
I personally stopped using dataframes some time after moving to Julia. It feels so liberating not to be restricted to a flat 2D structure, while still getting great performance and having all generic functions continue to work. [unlike in Python]

2 Likes

This is quite cool. I like the syntax for nested pipes.

What exactly does double underscore do? I’m having a hard time understanding that.

Double underscore is the second argument of a lambda: just as 2*_ expands to x -> 2*x, _ + __ expands to (x, y) -> x + y.

For now, names of functions that always accept two-argument lambdas are explicitly listed in the code: src/pipe.jl · master · Alexander Plavin / DataPipes.jl · GitLab. This ensures that the lambda is always made two-argument when needed: in such places, 2*_ is converted to (x, y) -> 2*x. I don’t know if there are other possible ways to achieve this.
Still, lambdas that actually use their 2nd argument (such as _ + __ or just __) work everywhere.
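As a small illustration, these are hand-written anonymous functions matching the expansions described above. This is plain Julia without DataPipes; the names `double`, `add` and `double_first` are only for this sketch.

```julia
# Hand-written equivalents of the placeholder forms mentioned above.
double = x -> 2 * x              # what 2*_ stands for
add = (x, y) -> x + y            # what _ + __ stands for
double_first = (x, y) -> 2 * x   # 2*_ where a two-argument lambda is required

doubled = map(double, [1, 2, 3])
summed = map(add, [1, 2, 3], [10, 20, 30])
doubled_first = map(double_first, [1, 2, 3], [10, 20, 30])
```

The last form ignores its second argument, which is why the macro has to know in advance which functions expect a two-argument lambda.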

2 Likes

Actually, just fixed handling of multivariate lambdas :slight_smile: Before, stuff like map(_ + __, a, b) didn’t work, and now it does!

2 Likes

A bit off topic… this is the first package I have seen that is on GitLab instead of GitHub… from the Julia registry and Pkg perspective, does it change anything? The registration/tag bots? The add/update package commands?

I cannot compare to GitHub because all my public Julia packages (https://juliahub.com/ui/Packages?q=aplavin) are hosted on GitLab - see Using Opensource hosting repo instead of GitHub - #14 by aplavin for motivation.
Registrator works just fine through the JuliaHub web interface. Adding/deving/updating also works, of course.
There was a mention of a tagbot for gitlab at Using Opensource hosting repo instead of GitHub - #10 by oxinabox (by @oxinabox), but I don’t use it.

4 Likes

Looks very nice at first sight. I especially like the fact that it is possible to load the package without the abbreviated macros. Very neat trick with the modules :ok_hand::ok_hand:

this is the pipe package i’ve been waiting for: simple, elegant, consistent, and easy to understand. it is a good, well-thought-out implementation. i hope more people will embrace horizontal programming, or one-liner expressions that split one big task into smaller tasks in a pipeline. it is natural to read expressions horizontally, not vertically. it also saves vertical space :wink: and encourages defining smaller functions. great job!

3 Likes

Thanks for the kind words (:
As for single- and multi-line pipelines, I regularly use both.

Single line for something short and simple, such as
@p table |> filter(_.c.d > 5) |> map(_.a.b) |> sum
or even

plot(
    x=@p(table |> map(distance(_.a.coords, _.b.coords))),
    y=@p(table |> map(_.value)),
)

Here it’s not shorter than the regular map-lambda syntax, but it is easier to extend. Also, I find the pipe style nicer to read.

And multiline pipelines for longer data processing tasks with a number of steps. If such processing is reasonably self-contained, it is most convenient to write as a pipeline instead of using intermediate variables.
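To make the single-line pipeline above concrete, here is a minimal Base-Julia sketch of the same computation. The table is hypothetical (a vector of nested NamedTuples invented for this example, not data from the post), and the code does not use DataPipes.

```julia
# Hypothetical table: a vector of nested NamedTuples (not from the original post).
table = [
    (a = (b = 1,), c = (d = 3,)),
    (a = (b = 2,), c = (d = 7,)),
    (a = (b = 4,), c = (d = 9,)),
]

# Base-Julia counterpart of: @p table |> filter(_.c.d > 5) |> map(_.a.b) |> sum
kept = filter(r -> r.c.d > 5, table)   # keep rows whose c.d exceeds 5
total = sum(map(r -> r.a.b, kept))     # extract a.b from each kept row and add up
```

Each pipe step maps onto one Base call with an explicit lambda, so adding a step to the pipeline means adding one more call here.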

2 Likes

the Readme and the examples are easy to digest and straightforward. it is faster to read examples than wordy descriptions, because we learn by patterns, and the best way to show how general a feature is, is to provide different examples. If you have other nice examples that show the power of datapipes to simplify workflows, especially verbose ones, please add more. the Readme is enjoyable to read and, being self-contained, can serve as a reference by itself. exporting variables to the outside is a nice hack and can be useful. creating a temporary variable, the uparrow and the double underscore are also nice, very elegant ideas. more power to your package, and maybe contribute more nice packages :wink:

3 Likes

The registration PR to General got merged, so DataPipes.jl can now be installed with just ]add DataPipes!
Btw, if anyone starts using it, and can see a way to make pipes even more convenient for common data processing tasks - without losing generality - please let me know.

8 Likes

What is the equivalent of

julia> @chain 1 begin rand(1:1000) + _; (_,_) end
(351, 351)

I tried

julia> @p 1 |> rand(1:1000) + _ |> (_,_)
ERROR: syntax: all-underscore identifier used as rvalue around

Not sure if this is the equivalent:

julia> @p 1 |> map(rand(1:1000) + _) |> map((_,_))
(355, 355)

@aplavin, could you please provide a minimum working example for the table filtering and plotting syntax you used? Thank you.