[ANN] DataPipes.jl 0.3.0

I think those are a bit different. Doing this with Char, the @p version creates an array.


julia> @chain 1 begin @show(rand('a':'z')) + _; (_,_) end
rand('a':'z') = 'm'
('n', 'n')

julia> @p 1 |> map(@show(rand('a':'z')) + _) |> map((_,_))
rand('a':'z') = 'w'
0-dimensional Array{Tuple{Char,Char},0}:
('x', 'x')

I don’t really understand what this pipeline is supposed to do, but I guess you are trying to refer to the previous result with _ here. That’s not how DataPipes interprets it: _ is treated as the argument of an anonymous function. E.g.,

@p begin
    1:10
    map(2*_)
end

is equivalent to

a = 1:10
b = map(x -> 2*x, a)

If you do need to explicitly refer to the previous result, use ↑. Your example should be written as @p 1 |> (rand(1:1000) + ↑) |> (↑, ↑) or as @p 1 |> +rand(1:1000) |> (↑, ↑).
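Hand-desugared, that pipeline is roughly the following plain Julia (a sketch with made-up step names; DataPipes’ actual expansion may differ in details):

```julia
# Each step's result feeds the next; ↑ stands for the previous value.
step1 = 1
step2 = rand(1:1000) + step1   # ↑ is step1 here
step3 = (step2, step2)         # both ↑ refer to step2
```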

As for @rafael.guerra’s map approach, it creates an array simply because `map(x -> x, 'a')` in Julia creates an array.
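This Base behavior is easy to check on its own: a Char iterates as a single element, so map over it collects into a 0-dimensional array (matching the output shown above):

```julia
# map over a scalar Char yields a 0-dimensional Array
r = map(x -> x, 'a')
ndims(r)  # 0
r[]       # 'a'
```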

1 Like

Well, there is little more to it than what I’ve shown in the previous post.
Here is a very simple example for a simple table - array of flat namedtuple rows:

using DataPipes, PyPlot

# if table is in CSV:
# using CSV, Tables
# table = CSV.read("file.csv", rowtable)

# example table:
table = [(; a=rand(), b) for b in 1:50]

table_processed = @p begin
    table
    filter(_.b > 10)
    map((; _.b, myvalue=_.a + _.b))
    sort(by=_.b, rev=true)
    # ...
end

plt.scatter(
    x=@p(table_processed |> map(2*_.b)),
    y=@p(table_processed |> map(_.myvalue)),
)
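For reference, here is what the @p pipeline above does when hand-desugared into plain Base calls (a sketch; DataPipes’ actual expansion may differ in details):

```julia
# same example table as above
table = [(; a=rand(), b) for b in 1:50]

# each pipeline step becomes an explicit call with a lambda
t1 = filter(r -> r.b > 10, table)
t2 = map(r -> (; r.b, myvalue = r.a + r.b), t1)
table_processed = sort(t2; by = r -> r.b, rev = true)
```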

Of course, “not-flat” rows and arbitrarily nested structures are supported - everything just uses regular Julia functions here, no magic (:

Are you interested in some specific steps here?

1 Like

IMO, this could very well find a place in the first line of the Readme section. :wink:

Thanks!

2 Likes

@aplavin, thanks for the brilliant example.

Could you shed some light on the multi-dimensional structures you refer to when you say:

And if possible provide a DataPipes example.

Thank you.

Referring to previous results is already explained in the README - see the “advanced” section. This is not something often needed in [my] data processing pipelines, so I decided to keep explanation outside the “basic” section.

1 Like

My example above used a flat table - an array of namedtuples. This was done just for simplicity, however.

Working with nested structures looks basically the same. Here is another example with longer, more intuitive field names for clarity:

using DataPipes, PyPlot

data = [
    (
        id=rand(Int16),
        prev=(time=rand(), value=rand()),
        next=(time=rand(), value=rand())
    )
    for _ in 1:50
]

data_processed = @p begin
    data
    filter(_.next.value > _.prev.value)
    map((; _.id, change=(time=_.next.time - _.prev.time, value=_.next.value - _.prev.value)))
end

plt.scatter(
    x=@p(data_processed |> map(abs(_.change.time))),
    y=@p(data_processed |> map(_.change.value)),
)

Of course, one can use arbitrary structs instead of plain namedtuples.
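Hand-desugared into plain Base calls, the nested pipeline above looks roughly like this (a sketch, not DataPipes’ literal expansion):

```julia
data = [
    (id = rand(Int16),
     prev = (time = rand(), value = rand()),
     next = (time = rand(), value = rand()))
    for _ in 1:50
]

# keep only rows where the value increased
kept = filter(r -> r.next.value > r.prev.value, data)

# build nested result rows with a do-block lambda
data_processed = map(kept) do r
    (; r.id, change = (time = r.next.time - r.prev.time,
                       value = r.next.value - r.prev.value))
end
```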

More advanced operations like group/join/… on arbitrary datasets are implemented in the SplitApplyCombine package - performant and easy to use (credit to @andyferris). They conveniently work within DataPipes pipelines with no extra glue on my part.

4 Likes

Interesting package. It seems the design is rather similar to my package Underscores.jl:

https://c42f.github.io/Underscores.jl/stable/

The main differences being

  1. Underscores.jl supports _1, _2 etc (or _₁, _₂ if you like unicode) for multiple placeholders, rather than _, __, etc
  2. You implicitly insert the lambda argument, called ↑ in your code (this placeholder is __ in Underscores.jl). Underscores doesn’t implicitly insert these, which makes it usable outside of piping constructs. If we could figure out a good general way to implicitly insert them, that would be nice.
  3. You’ve got a version which uses begin-end syntax for larger piping constructs. (I’ve generally found that multi-line |> is pretty ok for this.)

I think the package goals are similar — I wanted it to be usable with SplitApplyCombine-like data analysis.

5 Likes

Thanks for pointing it out: Underscores.jl must have slipped my attention. This is the sole reason it’s not mentioned in the 1st post along with the other _ alternatives.

I think you are right, and those are the main differences in the “core” syntax.
There are other features in DataPipes on top of the “core”. E.g., plugging into the middle of a vanilla pipeline without changing anything else: from 1:5 |> sum to 1:5 |> @f(filter(_ > 3) |> map(_ ^ 2)) |> sum; the @export macro to assign to another variable; do-syntax support; nesting pipes within pipes.

The different design decisions likely stem from focusing on different use cases. From this perspective, it’s good that Julia didn’t get a built-in _ or _-like syntax and let users explore different approaches. Writing @p or @_ adds almost no visual overhead anyway.

For example, I use ↑ to explicitly refer to the previous step’s result because this is rarely needed in my experience - so the more convenient __ syntax is used for the more common operation. Support for the begin-end and do syntaxes stems from writing longer or more complex data pipelines: I still find them more convenient than an explicit variable assignment for each step. And so on. Clearly, the most common operations and the corresponding trade-offs can differ for others.

I’d like to insert the previous result after all ::Function arguments and before everything else. This would correspond even better to the argument order of Julia functions. However, this doesn’t seem possible with a pure syntax transformation: map(length) should become map(length, ↑), but searchsorted(value) should become searchsorted(↑, value). So, currently DataPipes puts the previous result as the last positional argument unless ↑ is used.
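The two Base argument-order conventions in question can be seen directly:

```julia
xs = [[1], [2, 3], [4, 5, 6]]
map(length, xs)                # function first, data last → [1, 2, 3]
searchsorted([1, 2, 4, 7], 4)  # data first, query second → 3:3
```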

2 Likes

I guess beauty is in the eye of the beholder in this case :slight_smile: I think I prefer numbered placeholders (with _ a synonym for _1).

Certainly it’s a matter of taste, but to borrow one of your examples, I find keeping the pipes to be pretty readable:

@_ [1, 2, 3, 4]                  |>
   map((a=_, b=_^2, c=1:_), __)  |>
   filter(length(_.c) >= 2, __)  |>
   map((; _.a, s=sum(_.c)), __)

barring the presence of __ everywhere of course, which is really rather annoying when used in a piping context.
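For comparison, the same pipeline in plain Base, with explicit lambdas and intermediate names instead of placeholders:

```julia
xs = [1, 2, 3, 4]
s1 = map(x -> (a = x, b = x^2, c = 1:x), xs)
s2 = filter(r -> length(r.c) >= 2, s1)       # drops x = 1
s3 = map(r -> (; r.a, s = sum(r.c)), s2)
# s3 == [(a = 2, s = 3), (a = 3, s = 6), (a = 4, s = 10)]
```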

Yes this is tricky and there seems to be no general solution. We discussed a few options in Feature Request: if no `__` is specified then insert into first argument · Issue #17 · c42f/Underscores.jl · GitHub

For piping with eager evaluation it makes sense to always implicitly insert, and to have a way to override the location where it’s inserted. For functions which already return closures or callable objects (eg, Transducers) you want the opposite — no insertion unless asked for. I guess it might be kind-of-possible to have a traits system to infer where you want the piped value inserted. (This would be a per-function rather than per-method decision, but that might be good enough.)

3 Likes

Yes, that’s totally true. Until you want to comment out/remove one of the steps, or use the do-syntax with longer function bodies:

@p begin
   [1, 2, 3, 4]
   map() do _
   ...
   end
   filter(length(_.c) >= 2)
   # map((; _.a, s=sum(_.c)))
end

Transducers.jl seems to have a reasonably convenient piping syntax by itself. But I haven’t really used that package…

Yes, that’s definitely a possibility. I wouldn’t just call it “per-function” in the myfunc(::typeof(map)) sense, but “per-function-name” as in myfunc(::Val{:map}). All such traits would need to be declared in a single package to avoid piracy.

If you do it per-name in a single package, there’s no need for a trait — the list can just be a Set of names.

I did actually mean per-function though, not per name. I think this could be achieved by interplay of syntax expansion and trait, so that f(_^2) is expanded into something like

if collection_in_last_arg(f)
    collection->f(y->y^2, collection)
else
    collection->f(collection, y->y^2)
end

Then rely on type inference + dead code elimination to remove one of the branches and make this efficient.
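That trait idea can be sketched concretely in plain Julia; the trait name comes from the snippet above, while `pipeline_step` and the specific trait definitions are made up for illustration:

```julia
# Hypothetical trait: does `f` take the data collection as its LAST
# positional argument (after the function), or as the first one?
collection_in_last_arg(::typeof(map))    = true
collection_in_last_arg(::typeof(filter)) = true
collection_in_last_arg(f)                = false  # conservative default

# Expansion of `f(g)` into a one-argument closure, branching on the trait.
# Type inference + dead code elimination should remove the dead branch.
pipeline_step(f, g) = collection_in_last_arg(f) ?
    (coll -> f(g, coll)) :
    (coll -> f(coll, g))

pipeline_step(map, y -> y^2)([1, 2, 3])  # [1, 4, 9]
```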

Whether this is a good idea or not I’m not sure. It can be made to “magically do what people expect” more of the time, but for functions without the trait defined it may use the wrong function argument ordering and be extra confusing for users.

Now I see what you meant with those traits. Agree with:

Purely syntactic transformations are more intuitive and easier to debug when something goes wrong. I thought about requiring all inner function calls to have an explicit underscore: map(_ + 1) and map(isnothing(_)), but not map(isnothing). Then it would be possible to put the implicit argument right after these functions based on syntax alone, following the Julia argument order. But I decided that allowing map(isnothing) is more important than removing the need for an explicit ↑ in those rare cases when it should go to an argument other than the last.

1 Like

With so many pipe choices:

  • DataPipes.jl
  • Pipe.jl
  • Chain.jl
  • Hose.jl
  • Transducers.jl
  • Underscores.jl

This plumber is in trouble:

NB: obviously, it is better to have more choices than none, but some collaboration among the above authors could be beneficial too.

8 Likes

You’re not wrong :laughing:

Despite the apparent simplicity of _ placeholder syntax, the design of a convenient and general but also simple meaning for _ has resisted all our efforts. (At this point there’s been a ridiculous amount of effort put into exploring that design space.) So now we’ve got a bunch of special purpose packages which make various tradeoffs but are mostly centered around piping.

Underscores.jl is my attempt at a general _ placeholder syntax which happens to work with pipes rather than a special purpose piping syntax package.

4 Likes

(:
At least some of the packages you listed are focused on specific use cases, so it shouldn’t be very difficult to choose among them. For example, Chain.jl clearly aims at dataframe operations with their calling conventions; DataPipes.jl focuses on data processing with common functions following the Julia convention; Underscores.jl - see Chris’s post.

Nevertheless, there is clearly lots of overlap. Something like @p sum(_ * 2, 1:10) would look basically the same with any approach.
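In plain Base, that example is just:

```julia
# multiply each element by 2 and sum: 2 * (1 + 2 + ... + 10)
sum(x -> x * 2, 1:10)  # 110
```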

It doesn’t seem easily avoidable indeed… For example, Chain and DataPipes cannot follow the same conventions without adding boilerplate, because typical calls of dataframe functions are significantly different from other functions.

3 Likes

Agreed. DataFrames is very clever and convenient with its domain specific column naming syntax. It’s just unfortunate that the syntax doesn’t mesh well with any reasonably general meaning of _.

The situation is kind of unsatisfactory but it’s unclear how to proceed without a good language syntax option for placeholders. A lightweight way to delimit the extent of the lambda helps but there hasn’t been a lot of support for that, nor a really nice syntax candidate. (The delimiter could either go outside the function which accepts the lambda (as in Underscores.jl and DataPipes) or on the lambda itself, a bit like Swift’s shorthand argument names or key paths.)

Interesting comparison with Swift!

Their let articleIDs = articles.map { $0.id } is indeed a nice and concise lambda syntax. Does it nest?

Regarding key paths: maybe I missed something, but they seem much less general than arbitrary functions. As I understand, their closest equivalent in Julia are the lenses in Setfield.jl/Accessors.jl, which have even more advanced functionality.

I guess you mean that a good general meaning of _ would swallow the => operators as part of the lambda function body? I more or less disagree.

  • For a standalone _ (without another piece of syntax to delimit the scope of the lambda) I think the “tight” meaning implemented in #24990 is the best: it’s far from covering all use case but it does have general usefulness, and it’s super clear and readable.
  • For more complex lambda bodies, a delimiter like @_ is required.

We can have both, and they both work nicely with DataFrames :slight_smile:

# Would work if #24990 is merged
transform(df, :a => _.^2 => :a_square)

# Already works with Underscores.jl
transform(df, :a => @_ exp.(_.^2) => :exp_a_square)

and btw both are pretty verbose (:
E.g.,

transform(df, :a => @_ exp.(_.^2) => :exp_a_square, :a => @_ _.^2 => :a_square)

for a dataframe vs

@p mutate(exp_a_square=exp(_.a^2), a_square=_.a^2, df)

with DataPipes for other tables.
Not even talking about multicolumn functions:

@p mutate(comp_val=_.a > _.b ? _.a^2 : _.a, df)

With Base only (no DataPipes) the first example becomes somewhat less pretty, but still shorter than for dataframes:

map(r -> (r..., exp_a_square=exp(r.a^2), a_square=r.a^2), df)
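That last line runs as-is when df is a plain row table (vector of NamedTuples); tiny made-up data for illustration:

```julia
# hypothetical two-row table standing in for `df`
df = [(a = 1.0, b = 2.0), (a = 3.0, b = 1.0)]

# splat each row and append the two derived columns
out = map(r -> (r..., exp_a_square = exp(r.a^2), a_square = r.a^2), df)
```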

2 Likes