It’s worth noting that this exact same DataPipes.mutate syntax works with Underscores.@_ (and has since shortly after it was released), which is further evidence that these packages are very similar in design. If we could agree on a general solution for implicit __ vs ↑ and the naming of arguments, these packages might be able to join together in some way.
julia> using TypedTables
julia> t = Table(a=[1,2,3], b=[4,5,6]);
julia> @_ mutate(exp_a_square=exp(_.a^2), a_square=_.a^2, t)
Table with 4 columns and 3 rows:
     a  b  exp_a_square  a_square
   ┌─────────────────────────────
 1 │ 1  4  2.71828       1
 2 │ 2  5  54.5982       4
 3 │ 3  6  8103.08       9
julia> @_ mutate(comp_val=_.a > _.b ? _.a^2 : _.a, t)
Table with 3 columns and 3 rows:
     a  b  comp_val
   ┌───────────────
 1 │ 1  4  1
 2 │ 2  5  2
 3 │ 3  6  3
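For the curious, here is a rough sketch of what such a mutate could look like. This is only my illustration, assuming rows iterate as NamedTuples (as TypedTables rows do); the actual, undocumented DataPipes definition may well differ.

using TypedTables

# Hypothetical mutate sketch, not the actual DataPipes definition.
# Each keyword is a function of the row; its result is merged in as a new column.
function mutate(t; kwargs...)
    map(t) do row
        extra = map(f -> f(row), (; kwargs...))  # NamedTuple of new column values
        merge(row, extra)                        # assumes `row` is a NamedTuple
    end
end

t = Table(a=[1, 2, 3], b=[4, 5, 6])
mutate(t, a_square = r -> r.a^2)  # adds an a_square column, as in the @_ example above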
mutate seems pretty handy. Perhaps something like it could go into SplitApplyCombine?
mutate seems straight from dplyr; if a package is going to go that route I think it would be good to look more comprehensively at the dplyr API so Julia’s version can get a cohesive look-and-feel.
Totally agree that there is a significant overlap, and short single-step examples work completely the same in these packages (maybe even in Chain.jl).
I myself don’t see a general, still boilerplate-free interpretation of the differences, but would be very curious to know if there is one. DataPipes clearly implements a less general approach, but is more convenient for piped data analysis (hence the name (: ). Also, there are pipe-related features on top, like the @export macro, and I also plan to add an @aside macro like in Chain.jl. They don’t seem to fit a really general _-package like Underscores.jl, but I may be mistaken here.
Currently, this function (and some other short ones) is defined in DataPipes, but not documented. The reason is that I don’t know where they are best put, and they may be changed/removed at any time. Maybe you are right and SAC.jl is the right place for them to go…
I agree with Chris that specific functions like mutate should really be out of scope for DataPipes and similar packages. I don’t know or use dplyr myself, but it would be interesting to see someone attempt to implement a similar interface in Julia, if there is none yet. The currently available functions (Base, SAC.jl, …) may be less “cohesive”, but they are more general than dplyr and dataframes.
That’s true; there are some things which only apply to pipes but are super handy, such as assigning variables to partial results within the pipeline.
Underscores.jl actually does have a small accommodation for |> syntax (also ∘, <|, .|> and .<|), but only in the sense that it recurses into such expressions and applies the same _ replacement rules inside them, rather than treating them as normal call expressions.
That’s a clever and general solution!
Indeed, having a pure syntax indication such as __ in the first pipe step helps distinguish between a function definition and application.
julia> data = 5:12
5:12
julia> @_ data |>
           filter(_>10) |>
           map(_^2)
2-element Vector{Int64}:
 121
 144
This is type piracy, of course! But this version of Base.filter isn’t defined, and though Base.map(f::Function) is defined, the existing definition (mapping over zero collections) seems pretty useless.
Yes Transducers.jl is really cool for many reasons.
But sometimes you just want to do some quick data processing without any extra dependencies, which is why I kind of wish we had versions of normal map() and filter() as above.
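For reference, here is roughly what those pirated curried methods could look like. This is my own sketch (Underscores.jl doesn’t define them), suitable only for quick interactive sessions:

using Underscores

# Curried filter/map returning closures, so that |> can apply them one by one.
# Type piracy on Base, as noted above; the map definition shadows Base's
# existing zero-collection method.
Base.filter(f::Function) = xs -> filter(f, xs)
Base.map(f::Function) = xs -> map(f, xs)

@_ 5:12 |> filter(_ > 10) |> map(_ ^ 2)   # == [121, 144]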
Out of habit, I find it easier to write/read code where the functions are embedded (#2) rather than in the form of #1.
But I like the ability to shorten the syntax.
The first form I tried was #5, which doesn’t work. Then I found the other forms that give the expected result, but I’m not sure I understand why #5 doesn’t work and #4 does.
But maybe there is an even more correct way to get what I was looking for.
OK. I have the same version (downloaded yesterday) and now I get the same results as you. I don’t know what to think.
I apologize for the wrong report.
I had initially tried with the expression
@p filter(exp(_) > 5, map(_^2, 1:4))
which I then corrected into form #5, and since it seemed to me (perhaps I confused it with the initial form) that this didn’t work, I tried first #3 and then #4.
Could you in this case use the nested form with the placeholder _1?
PS: is it possible to retrieve the log of the outputs (I have the one of the inputs) from yesterday’s session in the VS Code environment?
The _1 placeholder has very different semantics now; I don’t think it is possible to combine them somehow.
And explicit delimiters (like @p) are necessary anyway with the nested function style you use. Otherwise the meaning would be ambiguous.
Thanks for taking the time.
Maybe I’m still missing something to understand how @p works.
If I did a brute-force substitution of f(_) with x -> f(x), then the naive form I used first would work again.
Even using the same variable name, since the scope is different:
filter(x -> exp(x) > 5, map(x -> x^2, 1:4))
Instead, from the following two tests it seems that @p acts, in the case of nested functions, only at the top level.
Unfortunately, the naive replacement you suggest doesn’t really work well in general. The main issue is to determine function boundaries: what does a(b(c(_), _), d(e(f(_) + _))) get converted to?
That’s why DataPipes.jl takes the approach of converting only top-level function arguments (containing _) to anonymous functions.
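To make that concrete, here is my reading of the rule, as an illustration only (the exact expansion may differ in details from what the package actually generates):

using DataPipes

# Each top-level argument of the call that contains _ becomes one lambda:
@p map(exp(_) > 5, 1:4)                     # roughly map(x -> exp(x) > 5, 1:4)

# In a nested call like @p filter(exp(_) > 5, map(_^2, 1:4)) *both* arguments
# contain _, so both would become lambdas -- which is why the inner map needs
# an explicit lambda (or its own @p) instead:
@p filter(exp(_) > 5, map(x -> x^2, 1:4))   # == [4, 9, 16]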
In the meantime, DataPipes.jl has gained a significant new feature, and the placeholder style has changed to a more convenient one.
You can continue using the old syntax in 0.1.x versions, or switch to the 0.2 version with the updated placeholders. Future features will only appear in 0.2 and won’t be backported.
Improved docs
The README at Alexander Plavin / DataPipes.jl · GitLab is now shorter and doesn’t go into depth. More detailed documentation, together with a set of worked-out data processing tasks, is available as a Pluto notebook: see the HTML version.
New feature: @aside macro
Available in both 0.1 and 0.2 versions.
Perform a side computation without breaking the pipeline:
@p begin
    data
    @aside avg = mean(_.age)
    map((; _.name, _.age, above_average=_.age > avg))
end
Also plays nice with @export to export the variable to the outer scope:
@p begin
    data
    @aside @export avg = mean(_.age)
    map((; _.name, _.age, above_average=_.age > avg))
end
# avg is available here
The idea for the @aside macro is taken from Chain.jl.
New placeholder syntax
Old DataPipes@0.1:
_ - first/only lambda argument
__, ___, … - second, third, and further lambda arguments
↑ - result of the previous step
_1 - lambda argument _ of the outer pipe
New DataPipes@0.2:
_ - first/only lambda argument, same as before
_2, _3, … - second, third, and further lambda arguments (_1 also works, and is equivalent to _)
__ - result of the previous step
_ꜛ - lambda argument _ of the outer pipe (type the arrow with \^uparrow)
Motivation:
Referring to the previous step’s result turned out to be more common than I originally thought. The old ↑ symbol is more difficult to type than the new __, which matters for common operations. Also, ↑ parses as an operator name in Julia, which sometimes required putting it in parentheses.
_2 instead of __ takes about the same typing effort, and __ is already taken by the previous bullet.
_ꜛ is somewhat more difficult to type, but accessing the outer pipe argument is needed much less often in my experience, so a compromise here is acceptable. Still, _ꜛ parses as a regular name in Julia and doesn’t require extra parentheses, unlike the old ↑.
Suggestions for another placeholder to replace _ꜛ are welcome!
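To make the 0.2 placeholders concrete, here is a small combined example illustrating _, __, and _2 (my own illustration, not taken from the package docs):

using DataPipes

@p begin
    1:5
    map(_ ^ 2)      # _  : the lambda argument (each element)
    __[2:end]       # __ : the result of the previous step
    filter(_ > 5)
end
# == [9, 16, 25]

# _2 is the second lambda argument, e.g. when mapping over two collections:
@p map(_ + _2, 1:3, 10:12)   # == [11, 13, 15]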
Didn’t think I’d push another significant update so soon, but here it is (: Update: DataPipes@0.2.1 is registered in General.
I perform essentially all my data manipulation tasks with DataPipes and haven’t encountered many pain points with it. Still, there are a couple of common scenarios that can be made cleaner with less boilerplate. They mostly revolve around working with nested data, and I have now addressed some of them.
A common pattern is a lambda function consisting only of an inner “pipe” (@p), especially with the map function, like this simple parsing of a key-value string into a named tuple:
@p begin
    "a=1 b=2 c=3"
    split
    map() do kv
        @p begin
            split(kv, '=')
            Symbol(__[1]) => parse(Int, __[2])
        end
    end
    NamedTuple
end
Now this has a more succinct syntax in DataPipes: the lambda function body is automatically wrapped in an inner pipe when its only argument is named __ (double underscore). The intuition is that __ refers to the previous pipeline step in DataPipes, so by binding the argument to __ we effectively start a new pipe.
Here is the same example using the new feature:
@p begin
    "a=1 b=2 c=3"
    split
    map() do __
        split(__, '=')
        Symbol(__[1]) => parse(Int, __[2])
    end
    NamedTuple
end
Essentially, we got rid of one nesting level and the @p begin end boilerplate.
The idea that such nesting can be simplified is taken from a post on the Julia Slack; unfortunately, I cannot find that post anymore.