[ANN] DataPipes.jl 0.3.0

Funny, this is already implemented in Transducers.jl (all the more reason to use this awesome package):

using Transducers
using Underscores

julia> data = 5:12
5:12

julia> @_  data         |>
           Filter(_>10) |>
           Map(_^2) |>
           collect
2-element Vector{Int64}:
 121
 144

and no type piracy here.

Yes Transducers.jl is really cool for many reasons.

But sometimes you just want to do some quick data processing without any extra dependencies, which is why I kind of wish we had versions of normal map() and filter() as above.
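
For what it's worth, a dependency-free version of that wish can be sketched in a few lines of plain Julia. The names `cmap` and `cfilter` are hypothetical helpers made up for this illustration; they are not part of Base, Transducers.jl, or DataPipes.jl, and defining them under new names avoids any type piracy.

```julia
# Hypothetical curried helpers (illustration only, not from any package).
# Each takes a function and returns a one-argument function usable with |>.
cmap(f) = xs -> map(f, xs)
cfilter(f) = xs -> filter(f, xs)

data = 5:12

# Same pipeline as the Transducers.jl example above, written with Base only.
result = data |> cfilter(x -> x > 10) |> cmap(x -> x^2)
```

Unlike transducers, each step here allocates an intermediate array, which is usually fine for quick interactive work.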

Out of habit, I find code easier to write and read when the function calls are nested (#2) rather than piped (#1).
But I like being able to shorten the syntax.
The first form I tried was #5, which didn’t work. I then found the other forms that give the expected result, but I’m not sure I understand why #5 doesn’t work and #4 does.
Maybe there is an even better way to get what I was looking for.


@p 1:4 |> map(_^2) |> filter(exp(_) > 5)        #1

filter(x->exp(x)>5,map(x->x^2,1:4))             #2

@p filter(exp(_)>5, @p begin map(_^2,1:4) end ) #3


@p filter(exp(_)>5, @p(map(_^2,1:4)))           #4

@p filter(exp(_)>5, @p map(_^2,1:4) )           #5

I would also appreciate some thoughts on how the efficiency of these forms compares.

All these five variants work for me and give exactly the same results with DataPipes.jl 0.1.7. Could you please share the error you are getting?

Maybe you could drop map and use broadcasting instead, for example (keeping the same number of steps):
@p 1:4 .|> _^2 |> filter(exp(_)>5)

OK. I have the same version (downloaded yesterday), and now I get the same results as you. I don’t know what to think.
I apologize for the wrong report.
I had initially tried the expression

@p filter(exp(_)>5, map(_^2,1:4) )

I then corrected it to form #5, but it seemed to me (perhaps I confused it with the initial form) that this didn’t work either, so I tried #3 and then #4.

In this case, could the nested form be used with the _1 placeholder?

PS
Is it possible to retrieve the output log of yesterday’s session in the VS Code environment? (I do have the log of the inputs.)

The _1 placeholder has very different semantics now; I don’t think it is possible to combine them somehow.
And explicit delimiters (like @p) are necessary anyway with the nested function style you use. Otherwise the meaning would be ambiguous.

Thanks for taking the time.
Maybe I’m still missing something about how @p works.
If it did a plain textual substitution of f(_) with x -> f(x), then the naive form I used first would work.
That holds even when the same variable name is reused, since the scopes are different:

 filter(x->exp(x)>5, map(x->x^2,1:4) )

Instead, the following two tests suggest that, for nested function calls, @p acts only at the top level.

julia> @p filter(exp(_)>5, map(x->x^2,1:4) )
3-element Vector{Int64}:
  4
  9
 16

julia> @p filter(x->exp(x)>5, map(_^2,1:4) )
ERROR: MethodError: no method matching filter(::var"#137#139", ::var"#138#140")

Unfortunately, the naive replacement you suggest doesn’t really work well in general. The main issue is to determine function boundaries: what does a(b(c(_), _), d(e(f(_) + _))) get converted to?
That’s why DataPipes.jl takes the approach of converting only top-level function arguments (containing _) to anonymous functions.
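
To make the two tests above concrete, here is a hand-written sketch in plain Julia of what each call roughly amounts to. These are assumed equivalents inferred from the behaviour shown in this thread, not actual macro output, and the variable names are made up for the illustration.

```julia
# Assumed equivalent of:  @p filter(exp(_) > 5, map(x -> x^2, 1:4))
# Only the top-level argument `exp(_) > 5` contains `_`, so only it becomes a lambda.
works = filter(x -> exp(x) > 5, map(x -> x^2, 1:4))

# Assumed equivalent of:  @p filter(x -> exp(x) > 5, map(_^2, 1:4))
# Here the whole second argument contains `_`, so it becomes a function itself,
# and `filter` receives two functions instead of a function and a collection.
second_arg = y -> map(y^2, 1:4)
failed_with_method_error = try
    filter(x -> exp(x) > 5, second_arg)
    false
catch err
    err isa MethodError
end
```

This matches the MethodError reported above, whose signature lists two anonymous functions as the arguments of filter.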

In the meantime, DataPipes.jl has got a significant new feature, and the placeholder style changed to a more convenient one.

You can continue using the old syntax in 0.1.x versions, or switch to the 0.2 version with the updated placeholders. Future features will only appear in 0.2 and won’t be backported.

Improved docs

The README (Alexander Plavin / DataPipes.jl · GitLab) is now shorter and doesn’t go into depth. More detailed documentation, together with a set of worked-out data processing tasks, is available as a Pluto notebook: see the HTML version.

New feature: @aside macro

Available in both 0.1 and 0.2 versions.
Perform a side computation without breaking the pipeline:

@p begin
	data
	@aside avg = mean(_.age)
	map((; _.name, _.age, above_average=_.age > avg))
end

Also plays nice with @export to export the variable to the outer scope:

@p begin
	data
	@aside @export avg = mean(_.age)
	map((; _.name, _.age, above_average=_.age > avg))
end

# avg is available here
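
For readers who have not used @aside, the pipeline above corresponds roughly to the following plain Julia. This is a sketch under assumptions: the sample `data` is made up, and the expansion is written by hand based on the rules described in this thread rather than taken from the macro.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical sample data; the thread does not show the actual table.
data = [(name = "Ann", age = 30), (name = "Bob", age = 40), (name = "Cy", age = 50)]

# Side computation: roughly what `@aside avg = mean(_.age)` does.
# Its result is stored in `avg` and does not replace the pipeline value.
avg = mean(x -> x.age, data)

# Main pipeline step: still operates on `data`, and can use `avg`.
result = map(x -> (; x.name, x.age, above_average = x.age > avg), data)
```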

The idea for the @aside macro is taken from Chain.jl.

New placeholder syntax

Old DataPipes@0.1:

  • _ - first/only lambda argument
  • __, ___, … - second, third, and further lambda arguments
  • ↑ - result of the previous step
  • _1 - lambda argument _ of the outer pipe

New DataPipes@0.2:

  • _ - first/only lambda argument, same as before
  • _2, _3, … - second, third, and further lambda arguments (_1 also works, and is equivalent to _)
  • __ - result of the previous step
  • _ꜛ - lambda argument _ of the outer pipe (type the arrow with \^uparrow)

Motivation:

  • Referring to the previous step result turned out to be more common than I originally thought. The old symbol ↑ is more difficult to type than the new __, which matters for common operations. Also, ↑ parses as an operator name in Julia, which sometimes required putting it in brackets.
  • _2 instead of __ takes about the same typing effort, and __ is already taken by the previous bullet.
  • _ꜛ is somewhat more difficult to type, but accessing the outer pipe argument is needed much less often in my experience, so a compromise here is acceptable. Still, _ꜛ parses as a regular name in Julia and doesn’t require extra parens, unlike the old ↑.
    Suggestions for other placeholders to replace _ꜛ are welcome!

Didn’t think I’d push another significant update so soon, but here it is (:
upd: DataPipes@0.2.1 is registered in General.

I perform essentially all data manipulation tasks with DataPipes and haven’t encountered many pain points with it. Still, there are a couple of common scenarios that can be made cleaner, with less boilerplate. They mostly revolve around working with nested data, and I have now addressed some of them.

A common pattern is lambda functions consisting only of inner “pipes” (@p), especially with the map function. Like this simple parsing of a key-value string into a named tuple:

@p begin
	"a=1 b=2 c=3"

	split
	map() do kv
		@p begin
			split(kv, '=')
			Symbol(__[1]) => parse(Int, __[2])
		end
	end
	NamedTuple
end

Now, it has a more succinct syntax in DataPipes: the lambda function body is automatically wrapped with an inner pipe when the only argument is __ (double underscore). The intuition is that __ refers to the previous pipeline step in DataPipes, and by assigning to __ we effectively start a new pipe.
Here is the same example using the new feature:

@p begin
	"a=1 b=2 c=3"

	split
	map() do __
		split(__, '=')
		Symbol(__[1]) => parse(Int, __[2])
	end
	NamedTuple
end

Essentially, we got rid of one nesting level and the @p begin end boilerplate.
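
For comparison, here is the same key-value parsing written without DataPipes at all. This is a plain-Julia sketch of what the pipeline computes, not the code the macro generates; the intermediate names are made up for the illustration.

```julia
# Split the input on whitespace into "key=value" chunks.
kvs = split("a=1 b=2 c=3")

# For each chunk, split on '=' and build a Symbol => Int pair.
kv_pairs = map(kvs) do kv
    parts = split(kv, '=')
    Symbol(parts[1]) => parse(Int, parts[2])
end

# Splat the pairs into a named tuple.
nt = (; kv_pairs...)
```

The `parts` variable plays the role that `__` plays inside the inner pipe: it holds the result of the previous step.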

The idea that such nesting can be simplified is taken from a post on the Julia Slack. Unfortunately, I cannot find that post anymore.

@aplavin this looks great!

Is the Readme up to date there? I see it was created 11 months ago; or am I looking at the wrong thing?

See possible explanation here.

Thanks, that makes sense!

@aplavin FWIW I often use the date of the last change to assess if a repo is still being maintained, and skipped this one in favor of others because I thought it wasn’t being changed. Maybe I’m an outlier? If not, it might be worth “rounding” by month / quarter rather than year, or adding a note to the Readme of the date of the latest release, or including the release on GitLab.

Maximilian, indeed the linked explanation is correct. I was also thinking about finer rounding, or just assigning all commit dates to “now” - release times are effectively public anyway. Will probably go with this approach…
Anyway, JuliaHub shows the correct release dates on the package page, so maybe I should just link to JuliaHub instead of the repo itself.

Announcing another update, which should help with writing and debugging long pipelines:

The @pDEBUG macro

Its intended usage is when your pipeline doesn’t work as expected, possibly throwing an error somewhere:

julia> @p begin
           1:5
           map(_ ^ 2)
           filter(_ > 3)
           only
       end
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element

Replace (temporarily) @p with @pDEBUG, and it’ll export all intermediate results till the first error into the _pipe variable:

julia> @pDEBUG begin
           1:5
           map(_ ^ 2)
           filter(_ > 3)
           only
       end
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element

julia> _pipe
3-element Vector{Any}:
 1:5
 [1, 4, 9, 16, 25]
 [4, 9, 16, 25]

_pipe is a vector of all intermediate pipe results in their order.

Now, you can clearly see why the error in only happened.

Further, all intermediate pipe variables are also exported:

julia> @pDEBUG begin
           1:5
           x = map(_ ^ 2)
           ...
       end

julia> x
 [1, 4, 9, 16, 25]

Note: @pDEBUG exports to the global scope, so the variables are accessible even if the pipe is inside a function.
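
The idea of recording every intermediate result up to the first error can be sketched in plain Julia as follows. The helper `run_steps` is hypothetical and written only for this illustration; it is not how @pDEBUG is implemented, and unlike @pDEBUG it returns the trace instead of exporting it to the global scope.

```julia
# Hypothetical helper: apply `steps` one by one, keeping every intermediate result.
# Returns the trace collected so far and the error that stopped it (or `nothing`).
function run_steps(input, steps)
    results = Any[input]
    try
        for step in steps
            push!(results, step(results[end]))
        end
    catch err
        return results, err
    end
    return results, nothing
end

# The same pipeline as in the @pDEBUG example, written as explicit functions.
steps = [
    xs -> map(x -> x^2, xs),
    xs -> filter(x -> x > 3, xs),
    only,   # throws ArgumentError because more than one element remains
]

trace, err = run_steps(1:5, steps)
```

Inspecting `trace` shows the input and the results of the two successful steps, which is the same information that _pipe holds in the example above.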

Sorry for being off-topic, but consider updating your clock :smiley:
Screenshot 2022-01-12 at 14.51.36

That’s my hello from near future (:

What’s a little floor vs ceil between friends…

A new version of DataPipes is now released: 0.3.0.
README and docs are significantly improved. Unfortunately, the first post in this thread isn’t editable anymore.

Nothing breaks in the core piping functionality. The only breaking change in the package is the removal of convenience functions that were previously defined in DataPipes: filtermap, mutate, and a few more.
Now, DataPipes just implements its piping syntax and does nothing else. There are no dependencies anymore, and the loading time is less than 1 ms. These changes make DataPipes itself a no-brainer to include as a dependency, even in very lightweight projects.

Of course, the removed data processing functions are useful in their own right. Over time, more and more such functions accumulated in DataPipes, which didn’t really make sense conceptually. I’m making them more general and performant, and plan to release them as a separate package soonish. If you use them, feel free to stay on DataPipes@0.2 for now: the removal of those functions is basically the only difference between the versions.

There are also a couple of minor fixes/improvements in the core piping functionality since the previous announcement here. I haven’t encountered any serious issues for a long time in my pretty heavy usage of DataPipes. So, recent changes addressed some remaining corner cases:

  • implicit inner pipes (that start with __) now work everywhere they make sense, including kwargs
  • @p let ... end and @p begin ... end forms generate corresponding blocks, let or begin, making variable scoping consistent with plain Julia
  • function (x) ... end is treated exactly the same as x -> ...
  • qualified function calls also work without brackets, as in regular pipes: @p data |> Iterators.flatten

Let me announce another release of DataPipes, v0.3.5 - already registered in General.

The main highlight since the last update is

Pipe broadcast: .|>

Sometimes operations are more natural to write as a map call, sometimes as a broadcast. To make the latter more convenient, DataPipes now supports the broadcasted pipe, .|>.

For the regular pipe |>, the __ placeholder gets replaced with the result of the previous step. Likewise, for the .|> broadcasted pipe, __ means a single element of the previous step result.
When this placeholder is not used, DataPipes implicitly appends it to the function arguments, same for |> and .|>.

Some examples:

julia> @p "1, 2, 3, 4" |> eachmatch(r"(\d)") .|> __.captures[1]
4-element Vector{SubString{String}}:
 "1"
 "2"
 "3"
 "4"

julia> @p [[1, 2], [3]] .|> __ .+ 1
2-element Vector{Vector{Int64}}:
 [2, 3]
 [4]

julia> @p 1:10 |> group(_ % 3) .|> map(_ ^ 2)
3-element Dictionaries.Dictionary{Int64, Vector{Int64}}
 1 │ [1, 16, 49, 100]
 2 │ [4, 25, 64]
 0 │ [9, 36, 81]

The last case especially demonstrates that .|> is convenient for processing nested datasets in a single line. Alternatives, such as nested maps, are noisier for simple one-liner pipes.
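
To show what the "noisier" alternatives look like, here are rough plain-Julia counterparts of the three examples. They are hand-written sketches, not macro expansions: .|> is approximated with map, and since group comes from an external package, the grouping is emulated with a Base Dict, so the container type differs from the Dictionary shown above.

```julia
# 1. Apply a per-element operation to every regex match.
digit_strs = map(m -> m.captures[1], eachmatch(r"(\d)", "1, 2, 3, 4"))

# 2. Add 1 inside each inner vector.
incremented = map(v -> v .+ 1, [[1, 2], [3]])

# 3. Group 1:10 by remainder mod 3, then square the values within each group.
groups = Dict{Int, Vector{Int}}()
for x in 1:10
    push!(get!(groups, x % 3, Int[]), x)   # create the group on first use
end
squared = Dict(k => map(x -> x^2, v) for (k, v) in groups)
```

The third case needs an explicit loop and a second pass over the groups, which is where the single-line .|> form saves the most.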
