DataFrames.jl development survey

As @xiaodai mentioned, this kind of syntax is impossible in Julia without a macro. Julia evaluates the arguments to functions eagerly, which means that expressions passed as arguments to functions are evaluated before they are actually passed to the function. (They are evaluated in the scope in which the function is called.) So :sex .== "male" evaluates to a single false, which is then passed to getindex. So df[:sex .== "male"] is equivalent to df[false], or getindex(df, false).
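To make the eager-evaluation point concrete, here is a minimal sketch using only Julia Base. The macro name `@capture_expr` is hypothetical and exists only for this illustration; it shows that a macro sees the expression itself, while an ordinary call only ever sees the already-computed value.

```julia
# Arguments are evaluated before the call. Broadcasting `==` over a Symbol and
# a String compares two scalars, so the result is a single Bool.
eager = :sex .== "male"

# A macro receives the unevaluated expression instead.
# `@capture_expr` is a hypothetical name used only for this illustration.
macro capture_expr(ex)
    return QuoteNode(ex)   # hand the expression back as data
end

captured = @capture_expr(:sex .== "male")

println(eager)             # the value a function such as getindex would receive
println(typeof(captured))  # the macro got an expression object, not a Bool
```

This is why packages that offer this kind of syntax have to expose it through macros rather than plain functions.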

R is unusual because it evaluates function arguments lazily. In other words, argument expressions are passed into the function unevaluated. So metaprogramming can happen inside any R function.

The DataFrames filter function follows the same syntax as the Base filter function, where the first argument to filter is the predicate function. This is actually the standard for functional languages. Even core R uses this syntax for its Filter function. It's just dplyr that's different.

I support the introduction of where in place of filter anyway, because filter can be confusing.

I understand your point, but still this code looks very strange:

@pipe df |>
 filter(:sex => ==("male"), _) |>
 groupby(_, :pclass) |>
 combine(_, :age => mean)

If DataFrames.filter needs to be consistent with Base.filter, then why not make the other DataFrames functions take the DataFrame as the last argument instead of the first one?

This argument has already been made and DataFrames.jl will NOT make a change. That's why I put a new filter into DataConvenience.jl

using Pipe, DataConvenience
using DataFrames
using Statistics  # provides mean

@pipe df |>
 filter(_, :sex => ==("male")) |>
 groupby(_, :pclass) |>
 combine(_, :age => mean)

Yeah, I saw that. But still, I guess most people would use DataFrames.filter instead of your filter, right? That's why I say it is strange.

100%. But you can only change what you can change. Unless you become a core dev of DataFrames.jl and can argue for its inclusion, there is no point in talking about it. I suggest you raise an issue on GitHub if you feel so strongly about it.

Regarding filter: the core design principle of DataFrames.jl is that things taken from Julia Base should work the same way as in Julia Base. The reason is that if someone learns Julia Base, it should be possible to use the same constructs in DataFrames.jl in the same way. The point is that, as opposed to R, a typical program in Julia does not revolve around the DataFrames.jl package, but rather around Julia Base, and DataFrames.jl is just one of many add-ons.

filter in Julia Base accepts the predicate as its first argument to allow for do-block notation, and I do not think this is going to be changed. Therefore we will not change it in DataFrames.jl.
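As a small illustration of why the predicate comes first, here is a sketch using only Base filter on a plain vector (the `ages` data is made up for the example). A do-block is passed as the first argument of the call, so it only works with functions that take the predicate in that position.

```julia
ages = [22, 38, 26, 35, 54]   # made-up sample data

# The do-block body becomes an anonymous function that is passed as the
# FIRST argument to filter.
over_30 = filter(ages) do age
    age > 30
end

# Equivalent call with an explicit anonymous function.
same = filter(age -> age > 30, ages)

println(over_30)
println(over_30 == same)
```

The do-block form pays off mainly when the predicate spans several lines.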

Similarly, if a predicate returns missing, filter throws an error, both in Julia Base and in DataFrames.jl.
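The behavior with missing can be sketched in Julia Base alone (the vector `xs` is made up for the example). The use of `coalesce` below is just one common way to treat missing as false; it is not something prescribed in this thread.

```julia
xs = [1, missing, 3]   # made-up sample data

# `missing > 1` evaluates to `missing`, which is not a Bool, so filter throws.
err = try
    filter(x -> x > 1, xs)
    nothing
catch e
    e
end
println(typeof(err))

# One way to opt in to "missing means drop the element":
kept = filter(x -> coalesce(x > 1, false), xs)
println(kept)
```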

Finally, filter on a GroupedDataFrame filters whole groups, not rows within each group.

That is why I am open to adding a new function, tentatively called where, that will have a different signature and different behavior when working with missing and GroupedDataFrame. I have summarized it in https://github.com/JuliaData/DataFrames.jl/issues/2323#issuecomment-699600625.


Now, regarding the design objectives: the goal for the DataFrames.jl 1.0 release is to ensure that we provide every required operation in an efficient way without using macros. In general, it is currently assumed that convenience wrapper libraries will build on top of this to provide terser syntax (there are many proposals for how to do this, as different people like different things; this is exactly why DataFrames.jl concentrates on low-level functionality).

After the 1.0 release I think this will not change much, as we will most likely concentrate on performance in DataFrames.jl (multi-threading etc.). But the plan is to introduce more convenience functions (mostly ones that extend what is available in Julia Base and Statistics.jl/StatsBase.jl, reusing the API defined there).

However, in the very long run it is possible that DataFramesCore.jl will be split out, in which case we will come back to rethink what should go into DataFrames.jl. But, realistically, this is at least one year away (as stabilizing the API for 1.0 and then improving performance are the top priorities for now). Because of this, "convenience wrapper" packages around DataFrames.jl are encouraged (and there is a lot of work going on in this area currently).

Are you referring to the vote mentioned in "Add some methods for elegant piping" (Issue #2416, JuliaData/DataFrames.jl)? The first vote was 7 vs 3 in favor of the curried versions. A later vote was 7 vs 8, against the curried versions. You said the first vote was "not quite conclusive", so certainly you'll agree the second vote is hardly conclusive :wink:
And indeed the issue is still open…

Yes, the vote is open. Fortunately the change is non-breaking, so we can decide on it after the 1.0 release. This is also a great example of what I commented on earlier: in many areas people tend to disagree on what is best (probably because different people have different use cases and habits). Therefore we will not rush into deciding on such changes. Rather, we want DataFrames.jl to have what is essential (even if it is not always the most convenient) and delegate the convenience methods to utility packages. Then, in the long run, we can incorporate into DataFrames.jl the designs from these packages that prove robust and are liked by the community.

Note that, e.g., adding curried versions of the functions:

  1. can become obsolete if Julia Base starts providing convenient ways to curry methods
  2. is less of a problem in utility packages, even if it involves type piracy (assuming they are designed as extensions to the core package and kept in sync with it, in particular by maintaining proper version bounds in Project.toml).
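For readers unfamiliar with the term, here is a rough sketch of what a "curried version" means, using only Julia Base. `==("male")` is an existing Base form that returns a one-argument function (the same form used in `:sex => ==("male")` earlier in the thread). The helper `keep` is a hypothetical name invented for this example; it is not part of DataFrames.jl or of the proposal in the issue, and because it is a new function it involves no type piracy.

```julia
# Base already provides partially applied comparison operators:
is_male = ==("male")          # a one-argument function
println(is_male("male"))
println(is_male("female"))

# Hypothetical curried wrapper: given only a predicate, return a function
# that still expects the collection. This makes plain `|>` piping possible.
keep(pred) = xs -> filter(pred, xs)

sexes = ["male", "female", "male"]   # made-up sample data
males = sexes |> keep(==("male"))
println(males)
```

Adding a one-argument method to an existing function such as filter for types a package does not own would be the type piracy case mentioned in the list above.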

No, the idea was to include in DataFrames the basic blocks that allow implementing DataFramesMeta (or other frontends) with relatively little code. We have mostly achieved this goal in the recent weeks.

Maybe at some point DataFrames will include convenience macros, or maybe a new package will include them. But that's really not the priority right now, and there's nothing wrong with using DataFramesMeta if it fits your needs (after all, in R you use dplyr on top of base data.frame).

I think this is absolutely the right approach. Convenience is highly subjective, so better to separate this out into different packages.

Perhaps a minor suggestion (to @pdeffebach?), but it would be nice if DataFramesMeta had a name that made it clearer what it does. I don't immediately think of 'metaprogramming' when I hear the word 'meta', and now there's also DataFramesMacros.jl, so it's all a bit confusing (at least to me).

Thanks for the feedback! I filed an issue here to discuss.

I also like the data.table syntax a lot, so I made this a while ago: https://github.com/jkrumbiegel/FilteredGroupbyMacro.jl. I haven't used it much yet, to be honest. Maybe it is already outdated given the new DataFrames developments?
