[ANN-RFC] DFMacros.jl

jules · June 11, 2021, 11:11pm

Hi there,

I really like using DataFrames, but what I don’t like so much is redundant syntax and creating all these anonymous functions by hand. I’ve used DataFramesMeta for a while and it’s great, but it’s not quite the right fit for me. That’s why I’m making a new package that does things exactly how I like them, in an opinionated way.

Here’s the current readme. The RFC in the title relates mostly to the name, I’m not really happy with DFMacros. Maybe you have better ideas!

DFMacros.jl

The following macros are currently available:

@transform
@select
@groupby
@combine
@subset
@sort

These are the most important opinionated aspects that differ from other packages:

@transform, @select and @subset work row-wise by default, @combine works column-wise by default. This matches the most common modes these functions are used in and reduces friction.
@groupby and @sort allow using arbitrary expressions including multiple columns, without having to @transform first and repeat the new column names.
Keyword arguments to the macro-underlying functions work by separating them from column expressions with the ; character.
Column expressions are interpolated into the macro with $.
Target column names are written with : symbols to avoid visual ambiguity (:newcol = ...). This also allows using AsTable as a target, as in DataFrames.jl.
A flag macro (@c or @r) can be used to switch between row/column-based mode.
The flag macro can also include the character m to switch on automatic passmissing in row-wise mode.

Examples

using DFMacros
using DataFrames
using Random
using Statistics
Random.seed!(123)

df = DataFrame(
    id = shuffle(1:5),
    group = rand('a':'b', 5),
    weight_kg = randn(5) .* 5 .+ 60,
    height_cm = randn(5) .* 10 .+ 170)

5×4 DataFrame
 Row │ id     group  weight_kg  height_cm
     │ Int64  Char   Float64    Float64
─────┼────────────────────────────────────
   1 │     1  b        64.9048    161.561
   2 │     4  b        59.6226    161.111
   3 │     2  a        61.3691    173.272
   4 │     3  a        59.0289    175.924
   5 │     5  b        58.3032    173.68

@select

@select(df, :height_m = :height_cm / 100)

5×1 DataFrame
 Row │ height_m
     │ Float64
─────┼──────────
   1 │  1.61561
   2 │  1.61111
   3 │  1.73272
   4 │  1.75924
   5 │  1.7368
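
For comparison, here is a rough sketch of what the row-wise @select above might correspond to in plain DataFrames.jl. It assumes the macro does little more than wrap the expression in ByRow, which may not match the actual generated code. The heights are copied from the table above rather than regenerated from the random seed.

```julia
using DataFrames

# Heights copied from the example table above.
df = DataFrame(height_cm = [161.561, 161.111, 173.272, 175.924, 173.68])

# ByRow lifts a scalar function so that it is applied to each row's value.
# This is the anonymous function the macro saves you from writing by hand.
result = select(df, :height_cm => ByRow(h -> h / 100) => :height_m)
```

The source => function => target pair is the standard DataFrames.jl transformation syntax; the macro form only hides it.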

@select(df, AsTable = (w = :weight_kg, h = :height_cm))

5×2 DataFrame
 Row │ w        h
     │ Float64  Float64
─────┼──────────────────
   1 │ 64.9048  161.561
   2 │ 59.6226  161.111
   3 │ 61.3691  173.272
   4 │ 59.0289  175.924
   5 │ 58.3032  173.68

@transform

@transform(df, :weight_g = :weight_kg * 1000)

5×5 DataFrame
 Row │ id     group  weight_kg  height_cm  weight_g
     │ Int64  Char   Float64    Float64    Float64
─────┼──────────────────────────────────────────────
   1 │     1  b        64.9048    161.561   64904.8
   2 │     4  b        59.6226    161.111   59622.6
   3 │     2  a        61.3691    173.272   61369.1
   4 │     3  a        59.0289    175.924   59028.9
   5 │     5  b        58.3032    173.68    58303.2

@transform(df, :BMI = :weight_kg / (:height_cm / 100) ^ 2)

5×5 DataFrame
 Row │ id     group  weight_kg  height_cm  BMI
     │ Int64  Char   Float64    Float64    Float64
─────┼─────────────────────────────────────────────
   1 │     1  b        64.9048    161.561  24.8658
   2 │     4  b        59.6226    161.111  22.9701
   3 │     2  a        61.3691    173.272  20.4405
   4 │     3  a        59.0289    175.924  19.0728
   5 │     5  b        58.3032    173.68   19.3282

column flag @c

@transform(df, :weight_z = @c (:weight_kg .- mean(:weight_kg)) / std(:weight_kg))

5×5 DataFrame
 Row │ id     group  weight_kg  height_cm  weight_z
     │ Int64  Char   Float64    Float64    Float64
─────┼───────────────────────────────────────────────
   1 │     1  b        64.9048    161.561   1.61523
   2 │     4  b        59.6226    161.111  -0.388008
   3 │     2  a        61.3691    173.272   0.274332
   4 │     3  a        59.0289    175.924  -0.613175
   5 │     5  b        58.3032    173.68   -0.888383
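
As a hedged comparison, the column-wise @c example could be written in plain DataFrames.jl roughly as follows. The helper name zscore is made up for illustration, and the weights are copied from the table above.

```julia
using DataFrames
using Statistics

# Weights copied from the example table above.
df = DataFrame(weight_kg = [64.9048, 59.6226, 61.3691, 59.0289, 58.3032])

# Without ByRow the function receives the whole column as a vector,
# so reductions such as mean and std see all rows at once.
# `zscore` is a hypothetical helper name used only in this sketch.
zscore(w) = (w .- mean(w)) ./ std(w)

result = transform(df, :weight_kg => zscore => :weight_z)
```

This is why a column flag is needed: in row-wise mode mean(:weight_kg) would only ever see a single number.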

@groupby & @combine

g = @groupby(df, iseven(:id))

GroupedDataFrame with 2 groups based on key: id_iseven
Group 1 (3 rows): id_iseven = false
 Row │ id     group  weight_kg  height_cm  id_iseven
     │ Int64  Char   Float64    Float64    Bool
─────┼───────────────────────────────────────────────
   1 │     1  b        64.9048    161.561      false
   2 │     3  a        59.0289    175.924      false
   3 │     5  b        58.3032    173.68       false
Group 2 (2 rows): id_iseven = true
 Row │ id     group  weight_kg  height_cm  id_iseven
     │ Int64  Char   Float64    Float64    Bool
─────┼───────────────────────────────────────────────
   1 │     4  b        59.6226    161.111       true
   2 │     2  a        61.3691    173.272       true

@combine(g, :total_weight_kg = sum(:weight_kg))

2×2 DataFrame
 Row │ id_iseven  total_weight_kg
     │ Bool       Float64
─────┼────────────────────────────
   1 │     false          182.237
   2 │      true          120.992
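
The @groupby and @combine pair above replaces roughly three plain DataFrames.jl steps. The sketch below is one way to write them by hand, assuming the key column is named id_iseven as in the output above. It is not necessarily what the macros expand to.

```julia
using DataFrames

# Values copied from the example table above.
df = DataFrame(
    id = [1, 4, 2, 3, 5],
    weight_kg = [64.9048, 59.6226, 61.3691, 59.0289, 58.3032])

# Step 1: materialise the grouping expression as a real column.
with_key = transform(df, :id => ByRow(iseven) => :id_iseven)

# Step 2: group on the new column, repeating its name.
g = groupby(with_key, :id_iseven)

# Step 3: combine is column-wise, so sum sees each group's whole column.
result = combine(g, :weight_kg => sum => :total_weight_kg)
```

The repetition of :id_iseven between steps 1 and 2 is the friction that @groupby with an arbitrary expression is meant to remove.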

@sort

@sort(df, -sqrt(:height_cm))

5×4 DataFrame
 Row │ id     group  weight_kg  height_cm
     │ Int64  Char   Float64    Float64
─────┼────────────────────────────────────
   1 │     3  a        59.0289    175.924
   2 │     5  b        58.3032    173.68
   3 │     2  a        61.3691    173.272
   4 │     1  b        64.9048    161.561
   5 │     4  b        59.6226    161.111
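
Sorting by an expression can also be done by hand with a permutation vector. This is a minimal sketch with the values copied from the table above; whether @sort works this way internally is an assumption.

```julia
using DataFrames

# Values copied from the example table above.
df = DataFrame(
    id = [1, 4, 2, 3, 5],
    height_cm = [161.561, 161.111, 173.272, 175.924, 173.68])

# Evaluate the sort expression for every row, then reorder rows by it.
perm = sortperm(-sqrt.(df.height_cm))
result = df[perm, :]
```

Negating the key gives a descending order by height, matching the output above.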

passmissing flag @m

df = DataFrame(name = ["joe", "jim", missing, "james"])

@transform(df, :cap_name = @m uppercasefirst(:name))

4×2 DataFrame
 Row │ name     cap_name
     │ String?  String?
─────┼───────────────────
   1 │ joe      Joe
   2 │ jim      Jim
   3 │ missing  missing
   4 │ james    James
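
A plain DataFrames.jl version of the @m example might look like the sketch below. It assumes the flag simply wraps the row function in passmissing, which comes from Missings.jl and is re-exported by DataFrames.

```julia
using DataFrames  # also makes passmissing from Missings.jl available

df = DataFrame(name = ["joe", "jim", missing, "james"])

# passmissing(f) returns missing whenever an argument is missing,
# and calls f otherwise. ByRow then applies it to each row.
result = transform(df, :name => ByRow(passmissing(uppercasefirst)) => :cap_name)
```

Without passmissing, uppercasefirst(missing) would throw a MethodError on the third row.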

pdeffebach · June 11, 2021, 11:40pm

Hey! I’m glad you put development into this.

As the maintainer of DataFramesMeta I have to say I’m a bit bummed you chose to fork the package. In particular, a lot of the things that this package is trying to do are, I think, just around the corner in DataFramesMeta:

row-wise by default. We are almost done with the addition of a @byrow flag to allow for row-wise operations. See here. In particular I really like your idea for .= and it should be easy to add after it’s finished.
groupby with expressions. I think this is a great idea and should definitely be added to DataFramesMeta.
Interpolation with $. I think this is probably the direction to go in.
:x instead of x. I think this is probably the move as well. I have procrastinated because it’s a very breaking change.
Macro flags were hard with the design, but with the addition of the block syntax more flags can be added in the future.

In short I think a lot of these should be added to DataFramesMeta, I think it would be a shame to splinter the ecosystem with multiple packages that implement @transform etc.

Should we try to agree on a plan to upstream lots of these changes into DataFramesMeta?

jules · June 12, 2021, 5:51am

Yeah it would be nice if some of these things made their way into DataFramesMeta, I contributed some of the code there myself so it’s not that I don’t have any interest in that package. It’s more that such a package is, to me, mostly about convenience, so for example @byrow is not really what I want to write all the time, and I’ve noticed in my analysis style I use it almost every line. That’s unlikely to change in DataFramesMeta, right?

About all the other points, note how you said they should or could be added, but were pretty breaking. I’ve followed all the discussions for some months, and it seemed unlikely to me that the project would fully go in this direction, or it would at least take quite long as DataFramesMeta has too many users already which need to be accommodated. To some degree, I need an analysis package for my work now, so I can’t wait for all that to resolve. It’s also not really a fork, I wrote this from scratch to handle some design issues deeper down.

This is more of a “if someone out there happens to have the exact same pain points as me, here could be a solution” thing. It’s not an infrastructure but an end user package, so there’s in my mind no danger of fracturing anything.

So to sum up, DataFramesMeta is great, and I’m sure the issues mentioned above will resolve at some point one way or another, yet I think there is also enough space for slightly different approaches (the “slightly” matters to me a lot).

eliascarv · June 12, 2021, 9:16am

Some suggestions for the package name:

DFTools.jl
DFToolbox.jl
DFManipulation.jl
DFManipulations.jl
DataManipulation.jl

I didn’t find good names either.

But I love your package idea, the syntax is perfect! It was definitely the data manipulation package I was looking for.

Congratulations on your amazing work!

EvoArt · June 12, 2021, 10:14am

This looks great. I think this will be more convenient for my analysis workflow than anything else I’ve seen in Julia.

I agree that there should be space for different approaches.

I do have one thought though, re convenience and splintering of the ecosystem. Would it be worth having a way to set your default options in DataFramesMeta, like dfmoptions(groupby = :row) or some such? Maybe this already exists?

jules · June 12, 2021, 10:50am

I have also thought about that. Could be an option. Maybe it’s too complicated if you need to keep this option in mind to understand code.

juliohm · June 12, 2021, 11:24am

Amazing @jules ! Any chance the package could be generalized to Tables.jl tables as well? That would really solve the major pain point I have with the alternatives. The only package I can use currently is Query.jl because of its generality but it would be nice to see other approaches that work with Tables.jl and have clean syntax.

EvoArt · June 12, 2021, 11:50am

Good point!

xiaodai · June 13, 2021, 11:19am

Very nice and well thought out.

I can see both sides. But I think a bit more choice is never a bad thing. I will definitely give DFMacros a try. On the surface, I will probably adopt DFMacros for my workflow.

Will this have performance implications? Or is it just auto-broadcasting that saves typing the . everywhere?

jules · June 13, 2021, 11:47am

Just auto wrapping in ByRow because I think it’s a better default for everything but combine. Many string logic things are annoying to write in broadcasting style for example, or if you have unusual objects where you need to index or access properties. It also makes missings easier to handle, because you can wrap the auto-byrow function in passmissing with the @m flag. Which again pertains mostly to string and object manipulation as number functions such as + and * in many cases already propagate missings.

To add one more thought, I think the crux is that broadcasting is most useful where different dimensions come together. But in DataFrames, everything is forced to be same-length vectors anyway.
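
To make the point about string logic concrete, here is a small sketch comparing a row-wise function with a broadcast version of the same rule. The column, the helper name shorten and the abbreviation rule are all invented for illustration.

```julia
using DataFrames

df = DataFrame(name = ["joe", "jim", "james"])

# Row-wise: ordinary scalar code, including a plain conditional.
# `shorten` is a hypothetical helper used only in this sketch.
shorten(n) = length(n) > 3 ? first(n, 3) * "." : n
by_row = transform(df, :name => ByRow(shorten) => :short)

# Column-wise: the same rule needs dots on every call and ifelse
# instead of the ternary operator.
broadcasted = transform(df,
    :name => (col -> ifelse.(length.(col) .> 3, first.(col, 3) .* ".", col)) => :short)
```

Both calls produce the same column; the row-wise form just reads like normal scalar Julia.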

CameronBieganek · June 16, 2021, 9:55pm

There is an issue in DataFramesMeta.jl about a possible rename where the name DataFramesMacros.jl was suggested. However, it looks like the maintainers are leaning towards staying with the name DataFramesMeta. So, if they decide to stay with DataFramesMeta, it might be reasonable to rename DFMacros.jl to DataFramesMacros.jl.

juliohm · June 16, 2021, 10:07pm

If you are considering Tables.jl tables in the future, maybe a more general name would make sense without DataFrame on it. TableTools.jl or TableMacros.jl, something short to type.

jules · June 17, 2021, 6:11am

All the macros forward to dataframes functions, I’m not sure if that can be made generic for all tables.

jules · June 17, 2021, 6:15am

I thought about that as well, but then thought it was a bit too similar to DataFramesMeta and could be confusing. I think DF for DataFrame is common enough as an abbreviation, no?

JeffreySarnoff · June 17, 2021, 8:36am

Respectfully … no.
While DF for DataFrame is common enough to be understood by those who commonly use DF to mean DataFrame, most of the Julia Community members who may utilize DataFrames for some purpose are not them.

sijo · June 17, 2021, 8:52am

I like the name DFMacros. Someone who has no idea about data frames will have to learn about DataFrames.jl anyway so I don’t see this as an obstacle.

As for the redundancy with DataFramesMeta: I also prefer when there is one standard way to do things in a mature ecosystem. It helps a lot with readability and getting familiar with other people’s code. But maybe it’s too early to say what the “standard way” should be here, so I’m happy to see another package trying different things… Better to try things with complete freedom and gather the best parts in a later package.

I’d rather avoid having global state change the meaning of the code. It would be quite bad for readability and sharing code. Imagine if every time you find a solution on Discourse you have to check if it’s valid for your particular defaults…

jules · June 17, 2021, 9:15am

“I also prefer when there is one standard way to do things in a mature ecosystem”

I think this is an important point, let me stress that I do not aim for this package to become the “standard way”, I’m just putting this up for like-minded people. If DataFramesMeta maintains the close coupling with DataFrames, for example being mentioned explicitly in the documentation, etc, I don’t think there’s a danger for this package to interfere with that.

I would also say that because such macro packages are quite simple (mine is 260 lines of code), it’s not much wasted effort to make a new one. It would be a different story if I attempted to make a whole new DataFrames.jl with a couple of smaller changes.

JeffreySarnoff · June 17, 2021, 9:18am

I am someone who is familiar with data frames and their use. DFMacros is not transparent to me (DataFlow is a more common DF than DataFrame) – this is an example of using an acronym/abbreviation in package naming that the guidelines prefer to avoid (5. Creating Packages · Pkg.jl).

jules · June 17, 2021, 9:21am

I agree in principle, but don’t you think DataFrameMacros and DataFramesMeta are a bit close? I’m not particularly attached to DFMacros.

sijo · June 17, 2021, 9:32am

@JeffreySarnoff good point regarding the guidelines! Though maybe that one makes less sense for a package that builds on another package, as we have here… Also I care a lot about clarity in the API (e.g. function names) but I don’t think package names need to be transparent.

Many software projects don’t have transparent names, they just have a name, which is part of their identity. A few random examples: Gurobi, Stan, Ansys, Gumbo, Pango, GTK, Qt, TensorFlow.

Or some more Julian examples: Turing, Gen, Makie, Zygote, Flux. I don’t think there’s anything wrong with these names.

@jules just in case it was not clear, I think it’s good to have this new package now, and maybe consolidate in the future.
