[ANN/RFC] Multi-column expressions with implicit broadcasting for DataFrameMacros.jl

Hi all, I want to share a cool upcoming feature of DataFrameMacros with you (currently in this PR https://github.com/jkrumbiegel/DataFrameMacros.jl/pull/15) and get some thoughts on two specific issues.

The problem I wanted to solve was transformations over multiple columns. As you probably know, DataFrameMacros and DataFramesMeta, which so far have worked relatively similarly, only work for transformations that produce a single column. So something like :x + :y works, but there is nothing like dplyr's across functionality. This is something I badly wanted to have, so I thought about how to get it.

First I wanted to include an additional construct like @across, but then it hit me: why not make better use of the existing DataFrames machinery? So far, I've constructed src => function => sink expressions in my macros, but why not just change that to srcs .=> function[s] .=> sinks? The nice thing is that the new functionality doesn't break the old one at all, because all the single-column specifiers (Symbol, String, Int) just broadcast to single results. But now you can use any multi-column specifier in a functional expression, and the resulting mini-language construct is automatically broadcast over all columns.
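To make the broadcasting idea concrete, here is a minimal sketch in plain Julia (no packages) of how `.=>` behaves for single and multiple specifiers. It only illustrates Base broadcasting and is not the code the macros actually generate; the sink names `"Age_sqrt"` and `"Fare_sqrt"` are made up for the illustration.

```julia
# Symbols, Strings and Functions are treated as scalars by broadcasting,
# so the single-column case collapses to one ordinary nested Pair.
single = :Age .=> sqrt .=> "Age_sqrt"

# With vectors of sources and sinks, the same expression yields one
# src => (function => sink) Pair per column.
multi = [:Age, :Fare] .=> sqrt .=> ["Age_sqrt", "Fare_sqrt"]

println(single)
foreach(println, multi)
```

Because the scalar case returns a plain Pair rather than a one-element array, the old single-column behavior falls out of the broadcast form without any special handling.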

Here I'll copy-paste a few examples without context; for more explanation, have a look at the preview docs at Tutorial · DataFrameMacros.jl

Selection with a Function

julia> @select(df, $(endswith("e")))
891×3 DataFrame
 Row │ Name                               Age        Fare
     │ String                             Float64?   Float64
─────┼───────────────────────────────────────────────────────
   1 │ Braund, Mr. Owen Harris                 22.0   7.25
   2 │ Cumings, Mrs. John Bradley (Flor…       38.0  71.2833
   3 │ Heikkinen, Miss. Laina                  26.0   7.925
   4 │ Futrelle, Mrs. Jacques Heath (Li…       35.0  53.1
  ⋮  │                 ⋮                      ⋮         ⋮
 889 │ Johnston, Miss. Catherine Helen …  missing    23.45
 890 │ Behr, Mr. Karl Howell                   26.0  30.0
 891 │ Dooley, Mr. Patrick                     32.0   7.75
                                             884 rows omitted

Transformation over multiple columns selected by Type

julia> @select(df, Float32($Real))
891×6 DataFrame
 Row │ PassengerId_Float32  Survived_Float32  Pclass_Float32  SibSp_Float32  Parch_Float32  Fare_Float32
     │ Float32              Float32           Float32         Float32        Float32        Float32
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │                 1.0               0.0             3.0            1.0            0.0        7.25
   2 │                 2.0               1.0             1.0            1.0            0.0       71.2833
   3 │                 3.0               1.0             3.0            0.0            0.0        7.925
   4 │                 4.0               1.0             1.0            1.0            0.0       53.1
  ⋮  │          ⋮                  ⋮                ⋮               ⋮              ⋮             ⋮
 889 │               889.0               0.0             3.0            1.0            2.0       23.45
 890 │               890.0               1.0             1.0            0.0            0.0       30.0
 891 │               891.0               0.0             3.0            0.0            0.0        7.75
                                                                                         884 rows omitted

Transformation of sink column names

julia> @select(df, lowercase.($Real) .* "_32" = Float32($Real))
891×6 DataFrame
 Row │ passengerid_32  survived_32  pclass_32  sibsp_32  parch_32  fare_32
     │ Float32         Float32      Float32    Float32   Float32   Float32
─────┼─────────────────────────────────────────────────────────────────────
   1 │            1.0          0.0        3.0       1.0       0.0   7.25
   2 │            2.0          1.0        1.0       1.0       0.0  71.2833
   3 │            3.0          1.0        3.0       0.0       0.0   7.925
   4 │            4.0          1.0        1.0       1.0       0.0  53.1
  ⋮  │       ⋮              ⋮           ⋮         ⋮         ⋮         ⋮
 889 │          889.0          0.0        3.0       1.0       2.0  23.45
 890 │          890.0          1.0        1.0       0.0       0.0  30.0
 891 │          891.0          0.0        3.0       0.0       0.0   7.75
                                                           884 rows omitted

Multi-dimensional broadcasting

julia> @select(df, ["a" "c"; "b" "d"] = $[:Survived, :Pclass] * $(permutedims([:Survived, :Pclass])))
891×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     0      0      0      9
   2 │     1      1      1      1
   3 │     1      3      3      9
   4 │     1      1      1      1
  ⋮  │   ⋮      ⋮      ⋮      ⋮
 889 │     0      0      0      9
 890 │     1      1      1      1
 891 │     0      0      0      9
                  884 rows omitted

RFC part

There are two things I'm currently thinking about. One is that $(All()) is not the most concise syntax, and theoretically the macro could treat All() as a column identifier, because that's DataFrames syntax and shouldn't mean anything else. So should All(), Between(), etc. be special-cased so that they don't need $?

The other thing is that I would like an option to broadcast parts of the functional expression itself as well. Currently, only one function is broadcast across all column combinations. But let's say you wanted to add increasing numbers to the first 100 columns. My idea is to use a @b flag macro to signal that a part of the expression should be taken out and broadcast together with the rest. Something like this:

@select(df, $(1:100) + @b(1:100))

# would be equivalent to

select(df, names(df, 1:100) .=> broadcast(i -> ByRow(x -> x + i), 1:100))

It maybe looks a bit cryptic at first, but it could be quite powerful. If you have an opinion, I'd be interested in hearing it so I can make a good decision.
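For anyone who wants to poke at the proposed desugaring without a data frame, here is a small sketch in plain Julia of its core: one closure per broadcast value, paired elementwise with the columns. ByRow is left out so that it runs without DataFrames, and the column names :c1, :c2, :c3 are hypothetical stand-ins for names(df, 1:100).

```julia
# One closure per value of i, built the same way as in the desugared form.
adders = broadcast(i -> (x -> x + i), 1:3)

# Pair each (hypothetical) column name with its own closure.
specs = [:c1, :c2, :c3] .=> adders

for (col, f) in specs
    println(col, " => f(10) = ", f(10))
end
```

Each column ends up with a different function, which is the difference from the current behavior where a single function is shared by all column combinations.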
