[ANN-RFC] DFMacros.jl

Note that many table types already overload common Julia functions such as map. This single API makes it convenient and consistent to work with them. Add DataPipes.jl (disclaimer: my package) on top to reduce syntactic boilerplate, and it turns out that basically all operations from the first post are easy to write for a wide variety of tables!
My translation of those examples to TypedTables.Table is below. It also works as-is with other implementations such as Tables.rowtable (vector of namedtuples). As a bonus, all operations are easily “pipeable” (:

julia> using Random
julia> using SplitApplyCombine
julia> using Tables
julia> using TypedTables
julia> using DataPipes
julia> using Missings

julia> table = Table(
           id = shuffle(1:5),
           group = rand('a':'b', 5),
           weight_kg = randn(5) .* 5 .+ 60,
           height_cm = randn(5) .* 10 .+ 170
       )
Table with 4 columns and 5 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 1   a      55.1194    150.885
 2 │ 4   b      55.2233    164.44
 3 │ 3   a      51.9789    178.343
 4 │ 5   a      59.483     166.938
 5 │ 2   b      53.3129    179.829

julia> @p table |> map((height_m = _.height_cm / 100,))
Table with 1 column and 5 rows:
     height_m
   ┌─────────
 1 │ 1.50885
 2 │ 1.6444
 3 │ 1.78343
 4 │ 1.66938
 5 │ 1.79829

julia> @p table |> map((w = _.weight_kg, h = _.height_cm))
Table with 2 columns and 5 rows:
     w        h
   ┌─────────────────
 1 │ 55.1194  150.885
 2 │ 55.2233  164.44
 3 │ 51.9789  178.343
 4 │ 59.483   166.938
 5 │ 53.3129  179.829

julia> @p table |> mutate(weight_g = _.weight_kg * 1000)
Table with 5 columns and 5 rows:
     id  group  weight_kg  height_cm  weight_g
   ┌──────────────────────────────────────────
 1 │ 1   a      55.1194    150.885    55119.4
 2 │ 4   b      55.2233    164.44     55223.3
 3 │ 3   a      51.9789    178.343    51978.9
 4 │ 5   a      59.483     166.938    59483.0
 5 │ 2   b      53.3129    179.829    53312.9

julia> @p table |> mutate(BMI = _.weight_kg / (_.height_cm / 100)^2)
Table with 5 columns and 5 rows:
     id  group  weight_kg  height_cm  BMI
   ┌─────────────────────────────────────────
 1 │ 1   a      55.1194    150.885    24.2111
 2 │ 4   b      55.2233    164.44     20.4223
 3 │ 3   a      51.9789    178.343    16.3424
 4 │ 5   a      59.483     166.938    21.3444
 5 │ 2   b      53.3129    179.829    16.486

julia> g = @p table |> group(iseven(_.id))
2-element Dictionaries.Dictionary{Bool, Table{NamedTuple{(:id, :group, :weight_kg, :height_cm), Tuple{Int64, Char, Float64, Float64}}, 1, NamedTuple{(:id, :group, :weight_kg, :height_cm), Tuple{Vector{Int64}, Vector{Char}, Vector{Float64}, Vector{Float64}}}}}
 false │ Table with 4 columns and 3 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 1   a      55.1194    150.885
 2 │ 3   a      51.9789    178.343
 3 │ 5   a      59.483     166.938
  true │ Table with 4 columns and 2 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 4   b      55.2233    164.44
 2 │ 2   b      53.3129    179.829

julia> @p g |> map(@p(sum(_.weight_kg, _1)))
2-element Dictionaries.Dictionary{Bool, Float64}
 false │ 166.58126233561455
  true │ 108.53621402076169

julia> @p table |> sort(by=-sqrt(_.height_cm))
Table with 4 columns and 5 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 2   b      53.3129    179.829
 2 │ 3   a      51.9789    178.343
 3 │ 5   a      59.483     166.938
 4 │ 4   b      55.2233    164.44
 5 │ 1   a      55.1194    150.885
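
For comparison, here is a minimal sketch of the same kinds of operations on a plain vector of NamedTuples using only Base Julia, with no DataPipes or TypedTables. The sample rows and the variable names (rows, heights, with_bmi, by_height, totals) are made up for illustration and are not the data shown above; the Dict-based grouping is a simple stand-in for SplitApplyCombine's group.

```julia
# Illustrative data: a row table is just a Vector of NamedTuples.
rows = [
    (id = 1, group = 'a', weight_kg = 55.0, height_cm = 150.0),
    (id = 2, group = 'b', weight_kg = 53.0, height_cm = 180.0),
    (id = 3, group = 'a', weight_kg = 52.0, height_cm = 178.0),
]

# Select and rename: build a new NamedTuple per row.
heights = map(r -> (height_m = r.height_cm / 100,), rows)

# Add a derived column: merge a new field into each existing row.
with_bmi = map(r -> merge(r, (BMI = r.weight_kg / (r.height_cm / 100)^2,)), rows)

# Sort by descending height.
by_height = sort(rows, by = r -> -r.height_cm)

# Group by id parity and sum the weights per group.
totals = Dict{Bool, Float64}()
for r in rows
    key = iseven(r.id)
    totals[key] = get(totals, key, 0.0) + r.weight_kg
end
```

The pipe versions above say the same thing with less lambda boilerplate, which is the main thing DataPipes adds here.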


DF also sometimes stands for “degrees of freedom”.

I think DataFrame[s]Macros is different enough from DataFramesMeta. At any rate, it’s likely that other DataFrame___.jl packages will be added to the general registry in the future.

Regarding whether the [s] should be included… I suppose it makes more sense grammatically to omit the [s]. :slight_smile:


I don’t think it’s a great idea to have two packages DataFramesMeta and DataFramesMacros that both export @transform, @select etc. I think the names are too similar.

Given that the key difference between the two is that in DFMacros, operations are by row by default, perhaps some reference to this could be in the name for now?

Another alternative would be to rename some macros. As far as I know, Stata-based puns are open for the taking, i.e. @generate to make new columns, @keep to keep columns, etc. This would lead to less confusion among new users.

These are pretty standard names within other DataFrames ecosystems (R, Python) and IIUC these macros are doing the same things, but with different defaults / implementation. So having the same names seems correct to me.


Yeah, I don’t think we would want to break the direct correspondences transform -> @transform, select -> @select, etc.


What about having two sets of macros: ‘@select’ and ‘@Cselect’ for by row and by column respectively?


Then you can’t pass different modes to one transform call. I considered that and found it too inflexible.
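
To illustrate what mixing modes in one call means, here is a small sketch in plain DataFrames.jl (not DFMacros syntax): a single transform call with one by-row operation, wrapped in ByRow, and one whole-column operation. The column names a_plus_one and b_centered are made up for the example.

```julia
using DataFrames
using Statistics: mean

df = DataFrame(a = 1:3, b = 4:6)

result = transform(df,
    # By row: the function sees one element of :a at a time.
    :a => ByRow(x -> x + 1) => :a_plus_one,
    # By column: the function sees the whole :b vector at once.
    :b => (col -> col .- mean(col)) => :b_centered,
)
```

With separate row-wise and column-wise macro families, these two operations would have to go into two separate calls.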


I still have not registered the package due to the name issue, but I’ve added block syntax and an auto-table mode which might be useful for some scenarios. Here’s an excerpt from the README again:

@t flag macro for automatic AsTable

To use AsTable as a target, you usually have to construct a NamedTuple in the passed function.
You can avoid both passing AsTable explicitly and constructing the NamedTuple by using the @t flag macro.
All expressions of the type :symbol = expression are collected, the :symbols are replaced with anonymous variables, and these variables are collected in a NamedTuple as the return value automatically.

df = DataFrame(a = 1:3, b = 4:6)
df2 = @transform df @t begin
    x = :a + :b
    :y = x * 2
    :z = x + 4
end
3×4 DataFrame
 Row │ a      b      y      z
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4     10      9
   2 │     2      5     14     11
   3 │     3      6     18     13
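
As a rough point of comparison, the same result can be written by hand in plain DataFrames.jl by returning a NamedTuple from a ByRow function and using AsTable as the target. This is only a sketch of the idea, not the code the macro actually generates.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# The function is applied to each row; the fields of the returned
# NamedTuple become the new columns because the target is AsTable.
df2 = transform(df, [:a, :b] => ByRow((a, b) -> begin
    x = a + b
    (y = x * 2, z = x + 4)
end) => AsTable)
```

The @t block removes exactly the two pieces of boilerplate visible here: the explicit AsTable target and the hand-built NamedTuple.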

Opinion: You could add Chain.jl as a dependency of DataFrameMacros.jl and re-export the @chain macro.

What do you think, @jules? Would it be a good idea?


Wait, why does a DiffEqFlux macros package have a bunch of tables stuff in it?


I’ve decided to register as DataFrameMacros. DF is not clear enough, and that’s just what it is: macros for DataFrames.
