[ANN-RFC] DFMacros.jl

aplavin · June 17, 2021, 10:14am

Note that many table types already overload common Julia functions such as map. This single API makes it very convenient and consistent to work with them. Add DataPipes.jl (disclaimer: my package) on top to reduce syntactic boilerplate, and it turns out that basically all operations from the first post are easy to write for a wide variety of tables!
My translation of those examples to TypedTables.Table is below. It also works as-is with other implementations such as Tables.rowtable (vector of namedtuples). As a bonus, all operations are easily “pipeable” (:

julia> using Random
julia> using SplitApplyCombine
julia> using Tables
julia> using TypedTables
julia> using DataPipes
julia> using Missings

julia> table = Table(
           id = shuffle(1:5),
           group = rand('a':'b', 5),
           weight_kg = randn(5) .* 5 .+ 60,
           height_cm = randn(5) .* 10 .+ 170
       )
Table with 4 columns and 5 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 1   a      55.1194    150.885
 2 │ 4   b      55.2233    164.44
 3 │ 3   a      51.9789    178.343
 4 │ 5   a      59.483     166.938
 5 │ 2   b      53.3129    179.829

julia> @p table |> map((height_m = _.height_cm / 100,))
Table with 1 column and 5 rows:
     height_m
   ┌─────────
 1 │ 1.50885
 2 │ 1.6444
 3 │ 1.78343
 4 │ 1.66938
 5 │ 1.79829

julia> @p table |> map((w = _.weight_kg, h = _.height_cm))
Table with 2 columns and 5 rows:
     w        h
   ┌─────────────────
 1 │ 55.1194  150.885
 2 │ 55.2233  164.44
 3 │ 51.9789  178.343
 4 │ 59.483   166.938
 5 │ 53.3129  179.829

julia> @p table |> mutate(weight_g = _.weight_kg * 1000)
Table with 5 columns and 5 rows:
     id  group  weight_kg  height_cm  weight_g
   ┌──────────────────────────────────────────
 1 │ 1   a      55.1194    150.885    55119.4
 2 │ 4   b      55.2233    164.44     55223.3
 3 │ 3   a      51.9789    178.343    51978.9
 4 │ 5   a      59.483     166.938    59483.0
 5 │ 2   b      53.3129    179.829    53312.9

julia> @p table |> mutate(BMI = _.weight_kg / (_.height_cm / 100)^2)
Table with 5 columns and 5 rows:
     id  group  weight_kg  height_cm  BMI
   ┌─────────────────────────────────────────
 1 │ 1   a      55.1194    150.885    24.2111
 2 │ 4   b      55.2233    164.44     20.4223
 3 │ 3   a      51.9789    178.343    16.3424
 4 │ 5   a      59.483     166.938    21.3444
 5 │ 2   b      53.3129    179.829    16.486

julia> g = @p table |> group(iseven(_.id))
2-element Dictionaries.Dictionary{Bool, Table{NamedTuple{(:id, :group, :weight_kg, :height_cm), Tuple{Int64, Char, Float64, Float64}}, 1, NamedTuple{(:id, :group, :weight_kg, :height_cm), Tuple{Vector{Int64}, Vector{Char}, Vector{Float64}, Vector{Float64}}}}}
 false │ Table with 4 columns and 3 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 1   a      55.1194    150.885
 2 │ 3   a      51.9789    178.343
 3 │ 5   a      59.483     166.938
  true │ Table with 4 columns and 2 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 4   b      55.2233    164.44
 2 │ 2   b      53.3129    179.829

julia> @p g |> map(@p(sum(_.weight_kg, _1)))
2-element Dictionaries.Dictionary{Bool, Float64}
 false │ 166.58126233561455
  true │ 108.53621402076169

julia> @p table |> sort(by=-sqrt(_.height_cm))
Table with 4 columns and 5 rows:
     id  group  weight_kg  height_cm
   ┌────────────────────────────────
 1 │ 2   b      53.3129    179.829
 2 │ 3   a      51.9789    178.343
 3 │ 5   a      59.483     166.938
 4 │ 4   b      55.2233    164.44
 5 │ 1   a      55.1194    150.885
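
For comparison, here is a minimal sketch of the same ideas on a plain vector of namedtuples using only Base Julia, with no DataPipes or TypedTables. The row values are copied from the rounded table display above, and the variable names (`rows`, `heights`, `with_bmi`, `tallest_first`, `even_weight`) are made up for illustration.

```julia
# A row table is just a Vector of NamedTuples; the values below are copied
# from the displayed (rounded) table above.
rows = [
    (id = 1, group = 'a', weight_kg = 55.1194, height_cm = 150.885),
    (id = 4, group = 'b', weight_kg = 55.2233, height_cm = 164.44),
    (id = 3, group = 'a', weight_kg = 51.9789, height_cm = 178.343),
    (id = 5, group = 'a', weight_kg = 59.483,  height_cm = 166.938),
    (id = 2, group = 'b', weight_kg = 53.3129, height_cm = 179.829),
]

# Select/transform: map each row to a new NamedTuple.
heights = map(r -> (height_m = r.height_cm / 100,), rows)

# Add a column: merge the new field into each existing row.
with_bmi = map(r -> merge(r, (BMI = r.weight_kg / (r.height_cm / 100)^2,)), rows)

# Sort by a function of each row (tallest first).
tallest_first = sort(rows; by = r -> -r.height_cm)

# Sum of weights for the rows with an even id: filter, then sum over a field.
even_weight = sum(r -> r.weight_kg, filter(r -> iseven(r.id), rows))
```

The `@p` forms above mostly save writing the `r -> ...` lambdas by hand; the underlying calls are the same `map` and `sort`.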

CameronBieganek · June 17, 2021, 12:27pm

DF also sometimes stands for “degrees of freedom”.

I think DataFrame[s]Macros is different enough from DataFramesMeta. At any rate, it’s likely that other DataFrame___.jl packages will be added to the general registry in the future.

Regarding whether the [s] should be included… I suppose it makes more sense grammatically to omit the [s].

pdeffebach · June 17, 2021, 12:48pm

I don’t think it’s a great idea to have two packages DataFramesMeta and DataFramesMacros that both export @transform, @select etc. I think the names are too similar.

Given that the key difference between the two is that in DFMacros, operations are by row by default, perhaps some reference to this could be in the name for now?

Another alternative would be to rename some macros. As far as I know, Stata-based puns are open for the taking, i.e. @generate to make new columns, @keep to keep columns, etc. This would lead to less confusion among new users.

Satvik · June 17, 2021, 2:00pm

These are pretty standard names within other DataFrames ecosystems (R, Python) and IIUC these macros are doing the same things, but with different defaults / implementation. So having the same names seems correct to me.

CameronBieganek · June 17, 2021, 2:35pm

Yeah, I don’t think we would want to break the direct correspondences transform -> @transform, select -> @select, etc.

ggggggggg · June 17, 2021, 2:39pm

What about having two sets of macros: ‘@select’ and ‘@Cselect’ for by row and by column, respectively?

jules · June 19, 2021, 2:07pm

Then you can’t pass different modes to one transform call. I considered that and found it too inflexible.
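
To make the by-row versus by-column distinction concrete, here is a small sketch on plain vectors using only Base Julia and the Statistics standard library. It does not use DFMacros; the names `a`, `b`, `by_row` and `by_column` are purely illustrative.

```julia
using Statistics  # standard library, for mean

a = [1, 2, 3]
b = [4, 5, 6]

# By row: the function sees one element from each column at a time.
by_row = map((x, y) -> x + y, a, b)

# By column: the expression sees the whole column at once, which is what
# aggregates such as mean need.
by_column = a .- mean(a)
```

A single transform call may want one expression of each kind, which is hard to express if the mode is fixed by the macro name.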

jules · June 19, 2021, 2:08pm

I have still not registered the package due to the name issue, but I’ve added block syntax and an auto-table mode which might be useful for some scenarios. Here’s an excerpt from the Readme again:

`@t` flag macro for automatic `AsTable`

To use AsTable as a target, you usually have to construct a NamedTuple in the passed function.
You can avoid both passing AsTable explicitly and constructing the NamedTuple by using the @t flag macro.
All expressions of the type :symbol = expression are collected, the :symbols are replaced with anonymous variables, and these variables are collected in a NamedTuple as the return value automatically.

df = DataFrame(a = 1:3, b = 4:6)
df2 = @transform df @t begin
    x = :a + :b
    :y = x * 2
    :z = x + 4
end

3×4 DataFrame
 Row │ a      b      y      z
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4     10      9
   2 │     2      5     14     11
   3 │     3      6     18     13
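
As a rough illustration of what the `@t` block above computes for each row, here is a hand-written counterpart in plain Julia. This is a sketch, not the macro's actual expansion; the function name `row_fn` and the variable `new_columns` are hypothetical.

```julia
# Hypothetical hand-written counterpart of the @t block: plain locals stay
# local, and only the :symbol assignments end up in the returned NamedTuple.
function row_fn(a, b)
    x = a + b      # ordinary local, not a new column
    y = x * 2      # was `:y = x * 2`
    z = x + 4      # was `:z = x + 4`
    return (y = y, z = z)
end

# Apply it row by row to the same inputs as the columns a = 1:3, b = 4:6.
new_columns = map(row_fn, 1:3, 4:6)
```

With DataFrames.jl loaded, a function like this should be usable as `transform(df, [:a, :b] => ByRow(row_fn) => AsTable)`, which is the explicit form the `@t` flag is meant to spare you from writing.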

eliascarv · June 19, 2021, 4:30pm

Opinion: You could add Chain.jl as a dependency of DataFrameMacros.jl and re-export the @chain macro.

What do you think, @jules? Would it be a good idea?

ChrisRackauckas · June 19, 2021, 4:43pm

Wait, why does a DiffEqFlux macros package have a bunch of tables stuff in it?

jules · June 19, 2021, 6:02pm

I’ve decided to register as DataFrameMacros. DF is not clear enough, and that’s just what it is: macros for DataFrames.
