[ANN] DataFrameMacros v0.2

With DataFrames 1.3, it’s now possible to mutate subsets of DataFrames. The normal approach looks like this:

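using DataFrames

df = DataFrame(x = 1:5, y = [2, 4, 6, missing, 10])
# view of the rows where y is missing; mutating it writes through to df
sdf = subset(df, :y => ByRow(ismissing), view = true)
transform!(sdf, :x => ByRow(x -> 2x) => :y)
df  # row 4 now has y == 8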

Because this is a multi-step approach, it doesn’t work well with piping / Chain.jl, so a dedicated macro solution seemed natural for DataFrameMacros.jl v0.2. After iterating back and forth on this, I settled on an optional @subset argument to @transform! and @select!. (The @subset expression can’t be executed on its own without a DataFrame argument, so this form only works within @transform! and @select!.)

Example

julia> df = DataFrame(x = 1:5, y = [2, 4, 6, missing, 10])
5×2 DataFrame
 Row │ x      y       
     │ Int64  Int64?  
─────┼────────────────
   1 │     1        2
   2 │     2        4
   3 │     3        6
   4 │     4  missing 
   5 │     5       10

julia> @transform!(df, @subset(ismissing(:y)), :y = 2 * :x)
5×2 DataFrame
 Row │ x      y      
     │ Int64  Int64? 
─────┼───────────────
   1 │     1       2
   2 │     2       4
   3 │     3       6
   4 │     4       8
   5 │     5      10

julia> @transform!(df, @subset(:x >= 3), :z = :y + :x)
5×3 DataFrame
 Row │ x      y       z       
     │ Int64  Int64?  Int64?  
─────┼────────────────────────
   1 │     1       2  missing 
   2 │     2       4  missing 
   3 │     3       6        9
   4 │     4       8       12
   5 │     5      10       15

# the flag macros like `@c` for column-wise mode also work as usual

julia> @transform!(df, @subset(@c :x .< sum(:x) / length(:x)), :y = 0)
5×3 DataFrame
 Row │ x      y       z       
     │ Int64  Int64?  Int64?  
─────┼────────────────────────
   1 │     1       0  missing 
   2 │     2       0  missing 
   3 │     3       6        9
   4 │     4       8       12
   5 │     5      10       15

Can you pass multiple conditions inside @subset? e.g. @subset(:x > 5, :y < 6)?
And/or are multiple @subset expressions allowed or only one is allowed?

Yes, multiple conditions are fine, with the same rules as for the normal @subset macro. That’s why I chose this form, so it’s hopefully intuitive what’s allowed. Multiple @subset macros are currently not allowed; I didn’t think that option would help much.
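
A minimal sketch of the multi-condition form, assuming the conditions combine with AND as they do in DataFrames.jl's subset. The data and the comparison against plain DataFrames.jl are made up for illustration and are not taken from the package docs:

```julia
using DataFrames, DataFrameMacros

df = DataFrame(x = 1:5, y = [2, 4, 6, 8, 10])
df_plain = copy(df)

# macro form: both conditions go into a single @subset
@transform!(df, @subset(:x > 1, :y < 10), :y = 0)

# plain DataFrames.jl counterpart: two conditions passed to subset,
# then the view is mutated so the change writes through to df_plain
sdf = subset(df_plain, :x => ByRow(>(1)), :y => ByRow(<(10)); view = true)
transform!(sdf, :x => ByRow(_ -> 0) => :y)
```

If the assumption holds, both df and df_plain end up with y set to 0 only in the rows that satisfy both conditions.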

Yes - if multiple conditions are allowed then a single @subset makes sense. How is @subset applied if you pass a GroupedDataFrame to transform! etc.?

It’s applied with ungroup = false so that the transform! call also acts on groups afterwards. Then the original grouped dataframe is returned so the result is still grouped.

Ah - now I see it in the docstring. It is a bit inconsistent, as @transform! without @subset returns the data frame underlying the GroupedDataFrame.
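
For reference, the plain DataFrames.jl return values behind this comparison can be checked directly. This is a small sketch with made-up data; it uses only transform! and its ungroup keyword, and it assumes the macro without @subset forwards to transform! with the default ungroup = true:

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = 1:3)
gdf = groupby(df, :g)

# default ungroup = true: the parent data frame comes back
r1 = transform!(gdf, :x => sum => :s)

# ungroup = false: the GroupedDataFrame itself comes back
r2 = transform!(gdf, :x => maximum => :m; ungroup = false)

r1 === df                  # true
r2 isa GroupedDataFrame    # true
```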

Why do you prefer to have a different behavior here?

If you wanted to ensure consistency, I think the solution would be to add a single check at the beginning of the macro for whether an AbstractDataFrame or a GroupedDataFrame is passed, and just store there what should be returned at the end.

Hm yeah, I was unsure about this aspect actually. The difference is that I’m not returning the result of transform!, because that wouldn’t have all the rows. So if I do the ungrouping manually, I would have to divert the ungroup = false option that you could pass to the macro so that it disables my own ungrouping; it wouldn’t do anything in the transform! call itself. That also didn’t seem super clear to me, but maybe it’s the better choice?

I think it could work as follows (pseudocode):

We are in @transform! with @subset path:

  1. if an AbstractDataFrame is passed, store it in some temporary variable tmp
  2. if a GroupedDataFrame is passed then:
    • if ungroup=true store the parent of the passed GroupedDataFrame in some temporary variable tmp
    • if ungroup=false store the passed GroupedDataFrame in some temporary variable tmp
  3. perform all the operations you perform (just making sure that proper operations are performed - the returned value does not matter)
  4. return tmp

This works because in transform! and select! we know that in the end we should return the original object we were passed.
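
The steps above could be sketched in plain DataFrames.jl roughly like this. transform_subset! is a hypothetical helper written for illustration only; it is not part of DataFrames.jl and is not how DataFrameMacros.jl is actually implemented. It follows the DataFrames.jl convention that ungroup = true returns the parent data frame:

```julia
using DataFrames

# Hypothetical helper: mutate only the rows matching `cond`,
# but always return the object the caller should get back.
function transform_subset!(df::AbstractDataFrame, cond, args...)
    tmp = df                                    # step 1
    sdf = subset(df, cond; view = true)
    transform!(sdf, args...)                    # step 3, result ignored
    return tmp                                  # step 4
end

function transform_subset!(gdf::GroupedDataFrame, cond, args...;
                           ungroup::Bool = true)
    tmp = ungroup ? parent(gdf) : gdf           # step 2
    # keep the grouping on the view so transform! still acts per group
    sgdf = subset(gdf, cond; view = true, ungroup = false)
    transform!(sgdf, args...)                   # step 3, result ignored
    return tmp                                  # step 4
end

df = DataFrame(x = 1:5, y = [2, 4, 6, missing, 10])
res = transform_subset!(df, :y => ByRow(ismissing), :x => ByRow(x -> 2x) => :y)

df2 = DataFrame(g = [1, 1, 2, 2], x = 1:4, y = [10, missing, 30, missing])
gdf = groupby(df2, :g)
res_parent = transform_subset!(gdf, :y => ByRow(ismissing),
                               :x => ByRow(x -> 2x) => :y)
res_grouped = transform_subset!(gdf, :x => ByRow(>(2)),
                                :x => ByRow(x -> -x) => :y; ungroup = false)
```

Here res is df itself, res_parent is the parent data frame df2, and res_grouped is the original gdf, so the returned object depends only on the input and on ungroup, not on whether a subset was applied.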

Ok I will consider changing the behavior to this. The tag didn’t go through yet anyway, so there’s no harm.

Sure - pick whatever behavior you think most appropriate.

My reasoning is that using @subset should not affect the returned object (except of course the fact that it affects the computation).
