[ANN] DataFramesMeta 0.8.0 release

I am happy to announce the 0.8.0 release of DataFramesMeta.

This comes shortly after the 0.7.0 release. We are working quickly and implementing breaking changes in pursuit of a 1.0 release.

The changes are

  • @where has been deprecated in favor of @subset and @subset!. This makes DataFramesMeta.jl more consistent with DataFrames.jl, which provides subset and subset!.
julia> df = DataFrame(a = [1, 2], b = [3, 4])
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> @subset df begin 
           :a .<= 1
           :b .== 3
       end
1×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3

julia> @subset! df begin 
           :a .<= 1
           :b .== 3
       end;

julia> df
1×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
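
For comparison, here is a minimal sketch of the same filter written directly with DataFrames.jl's subset and subset!. I am assuming each condition maps onto a column => function pair as below; it is meant to show the parallel in naming, not the exact code the macros generate.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# Each `column => function` pair receives the whole column and must
# return a Bool vector; a row is kept only if every condition is true.
kept = subset(df, :a => (a -> a .<= 1), :b => (b -> b .== 3))

# In-place variant: work on a copy so `df` itself is left untouched.
df2 = copy(df)
subset!(df2, :a => (a -> a .<= 1), :b => (b -> b .== 3))
```

Here kept is a new data frame, while subset! drops the non-matching rows from df2 itself.
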
  • In transformation macros (@select, @select!, @transform, @transform!, @by, and @combine), the left-hand side of an assignment, i.e. the new column being created, must now be a Symbol. Write :y = f(:x) rather than y = f(:x).

    This change makes the LHS and RHS of a transformation more consistent. Visually, any time you now see :x, it refers to a column.

julia> df = DataFrame(a = [1, 2], b = [3, 4])
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> @transform df :c = :a + :b
2×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      3      4
   2 │     2      4      6

julia> @select df :x = 100
2×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │   100
   2 │   100

julia> @combine df :z = first(:a)
1×1 DataFrame
 Row │ z     
     │ Int64 
─────┼───────
   1 │     1

The old syntax still works for now, and we will allow a deprecation period before removing it.

  • Finally, and this is very niche, we have deprecated returning a Tables.jl-compatible object from @combine and @by without an explicit AsTable on the left-hand side. Previously, you could write:
julia> @combine df (x = first(:a), y = last(:b))

This would expand to the call

combine(df, [:a, :b] => (...) => AsTable)

Now we require an explicit use of AsTable on the LHS to do this.

julia> @combine df cols(AsTable) = (x = first(:a), y = last(:b))
1×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4

cols has been the syntax for setting column names on the LHS programmatically for a while, so it should be intuitive that cols(AsTable) = ... creates the expression ... => AsTable.
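
For reference, here is a sketch of a plain DataFrames.jl call with the same shape as the expression above. The anonymous function is my hand-written guess at what the elided (...) stands for, so treat it as an illustration; the code DataFramesMeta actually generates may look different.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# The function receives columns `a` and `b` and returns a NamedTuple.
# `=> AsTable` tells DataFrames.jl to spread the NamedTuple's fields
# into separate output columns named `x` and `y`.
result = combine(df, [:a, :b] => ((a, b) -> (x = first(a), y = last(b))) => AsTable)
```

Without the trailing => AsTable, DataFrames.jl would not know that the returned NamedTuple is supposed to become several columns, which is why the destination now has to be spelled out.
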

The next planned changes are

  • Continue to make it easier to do row-wise operations, via @rtransform, @rselect, etc. which will be row-wise by default. See here.
  • Replace cols as the syntax to use column names programmatically with $, as in DataFrameMacros.jl. The PR is here.

Enjoy!

Great work, one of these days I’ll actually try using the package 🙂

Just one minor thing, when you say:

is the “not” a typo that should be “now”?

Thank you. DataFrames is such a key library for those of us coming over from using R for data analysis. However, there is a learning curve. I had gotten used to the concise syntax and speed of data.table in R. When I first began using DataFrames it seemed by comparison overly complicated and opaque. Many times, it seemed one was tripping over their own shoelaces to achieve tasks which were trivial in data.table. Nevertheless, there is an underlying elegance and cleanness to Julia which kept me going. DataFrames would benefit from something like this, and perhaps I will try to put together an equivalent in DataFrames. I have pored over B. Kamiński’s blog posts many times. Every package, blog post, and documentation aid that makes DataFrames easier to understand and use increases adoption.

Thanks for the catch. Fixed.

I am glad you are using it!

I think that we should be able to match data.table speeds on most small to medium datasets, but I don’t have any hard evidence to back that up, aside from the H2O data benchmarks for base DataFrames.jl.

I absolutely agree we need more tutorials. I recently ported a dplyr tutorial to DataFramesMeta.jl, but haven’t put the finishing touches on it because the API is going through such rapid changes.

Once we release 1.0, or at least get through the next round of changes, I really want to put effort into tutorials.

That said, please post questions on Discourse or Slack when you get stuck!
