DataFramesMeta.jl version 0.11.0 Release

Hi all,

I’m excited to announce a new release of DataFramesMeta.jl.

This release adds one major new feature, which is the ability to use AsTable on the right-hand side of transformations. This makes it easy to work with many columns at once programmatically. In particular, it allows one to emulate Stata’s rowmean function.

julia> using DataFramesMeta, Statistics;

julia> df = DataFrame(rand(10, 100), :auto); # A wide data frame

julia> @rselect df :row_mean = mean(AsTable(:))
10×1 DataFrame
 Row │ row_mean 
     │ Float64  
─────┼──────────
   1 │ 0.495727
   2 │ 0.478012
   3 │ 0.449286
   4 │ 0.457304
   5 │ 0.508363
   6 │ 0.470989
   7 │ 0.49183
   8 │ 0.450141
   9 │ 0.489021
  10 │ 0.484617

Here AsTable works just the same as AsTable in DataFrames.jl. Behind the scenes in a transform call, we pass a NamedTuple of vectors (or in the row-wise case, a plain old NamedTuple) to the underlying function.
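To make the two argument shapes concrete, here is a small sketch that puts each DataFramesMeta.jl call next to what I would expect the corresponding DataFrames.jl call to be. The toy data frame, the column names `a` and `b`, and the result names are made up for illustration, and the pairing with the plain DataFrames.jl calls is my reading of the description above rather than the exact macro expansion.

```julia
using DataFrames, DataFramesMeta, Statistics

# Toy data, made up for this example (not from the release notes)
df = DataFrame(a = [1.0, 2.0], b = [3.0, 5.0])

# Row-wise: the function sees one NamedTuple per row, e.g. (a = 1.0, b = 3.0)
meta_rows = @rselect df :row_mean = mean(AsTable([:a, :b]))
base_rows = select(df, AsTable([:a, :b]) => ByRow(mean) => :row_mean)

# Column-wise: the function sees a NamedTuple of vectors,
# (a = [1.0, 2.0], b = [3.0, 5.0]), so `sum` adds the columns elementwise
meta_cols = @select df :total = sum(AsTable([:a, :b]))
base_cols = select(df, AsTable([:a, :b]) => sum => :total)
```

In both pairs the two calls should give the same column: row means of 2.0 and 3.5, and totals of 4.0 and 7.0.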

In the above example, it would appear that I take the mean of a 100-element NamedTuple, which normally carries large compilation costs. But thanks to great work by @bkamins and @nalimilan, DataFrames.jl uses a faster path which never materializes the named tuples; see #2869 for more details. Thanks to Julia’s modularity, DataFramesMeta.jl benefits from this excellent work.

In the future, I plan to add a @collect macro-flag (similar to @passmissing and @byrow) to let end-users take advantage of this fast path themselves.

There are additional compilation improvements in this release. For example, :y = f(g(:x)) used to expand to an anonymous function, meaning that writing the same transformation in two separate places would incur a compilation cost each time. Now, however, :y = f(g(:x)) gets expanded to :y = (f ∘ g)(:x), whose compilation is re-used.

This will make DataFramesMeta.jl feel more snappy with long @chains of operations.
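The reason composition helps can be seen in plain Julia, without any data frames. This is only a sketch: `f`, `g`, and the variable names are placeholders I made up. Every anonymous function literal gets its own type, so code specialized on one cannot be reused for another, while `f ∘ g` always has the same `ComposedFunction` type.

```julia
# Placeholder functions, made up for this illustration
f(x) = x .+ 1
g(x) = x .* 2

# Two textually identical anonymous functions still have distinct types,
# so anything specialized on `anon1` has to be compiled again for `anon2`
anon1 = x -> f(g(x))
anon2 = x -> f(g(x))
same_anon_type = typeof(anon1) == typeof(anon2)   # false

# Composing named functions yields the same type every time,
# so methods compiled for it once can be reused
comp1 = f ∘ g
comp2 = f ∘ g
same_comp_type = typeof(comp1) == typeof(comp2)   # true

# Both forms compute the same result
result = comp1([1, 2])                            # [3, 5]
```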

Setup:

julia> using DataFrames, DataFramesMeta, Statistics # Statistics provides mean and std

julia> df = DataFrame(x = [1, 2]);

julia> function inner(x)
           t = (x .- mean(x) .+ std(x)) .^2 
           t ./ t[1]
       end;

julia> function outer(x)
           @. (x + 1) * 100 + 60
       end;

julia> @select df :y = outer(inner(:x)); # TTFP compilation

Before:

julia> @time @select df :y = outer(inner(:x)); # First try
  0.024464 seconds (17.83 k allocations: 1.045 MiB, 97.90% compilation time)

julia> @time @select df :y = outer(inner(:x)); # Second try
  0.025492 seconds (17.82 k allocations: 1.041 MiB, 97.88% compilation time)

After:

julia> @select df :y = outer(inner(:x)); # TTFP compilation

julia> @time @select df :y = outer(inner(:x)); # First try
  0.000110 seconds (121 allocations: 6.531 KiB)

julia> @time @select df :y = outer(inner(:x)); # Second try
  0.000107 seconds (121 allocations: 6.531 KiB)

And of course plenty of docs fixes. Thank you to everyone who helped out!

See the News.md here.

In the next release we will add:

  1. A @collect macro-flag for even faster row-wise operations
  2. Keyword arguments. I’ve been procrastinating on finishing up the PR for it. The PR is here. Please let me know if you would like to help!

I think we are getting closer to a 1.0 release.
