Hi all,
I’m excited to announce a new release of DataFramesMeta.jl.
This release adds one major new feature: the ability to use AsTable on the right-hand side of transformations. This makes it easy to work with many columns at once programmatically. In particular, it allows one to emulate Stata’s rowmean function.
julia> using DataFramesMeta, Statistics;
julia> df = DataFrame(rand(10, 100), :auto); # A wide data frame
julia> @rselect df :row_mean = mean(AsTable(:))
10×1 DataFrame
 Row │ row_mean
     │ Float64
─────┼──────────
   1 │ 0.495727
   2 │ 0.478012
   3 │ 0.449286
   4 │ 0.457304
   5 │ 0.508363
   6 │ 0.470989
   7 │ 0.49183
   8 │ 0.450141
   9 │ 0.489021
  10 │ 0.484617
Here AsTable works just the same as AsTable in DataFrames.jl. Behind the scenes in a transform call, we pass a NamedTuple of vectors (or in the row-wise case, a plain old NamedTuple) to the underlying function.
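To make the two cases concrete, here is a minimal sketch in plain Julia, without DataFramesMeta.jl, of what the underlying function receives. The table `cols` and its column names are made up for illustration, and the loop that builds the rows by hand only mimics the idea; it is not how DataFrames.jl implements it.

```julia
using Statistics

# Hypothetical three-column table with two rows, stored column-wise.
cols = (a = [1.0, 2.0], b = [3.0, 4.0], c = [5.0, 9.0])

# Column-wise case: the function sees the whole NamedTuple of vectors.
col_sums = map(sum, cols)

# Row-wise case: the function sees one plain NamedTuple per row.
rows = [map(col -> col[i], cols) for i in 1:2]

# mean iterates over the values of each NamedTuple, giving one number per row.
row_means = [mean(row) for row in rows]
```

In the row-wise case each call to mean works on a small NamedTuple with one entry per column, which is why a 100-column table would ordinarily mean a 100-element NamedTuple.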
In the above example, it would appear that I take the mean of a 100-element NamedTuple, which normally carries with it large compilation costs. But thanks to great work by @bkamins and @nalimilan, DataFrames.jl uses a faster path which never materializes the named tuples (see #2869 for more details). Thanks to Julia’s modularity, DataFramesMeta.jl benefits from this excellent work.
In the future, I plan to add a @collect macro-flag (similar to @passmissing and @byrow) to let end-users take advantage of this fast path themselves.
There are additional compilation improvements in this release. For example, :y = f(g(:x)) used to expand to an anonymous function, meaning that writing the same transformation twice in separate places would incur a compilation cost each time. Now, however, :y = f(g(:x)) gets expanded to :y = (f ∘ g)(:x), whose compilation is re-used.
This will make DataFramesMeta.jl feel more snappy with long @chains of operations.
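The reason composition helps can be sketched in plain Julia. Each anonymous function literal gets its own type, so code specialized for one is not reused for another, while f ∘ g has the same type wherever it is written. The functions f and g below are placeholders chosen for illustration, not anything from DataFramesMeta.jl.

```julia
# Placeholder functions standing in for the user's f and g.
f(x) = x .+ 1
g(x) = x .* 2

# Two textually identical anonymous functions are still two distinct types.
anon1 = x -> f(g(x))
anon2 = x -> f(g(x))
same_anon_type = typeof(anon1) == typeof(anon2)

# Composition yields the same ComposedFunction type every time it is written.
comp1 = f ∘ g
comp2 = f ∘ g
same_comp_type = typeof(comp1) == typeof(comp2)

# Both forms compute the same result.
result = comp1([1, 2])
```

Because Julia compiles specialized code per argument type, passing the same ComposedFunction type a second time can hit the already-compiled method, whereas a fresh anonymous function forces new compilation.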
Setup:
julia> using DataFrames, DataFramesMeta, Statistics  # Statistics provides mean and std
julia> df = DataFrame(x = [1, 2]);
julia> function inner(x)
           t = (x .- mean(x) .+ std(x)) .^ 2
           t ./ t[1]
       end;
julia> function outer(x)
           @. (x + 1) * 100 + 60
       end;
julia> @select df :y = outer(inner(:x)); # TTFP compilation
Before:
julia> @time @select df :y = outer(inner(:x)); # First try
0.024464 seconds (17.83 k allocations: 1.045 MiB, 97.90% compilation time)
julia> @time @select df :y = outer(inner(:x)); # Second try
0.025492 seconds (17.82 k allocations: 1.041 MiB, 97.88% compilation time)
After:
julia> @select df :y = outer(inner(:x)); # TTFP compilation
julia> @time @select df :y = outer(inner(:x)); # First try
0.000110 seconds (121 allocations: 6.531 KiB)
julia> @time @select df :y = outer(inner(:x)); # Second try
0.000107 seconds (121 allocations: 6.531 KiB)
And of course plenty of docs fixes. Thank you to everyone who helped out!
See the News.md here.
In the next release we will add:
- A @collect macro-flag for even faster row-wise operations
- Keyword arguments. I’ve been procrastinating on finishing up the PR for it. If you would like to help, the PR is here; please let me know!
I think we are getting closer to a 1.0 release.