DataFramesMeta.jl and the state of the DataFrames ecosystem

I’ve found the existing macros/functions somewhat cumbersome for my purposes – or perhaps for my habits, which come from years of using Stata. Two things in particular:

  • The split-apply-combine operations automatically aggregate DataFrames to the group level when the function returns scalars. That does not work well with an approach where you first construct all variables at a more disaggregated level and only then aggregate up. In my experience, this approach is less prone to human error than the alternative, which is to construct aggregates using `by` operations and then possibly having to join some of them together.
  • Making split-apply-combine operations work the way you want when the data contain missings is non-trivial. It usually involves a mixture of `skipmissing` and `completecases`, and by the time I’ve solved it, I long for Stata, which by default uses only the cases where all variables involved are nonmissing. Of course that’s not always what you want, but from that default it’s pretty straightforward to get what you want.
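
To make the two points concrete, here is a rough sketch of the "build everything at the row level, aggregate last" workflow. It assumes a recent DataFrames.jl in which `transform` on a grouped DataFrame keeps one row per observation (older versions built around `by` behave differently), and the column names and data are made up purely for illustration:

```julia
using DataFrames, Statistics

# Hypothetical toy data: two firms, some missing values.
df = DataFrame(firm  = [1, 1, 2, 2, 2],
               sales = [10.0, missing, 4.0, 6.0, 8.0],
               cost  = [5.0, 3.0, missing, 2.0, 4.0])

# Stata-like casewise deletion: keep only rows where both variables are nonmissing.
complete = dropmissing(df, [:sales, :cost])

# Construct the group mean at the row level (similar in spirit to Stata's
# `egen ..., by(firm)`); the result still has one row per observation.
rowlevel = transform(groupby(complete, :firm), :sales => mean => :sales_mean)

# Aggregate to the firm level only as the final step.
firmlevel = combine(groupby(rowlevel, :firm), :sales_mean => first => :sales_mean)
```

Whether casewise deletion up front is the right choice depends on the analysis; the alternative is to keep all rows and wrap individual columns in `skipmissing` inside each operation.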

Again, I don’t want to blame the design of the split-apply-combine macros in these packages; it’s just that I seem to be very slow when I use them.

So my goal is to implement something like Stata’s syntax for the basic data wrangling operations, to help me solve my two-language problem (data cleaning in Stata, model estimation in Julia). Julia’s metaprogramming makes that relatively easy, and it’s also a good exercise for me. So far, beyond the functions in DataFrames.jl, I’ve mostly been using `@with` from DataFramesMeta, and I would be happy if it continued to be supported (or was possibly implemented within DataFrames itself).
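
For readers who haven’t used it, a minimal sketch of the kind of `@with` usage meant here, assuming DataFramesMeta is installed; the data frame and column names are invented for illustration:

```julia
using DataFrames, DataFramesMeta, Statistics

# Hypothetical toy data with one missing value.
df = DataFrame(x = [1.0, 2.0, missing], y = [2.0, 4.0, 6.0])

# Inside @with, :x and :y refer to the columns of df,
# so expressions read almost like Stata variable references.
xbar = @with(df, mean(skipmissing(:x)))
ysum = @with(df, sum(:y))
```

The appeal is that the expression is ordinary Julia code, with symbols standing in for columns, which is a natural building block for a more Stata-like syntax.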
