DataFramesMeta conditionals

I often forget the dot when writing conditions. I wonder why we couldn’t write them without it and have DataFramesMeta add the dots automatically with the @. macro. Any thoughts?

So, rather than writing this:

@where(df, :x .> 1)

I could write just this:

@where(df, :x > 1)

I rather like the idea that the symbols are a simple drop in for the actual columns. It might save you a little bit of typing to auto-broadcast or whatever, but at the cost of making the overall semantics more confusing. Considering that people are free to put whatever function they want into the @where statements, including functions they define themselves, I’d think that complicating the semantics would be a bad idea.
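To make the "drop-in" reading concrete, here is a small sketch on a plain vector standing in for a column. It uses only Base Julia, not DataFramesMeta; the names `x`, `mask`, `kept` and `undotted_fails` are made up for illustration.

```julia
# A plain vector standing in for the column :x (illustrative data).
x = [0, 1, 2, 3]

# With the dot, `>` is broadcast over the column and yields one Bool per row.
mask = x .> 1
kept = x[mask]

# Without the dot, `>` is asked to compare a whole vector with a scalar.
# Base defines no such method, so the call throws a MethodError.
undotted_fails = try
    x > 1
    false
catch err
    err isa MethodError
end
```

If symbols are just the columns, the expression inside the macro behaves exactly like this ordinary Julia code, which is the semantics argued for above.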

See also the following issue: https://github.com/JuliaStats/DataFramesMeta.jl/issues/39

JuliaDBMeta distinguishes row-wise macros (@map, @where, @transform), where you are iterating through rows and symbols correspond to a given field, from column-wise macros (@with, @where_vec, @transform_vec), where symbols correspond to columns. You will often need dot broadcasting in combination with the latter.

Note that both versions are required, as for example one may want:

@where_vec(df, :a .> mean(:a))

which can’t be achieved row by row.
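As a rough illustration of the two modes, here is a sketch in plain Julia with no JuliaDBMeta involved. The toy table `rows` and the names `rowwise` and `colwise` are assumptions made up for the example.

```julia
using Statistics

# Hypothetical toy table stored as a vector of rows (NamedTuples).
rows = [(a = 1.0,), (a = 2.0,), (a = 6.0,)]

# Row-wise: the predicate sees one row at a time, so a scalar `>` is enough.
rowwise = filter(r -> r.a > 1, rows)

# Column-wise: the condition needs the whole column to compute the mean,
# and the comparison against that mean is broadcast with the dot.
a = [r.a for r in rows]
colwise = rows[a .> mean(a)]
```

The row-wise predicate never has access to the full column, which is why a condition involving mean(:a) has to be written in column space.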

I wonder whether DataFramesMeta could implement a similar strategy. I’m not sure how easy it is to implement row-wise macros efficiently due to type stability issues with DataFrames, but maybe there are ways around that.

In R dplyr, filter(a > mean(a)) works. It is much less verbose.

It seems to me that unless we have a @byrow! in play, we’re always in column space. I don’t see where the promotion of > to .> would cause confusion; I don’t know what someone would mean in this context with a non-broadcast >. It may be technically hard to achieve, but I support making it all @. if possible.

Addendum: This reminds me of the difference between mean and pmean, likewise max and pmax, min and pmin. That is a tricky place in R/dplyr: the p-versions are basically local (element-wise) max/min/mean calculations. So this nuance could be confused by the above syntax… maybe.
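For comparison, as far as I can tell the closest Julia analogue of that split is a reduction function versus a broadcast call. A minimal sketch with made-up vectors `a` and `b`:

```julia
a = [1, 5, 3]
b = [4, 2, 6]

# Reduction over one column (what R's max(a) does).
col_max = maximum(a)

# Element-wise across two columns (what R's pmax(a, b) does).
elementwise_max = max.(a, b)
elementwise_min = min.(a, b)
```

In Julia the dot already marks which of the two is meant, so automatically adding dots everywhere would blur exactly this distinction.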

The only differences I see between filter(a > mean(a)) and @where_vec(:a .> mean(:a)) (JuliaDBMeta, just like dplyr, has a curried version) are the use of symbols to refer to columns and of dot broadcasting for element-wise comparison. I’m really not sure how one could avoid symbols and just use variable names. Dot broadcasting is necessary because taking mean(a) means that a is a vector, so the comparison has to be element-wise.

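The example is in my view interesting because it shows a case where automatic dot broadcasting would not work with @where in DataFramesMeta: one would get v .> mean.(v), which is not the correct thing, since mean.(v) takes the mean of each element on its own and every element is then compared with itself.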
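This can be checked on a plain vector with Base's own @. macro. The sketch below assumes nothing about DataFramesMeta; `v` and the result names are illustrative. The `$` prefix is the documented way to exclude a call from @. dotting.

```julia
using Statistics

v = [1.0, 2.0, 6.0]

# Intended condition: compare each element with the mean of the whole column.
correct = v .> mean(v)

# Blanket dotting: equivalent to v .> mean.(v). The mean of a single number
# is that number, so each element ends up compared with itself.
auto_dotted = @. v > mean(v)

# `$` splices mean(v) in as an ordinary, undotted call.
escaped = @. v > $mean(v)
```

So an automatic @. would need either an escape like `$` or some rule for deciding which calls are reductions, which is the extra semantic complexity being discussed.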

a is an array and mean(a) is a scalar. When a is compared with mean(a), I guess the scalar is automatically converted to an array of the same length. I do not think there should be any confusion. I have used dplyr in my daily work for several years, and it all works well.

Pandas and data.table are a little more verbose than dplyr. Data wrangling in Matlab is a pain, and Julia seems to use Matlab-style syntax for data manipulation. I think that style is great for writing numerical code, but for data manipulation it might be easier to follow the R and Python style.

What about this use case?

# keep only observations above the mean of income

@where(df, :income .> mean(:income))

As opposed to

m = mean(df[:income])
@where(df, :income > m)