I find the DataFrames.jl group-by syntax very sensible. For example
using DataFrames, Statistics
df = DataFrame(grp = rand(1:8, 100), val = rand(100))
using Pipe
@pipe df |> groupby(_, :grp) |> combine(_, :val => mean)
However, the => syntax may be confusing for a beginner (I am under no illusion that newcomers would come across this post, but that's not the point). R users in particular would be confused by it and wonder: why not just write mean(val)?
DataFramesMeta.jl offers a macro-based approach to this. It isn't finished yet, but it works like this:
@by(df, :grp, mean(:val))
But it relies on macros. Before you can see why macros are needed, you have to understand that an operation prefixed with @ is a macro, and why certain "functions" need to be macros. That takes some underlying understanding.
So I ask the question: can this be done without macros? Well. Yes, but it may not be practical without macros anyway. Here’s how.
Define a new type/structure called DataFrameFn that holds a callable and a column name. For example:
struct DataFrameFn
    fn
    sym
end
The structure above keeps track of a callable fn and a sym containing the column name. If we were to implement a serious example (instead of a POC), we would need to make it more sophisticated, but for now this is sufficient to illustrate the point.
Next, define a method Base.sum(sym::Symbol) that returns an element of DataFrameFn with the inner fn = sum.
Finally, define another method of combine:
function DataFrames.combine(df, fn::DataFrameFn)
    combine(df, fn.sym => fn.fn)
end
And now you can have a macro-less syntax (ironic that I prefer @pipe):
@pipe df |> groupby(_, :grp) |> combine(_, sum(:val))
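Putting the pieces together, a minimal end-to-end sketch could look like this. It assumes the Base.sum method is a one-liner that wraps sum in a DataFrameFn, and it narrows the combine method to GroupedDataFrame (an assumption, made to keep dispatch unambiguous) instead of leaving the first argument untyped. Fixed data replaces rand so the two results can be compared.

```julia
using DataFrames

# Wrapper from the post: a callable plus the column it applies to.
struct DataFrameFn
    fn
    sym
end

# sum(:val) now builds a DataFrameFn instead of throwing a MethodError.
# This is type piracy on Base.sum: tolerable in a POC, not in a package.
Base.sum(sym::Symbol) = DataFrameFn(sum, sym)

# Assumption: restricted to GroupedDataFrame to keep dispatch unambiguous.
DataFrames.combine(gd::GroupedDataFrame, fn::DataFrameFn) =
    combine(gd, fn.sym => fn.fn)

df = DataFrame(grp = repeat(1:4, 5), val = collect(1.0:20.0))
gd = groupby(df, :grp)

macroless = combine(gd, sum(:val))    # proposed macro-less syntax
standard  = combine(gd, :val => sum)  # usual Pair syntax
```

Both calls should go through the same code path, since the new combine method only rewrites its argument into the Pair form.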
Rant about trojan-horse types
PS: this brings me to my previous idea about a trojan-horse type, where the type of an argument can take over any function. E.g.
sum(df, col::DataFrames.ColName) = sum(df[col])
will only overload sum for the type DataFrames.ColName. However, I can imagine some kind of syntax where DataFrames.ColName can be a trojan horse and take ANY function hostage, like
<<function>>(df, col::DataFrames.ColName) = begin <<function>>(df[col]) end
Now any function that takes col::DataFrames.ColName as a parameter and has an arity of two will be overridden by the above. This can be achieved with Cassette.jl, but I believe the user needs to run the code inside a context, so not all functions are automatically replaced like that.
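Short of Cassette.jl, one way to approximate the idea in plain Julia is to loop over an explicit list of functions and generate a method for each with @eval. This is only a sketch: ColName is a hypothetical wrapper type defined in the snippet itself (not something DataFrames.jl provides), and the functions are listed by hand, so it hijacks a chosen few functions and not ANY function.

```julia
using DataFrames

# Hypothetical stand-in for the post's DataFrames.ColName.
struct ColName
    name::Symbol
end

# For each listed function f, generate the method
#   Base.f(df, col::ColName) = f(df[!, col.name])
# Only the functions named here are "taken hostage".
for f in (:sum, :maximum, :minimum)
    @eval Base.$f(df::AbstractDataFrame, col::ColName) = $f(df[!, col.name])
end

df = DataFrame(val = [1.0, 2.0, 3.0])
total   = sum(df, ColName(:val))
largest = maximum(df, ColName(:val))
```

Because ColName is owned by the snippet, these methods are not type piracy. The price is the hand-maintained list and, ironically, a macro.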