I find the DataFrames.jl group-by syntax very sensible. For example:
using DataFrames
using Statistics  # provides mean
using Pipe
df = DataFrame(grp = rand(1:8, 100), val = rand(100))
@pipe df |>
    groupby(_, :grp) |>
    combine(_, :val => mean)
The => may be confusing for a beginner (I am under no illusion that newcomers would come across this post either, but that’s not the point).
The => syntax would confuse R users, as they would think: why not just write mean(val) or just mean(:val)?
Of course, DataFramesMeta.jl can offer a macro-based approach to this. It isn’t finished yet, but it works like this:
@by(df, :grp, mean(:val))
But it relies on macros. To follow it, you first need to know that an operation prefixed with @ is a macro, and why certain “functions” need to be macros. That does take some underlying understanding.
So I ask the question: can this be done without macros? Well. Yes, but it may not be practical without macros anyway. Here’s how.
Define a new type/structure called DataFrameFn that stores a callable together with a column name. For example:
struct DataFrameFn
    fn
    sym
end
The above structure keeps track of a callable fn and a sym that holds the column name. If we were to implement a serious example (instead of a proof of concept), we would need to make it more sophisticated, but for now this is sufficient to illustrate the point.
Now define Base.sum(sym::Symbol) to return an element of DataFrameFn with the inner fn = sum.
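The step just described can be sketched as a one-line method. This is only my reading of it, and the struct is repeated so the snippet runs on its own. One caveat worth flagging: adding a method to Base.sum for Symbol is type piracy, because neither the function nor the type belongs to this code, so treat it as a proof of concept only.

```julia
# Proof-of-concept wrapper from the post: a callable plus a column name.
struct DataFrameFn
    fn
    sym
end

# Assumption: this one-liner is a guess at the method described in the text.
# It is type piracy (Base owns both `sum` and `Symbol`).
Base.sum(sym::Symbol) = DataFrameFn(sum, sym)

spec = sum(:val)      # returns a DataFrameFn instead of erroring
spec.fn([1, 2, 3])    # the stored callable is the ordinary sum
```

The same pattern would have to be repeated for every function you want to support (mean, maximum, and so on), which is part of why this may not be practical.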
Finally, define another method of combine using ::DataFrameFn:
function DataFrames.combine(df, fn::DataFrameFn)
    combine(df, fn.sym => fn.fn)
end
and now you can have a macro-less syntax (ironic that I still prefer @pipe, right?):
@pipe df |>
    groupby(_, :grp) |>
    combine(_, sum(:val))
Yay!
Rant about trojan-horse types
PS: this brings me back to my previous idea about a trojan-horse type, where the type of an argument can take over any function. E.g. sum(df, col::DataFrames.ColName) = sum(df[col]) will only overwrite sum for the type DataFrames.ColName. However, I can imagine some kind of syntax where DataFrames.ColName can be a trojan horse and take ANY function hostage, like
<<function>>(df, col::DataFrames.ColName) = begin
<<function>>(df[col])
end
Now any function that takes col::DataFrames.ColName as a parameter and has an arity of two will be overwritten by the above. This can be achieved with Cassette.jl, but I believe the user needs to run the code in a context, so not all functions are automatically replaced like that.
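Short of a real trojan-horse syntax, the closest plain-Julia approximation I can think of is to loop over an explicit list of functions and generate one forwarding method each with @eval. The sketch below assumes a hypothetical ColName wrapper type (a stand-in for the imagined DataFrames.ColName) and uses a NamedTuple as the table so it has no dependencies. It only covers the functions that are listed, not ANY function.

```julia
# Hypothetical stand-in for the DataFrames.ColName type imagined above.
struct ColName
    name::Symbol
end

# Approximate "<<function>>" with an explicit list: one method per function.
# The table is restricted to NamedTuple to keep the sketch dependency-free.
for f in (:sum, :maximum, :minimum)
    @eval Base.$f(table::NamedTuple, col::ColName) = $f(table[col.name])
end

table = (grp = [1, 1, 2], val = [10, 20, 30])
sum(table, ColName(:val))       # forwards to sum(table.val)
maximum(table, ColName(:val))   # forwards to maximum(table.val)
```

Since it relies on @eval, this is not macro-free either, and every new function has to be added to the list by hand.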