A quick proof-of-concept for a macro-less API for DataFrames that's easier to type

I find the DataFrames.jl group-by syntax very sensible. For example:

using DataFrames
using Statistics  # mean lives in Statistics, not Base

df = DataFrame(grp = rand(1:8, 100), val = rand(100))

using Pipe

@pipe df |>
  groupby(_, :grp) |>
  combine(_, :val => mean)

The => may be confusing for a beginner (I am under no illusion that newcomers would come across this post, but that’s not the point).

The => syntax would also confuse R users, who might ask: why not just write mean(val), or mean(:val)?

Of course, DataFramesMeta.jl can offer a macro-based approach. It isn’t finished yet, but it works like this:

@by(df, :grp, mean(:val))

But it relies on macros. To follow what is going on, a user first has to know that a leading @ marks a macro, and then why certain “functions” need to be macros at all. That takes some underlying understanding.

So I ask the question: can this be done without macros? Well, yes, but it may not be practical without macros anyway. Here’s how.

Define a new type/structure called DataFrameFn and make elements of that type callable. For example:

struct DataFrameFn
    fn
    sym
end

The above structure keeps track of a callable fn and a sym that holds the column name. A serious implementation (rather than a POC) would need to be more sophisticated, but for now this is sufficient to illustrate the point.

Now define Base.sum(sym::Symbol) to return an instance of DataFrameFn with the inner fn = sum.
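A minimal sketch of that definition, assuming the untyped DataFrameFn struct from above (repeated here so the snippet runs on its own). Note that adding a method to Base.sum for Symbol, a type this code does not own, is type piracy, so treat it as proof-of-concept code only.

```julia
# Same struct as above, repeated so this snippet is self-contained.
struct DataFrameFn
    fn   # the callable to apply to the column, e.g. sum
    sym  # the column name as a Symbol
end

# Calling sum on a bare Symbol now returns a deferred operation
# instead of throwing an error.
# NOTE: this extends a Base function for a Base type (type piracy).
Base.sum(sym::Symbol) = DataFrameFn(sum, sym)

op = sum(:val)  # a DataFrameFn holding sum and :val
```

Nothing is computed at this point; op only records which function to apply to which column.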

Finally, define another method of combine that accepts a ::DataFrameFn:

function DataFrames.combine(df, fn::DataFrameFn)
   combine(df, fn.sym => fn.fn)
end

And now you can have a macro-less syntax (ironic that I still prefer @pipe, right?):

@pipe df |> 
    groupby(_, :grp) |> 
    combine(_, sum(:val))

Yay!

Rant about trojan-horse types

PS: This brings me to my previous idea about a trojan-horse type, where the type of an argument can take over any function. E.g.

sum(df, col::DataFrames.ColName) = sum(df[col])

will only overload sum for the type DataFrames.ColName. However, I can imagine some kind of syntax where DataFrames.ColName can be a trojan horse and take ANY function hostage, like

<<function>>(df, col::DataFrames.ColName) = begin
  <<function>>(df[col])
end

Now any function that takes col::DataFrames.ColName as a parameter and has an arity of two will be overridden by the above. This can be achieved with Cassette.jl, but I believe the user needs to run the code in a context, so not all functions are automatically replaced like that.
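Short of Cassette.jl, a rough approximation is to generate the methods with @eval for an explicit list of functions. The sketch below is not the "ANY function" version imagined above: it only covers the functions named in the loop. ColName here is a hypothetical stand-in type (not a real DataFrames.jl type), and a NamedTuple stands in for the data frame so the snippet needs no packages.

```julia
# Hypothetical stand-in for the imagined DataFrames.ColName type.
struct ColName
    name::Symbol
end

# Generate f(df, col::ColName) = f(column) for each listed function.
# Only the functions in this tuple are "taken hostage".
for f in (:sum, :maximum, :minimum)
    @eval Base.$f(df, col::ColName) = $f(getproperty(df, col.name))
end

# A NamedTuple of vectors stands in for a data frame.
df = (grp = [1, 1, 2], val = [1.0, 2.0, 3.0])

total   = sum(df, ColName(:val))
largest = maximum(df, ColName(:val))
```

Because ColName is a type defined by this code, these methods are not type piracy, unlike adding a Base.sum method for Symbol.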

IMO thinking too much about what may be confusing to someone who is either unfamiliar with some tool or is used to some other tool is a distraction when designing an interface. If you go down that route, you can also assume that : will be confusing for users who have used it before for ranges etc.

FWIW, I think the symbol => function syntax for DataFrames.jl is pretty clean and consistent, and very nice to use with Pipe.jl. In contrast, your proposed solution has the following disadvantages:

  1. type piracy

  2. throwing namespaces out the window: now symbols are looked up in Base (I guess, since you didn’t provide an implementation)

  3. what about closures?
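One reading of the closures point, sketched with plain Julia and no packages: symbol => function builds an ordinary Pair, so it can carry any callable, including an anonymous function, whereas the sum(:val) style only works for functions that were given a Symbol method in advance. The names spread and spec are made up for this illustration.

```julia
# An anonymous function: there is no name to attach a Symbol method to.
spread = x -> maximum(x) - minimum(x)

# => is the ordinary Pair constructor, so any callable fits on the right.
spec = :val => spread

col = first(spec)   # the column name, :val
fn  = last(spec)    # the closure itself

result = fn([1, 4, 2])  # apply the closure to some column data
```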