Hello, I’m trying to figure out the best design for DataFramesMeta, and I want to understand the performance implications of the following approach.
- Go through an expression and find all the symbols that appear in the `propertynames` of the data frame (which, of course, are not part of the data frame's type information).
- Write a function with exactly as many arguments as the number of columns referenced. That is, given

```julia
df = DataFrame(a = [1, 2], b = [3, 4])
x = 4
@transform(df, c = a .+ b .+ x)
```
the macro looks at the expression `:(a .+ b .+ x)` and uses, say, MacroTools to find that `:a` and `:b` are columns in the data frame, but `:x` is not. Then I make a function

```julia
_f(_a, _b) = _a .+ _b .+ x
```

and make a subsequent `src => fun => dest` call for `DataFrames.transform`.
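For reference, the column-detection step could be sketched like this (my own illustration; `collect_columns` is a hypothetical helper name, not part of any package):

```julia
using MacroTools
using DataFrames

# Hypothetical helper: walk the expression and collect every symbol
# that names a column of `df`, in order of first appearance.
function collect_columns(ex, df)
    pn = propertynames(df)
    cols = Symbol[]
    MacroTools.postwalk(ex) do node
        if node isa Symbol && node in pn && !(node in cols)
            push!(cols, node)
        end
        node
    end
    return cols
end

df = DataFrame(a = [1, 2], b = [3, 4])
collect_columns(:(a .+ b .+ x), df)  # [:a, :b] — :x is not a column
```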
I think this is roughly equivalent to the following:
```julia
julia> function make_function(df)
           pn = propertynames(df)
           if :a in pn && :b in pn
               [:a, :b] => (function(a, b) a .+ b end) => :c1
           elseif :a in pn
               :a => (function(a) a .+ 1 end) => :c2
           elseif :b in pn
               :b => (function(b) b .+ 100 end) => :c3
           end
       end
make_function (generic function with 1 method)

julia> transform(df, make_function(df))
```
That is, I construct a different function depending on what is in the data frame.
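For the original example, then, the generated call would be equivalent to writing the `src => fun => dest` pair by hand (a sketch; the anonymous function captures `x` as a closure):

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])
x = 4

# The columns named in `src` are passed as separate positional
# arguments; non-column symbols like `x` come from the enclosing scope.
result = transform(df, [:a, :b] => ((_a, _b) -> _a .+ _b .+ x) => :c)

result.c  # [8, 10]
```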
My question is: Are there performance costs to this approach? Assuming this is feasible, is defining a function this “late” going to prevent the compiler from making the necessary optimizations?
Can someone point me to more reading on the subject?
It looks like this post indicates there may be performance penalties. But people do so many things with metaprogramming, particularly with ForwardDiff, ChainRules, etc., that maybe this isn't a problem.