Hello, I’m trying to figure out the best way to implement DataFramesMeta, and I want to understand the performance implications of the following approach.
Go through an expression and find all the symbols that appear in the propertynames of the data frame (which, of course, are not listed in the type information of the data frame).
Then write a function with exactly as many inputs as there are columns referenced. I.e.
df = DataFrame(a = [1, 2], b = [3, 4])
x = 4
@transform(df, c = a .+ b .+ x)
looks at the expression :(a .+ b .+ x) and uses, say, MacroTools to find that :a and :b are columns in the data frame, but :x is not. Then I make a function
_f(_a, _b) = _a .+ _b .+ x
and make a subsequent src => fun => dest call for DataFrames.transform.
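A minimal sketch of the pieces involved (`_f` and `spec` are hypothetical names; actually running the transform requires DataFrames.jl):

```julia
x = 4

# Hypothetical generated function: one positional argument per referenced
# column; the non-column variable `x` is simply closed over.
_f(_a, _b) = _a .+ _b .+ x

# The `src => fun => dest` pair that would be handed to DataFrames.transform;
# `=>` is right-associative, so this is [:a, :b] => (_f => :c).
spec = [:a, :b] => _f => :c

_f([1, 2], [3, 4])  # == [8, 10], matching df.c from the @transform call above
```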
I think this is roughly equivalent to the following:
julia> function make_function(df)
           pn = propertynames(df)
           if :a in pn && :b in pn
               [:a, :b] => (function(a, b)
                   a .+ b
               end) => :c1
           elseif :a in pn
               :a => (function(a)
                   a .+ 1
               end) => :c2
           elseif :b in pn
               :b => (function(b)
                   b .+ 100
               end) => :c3
           end
       end
make_function (generic function with 1 method)
julia> transform(df, make_function(df))
That is, I make a different function depending on what is in the data frame.
My question is: Are there performance costs to this approach? Assuming this is feasible, is defining a function this “late” going to prevent the compiler from making the necessary optimizations?
Can someone point me to more reading on the subject?
It looks like this post indicates there may be performance penalties. But people do so many things with metaprogramming, particularly with ForwardDiff and ChainRules etc., that maybe this isn’t a problem.
There is nothing wrong with generating functions on the fly. The real problem with your code is that it is not type stable — every distinct function definition in Julia has its own type, and your code returns a different type of function depending upon the runtime output of propertynames. This will not only slow down execution of these functions (they can’t be inlined), but it will slow down subsequent code that depends on the result of those functions (since the return types aren’t known until runtime).
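A self-contained toy example of that instability (the name `make` is hypothetical, not DataFrames code):

```julia
# Every function definition in Julia has its own unique type, so the two
# branches below return values of two different types.
make(flag) = flag ? (x -> x + 1) : (x -> x + 2)

typeof(make(true)) == typeof(make(false))  # false: two distinct closure types

# Inference sees a union of two closure types, so a caller holding the
# result cannot inline the eventual call.
Base.return_types(make, (Bool,))
```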
In contrast, something like foo(array, y) = sum(x -> sqrt(abs2(x) + y), array) also generates a function as part of its execution, but it generates the same function type for any given type of y independent of the value, so it is type stable and Julia will generate fast code for it.
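The distinction can be checked directly; here `g` is a hypothetical helper that returns the same kind of closure:

```julia
g(y) = (x -> sqrt(abs2(x) + y))

# The closure's type is parameterized by the *type* of the captured `y`,
# not its value, so constructing it is type stable:
typeof(g(1.0)) == typeof(g(2.0))  # true: same closure type for any Float64 y
typeof(g(1.0)) == typeof(g(1))    # false: capturing an Int gives a different type
```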
As a more general comment regarding the question from @pdeffebach: the context of the use case he is asking about is the following.
In DataFrames.jl we do not specialize on the function type anyway (as it led to recompilation with every new function passed; most of the time the compilation cost was bigger than the computation cost because of this). Additionally, the result type of the computation is DataFrame, which is type unstable by design.
In this context, though, the actual execution of fun in src => fun => dest is type-stable, right? Since fun is defined earlier and there is a function barrier between taking the vectors in src out of the data frame and passing them to fun, right?
(At least this is my understanding of how things work.)
In general, @nospecialize is only a hint to the compiler (so it might actually specialize the call even if you ask it not to). But it is beyond my knowledge to give you exact rules for what happens when.
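For reference, `@nospecialize` is written on an argument in a method signature; a minimal sketch (`apply_col` is a hypothetical name, not the actual DataFrames.jl code):

```julia
# Ask the compiler not to compile a fresh specialization of `apply_col`
# for every distinct function type passed in as `f`.
function apply_col(@nospecialize(f), v::Vector{Float64})
    return f(v)  # the call into `f` is then dispatched at run time
end

apply_col(sum, [1.0, 2.0, 3.0])  # == 6.0
```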
@stevengj can perhaps comment in more detail, but in the meantime we should add benchmarks for functions inside transform compared to their counterparts outside of transform. My intuition is that they are the same in this context.
f is type unstable, but that does not mean it will be slow: once the dispatch to sum is done, later things are fast (but, as @stevengj noted, the type instability will propagate; in general that would be a problem, but not for DataFrame, which is type unstable anyway).
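This "once the dispatch is done, things are fast" behavior is the classic function-barrier pattern; a minimal sketch with a `Dict{Symbol,Any}` standing in for the type-unstable DataFrame (all names hypothetical):

```julia
# Column storage whose element types the compiler cannot see, like a DataFrame.
cols = Dict{Symbol,Any}(:a => [1.0, 2.0, 3.0])

kernel(v) = sum(v)  # compiled once per concrete vector type

function barrier_sum(d)
    v = d[:a]         # type-unstable lookup: inferred as Any
    return kernel(v)  # one dynamic dispatch; inside `kernel` the code is fast
end

barrier_sum(cols)  # == 6.0
```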
And if you generate the function dynamically, you pay the compilation cost every time.
It would be great if you could run independent benchmarks on the PR I linked. I have run such benchmarks, and that is why I added @nospecialize there (but benchmarking is tricky, so it would be great to have some independent tests).
I just ran a similar benchmark, and it confirms the logic.
julia> using DataFrames, Statistics, BenchmarkTools

julia> df = DataFrame(a = rand(1_000_000), b = rand(1_000_000));
julia> a = df.a; b = df.b;
julia> function comp(a, b)
           ma = mean(a)
           mb = mean(b)
           c = cor(a, b)
           z = (a .- mb ./ std(a))
           return b .+ mb .- z .* c
       end;
julia> function make_fun_2(df)
           pn = propertynames(df)
           if :a in pn && :b in pn
               [:a, :b] => (function(a, b)
                   ma = mean(a)
                   mb = mean(b)
                   c = cor(a, b)
                   z = (a .- mb ./ std(a))
                   return b .+ mb .- z .* c
               end) => :c
           else
               [:a, :b] => ((a, b) -> fill("hello", length(a))) => :c
           end
       end;
julia> @btime comp(a, b);
5.794 ms (4 allocations: 15.26 MiB)
julia> @btime transform!(df, [:a, :b] => comp => :c);
6.067 ms (88 allocations: 15.26 MiB) # within a margin of error from the above
julia> @btime transform!(df, make_fun_2(df));
8.109 ms (97 allocations: 30.52 MiB)