Understanding the performance costs of generating functions on the fly

Hello, I’m trying to figure out the best approach for DataFramesMeta, and I want to understand the performance implications of the following.

  1. Go through an expression and find all the symbols that appear in the propertynames of the data frame (which are, of course, not listed in the type information of the data frame).
  2. Write a function that takes exactly as many inputs as there are columns referenced. For example,
df = DataFrame(a = [1, 2], b = [3, 4])
x = 4
@transform(df, c = a .+ b .+ x)

looks at the expression :(a .+ b .+ x) and uses, say, MacroTools to find that :a and :b are columns in the data frame, but not :x. Then I make a function

_f(_a, _b) = _a .+ _b .+ x

and then construct a src => fun => dest pair to pass to DataFrames.transform, as sketched below.
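For concreteness, a minimal sketch of what the generated call would look like (here x is resolved as a global variable, matching the sketch above):

julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> x = 4;

julia> _f(_a, _b) = _a .+ _b .+ x;   # x is captured, not passed as a column

julia> transform(df, [:a, :b] => _f => :c)   # adds c == [8, 10]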

I think this is roughly equivalent to the following:

julia> function make_function(df)
       pn = propertynames(df)
       if :a in pn && :b in pn
           [:a, :b] => (function(a, b)
               a .+ b
           end) => :c1
       elseif :a in pn 
           :a => (function(a)
               a .+ 1
           end) => :c2
       elseif :b in pn
           :b => (function(b)
               b .+ 100
           end) => :c3
       end
       end
make_function (generic function with 1 method)
julia> transform(df, make_function(df))

That is, I construct a different function depending on what is in the data frame.

My question is: Are there performance costs to this approach? Assuming it is feasible, will defining a function this “late” prevent the compiler from making the necessary optimizations?

Can someone point me to more reading on the subject?

It looks like this post indicates there may be performance penalties. But people do so many things with metaprogramming (particularly with ForwardDiff, ChainRules, etc.) that maybe this isn’t a problem.

There is nothing wrong with generating functions on the fly. The real problem with your code is that it is not type stable — every distinct function definition in Julia has its own type, and your code returns a different type of function depending upon the runtime output of propertynames. This will not only slow down execution of these functions (they can’t be inlined), but it will slow down subsequent code that depends on the result of those functions (since the return types aren’t known until runtime).
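To see the first point concretely: even two textually identical anonymous functions have distinct types, so which type make_function returns can only be known at runtime.

julia> f1 = x -> x + 1; f2 = x -> x + 1;

julia> typeof(f1) == typeof(f2)   # every definition creates its own closure type
false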

In contrast, something like foo(array, y) = sum(x -> sqrt(abs2(x) + y), array) also generates a function as part of its execution, but it generates the same function type for any given type of y independent of the value, so it is type stable and Julia will generate fast code for it.
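You can check this property directly; the closure’s type depends only on the type of the captured y, not on its value:

julia> g(y) = x -> sqrt(abs2(x) + y);

julia> typeof(g(1.0)) == typeof(g(2.0))   # same closure type for any Float64 y
true

julia> typeof(g(1.0)) == typeof(g(1))     # but a different type when y is an Int
false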


Thank you, this is the decisive answer I was looking for. For this reason I will avoid a naive attempt to construct functions in this way.

As a more general comment regarding @pdeffebach’s question, the context of the use case he is asking about is the following:

In DataFrames.jl we do not specialize on the function type anyway (as doing so led to recompilation with every new function passed; most of the time the compilation cost was larger than the computation cost because of this). Additionally, the result type of the computation is DataFrame, which is type unstable by design.
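(For reference, that non-specialization is requested with @nospecialize in the method signature; a minimal sketch with made-up names, not the actual DataFrames code:)

julia> function apply_fun(@nospecialize(fun), v)
           # hint to the compiler: do not compile a separate
           # specialization for each concrete type of fun
           return fun(v)
       end;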


In this context, though, the actual execution of fun in src => fun => dest is type-stable, right? fun is defined earlier, and there is a function barrier between taking the vectors in src out of the data frame and passing them into fun.

the actual execution of fun in src => fun => dest is type-stable

I am not sure what you mean by “is type-stable” here. The caller function does not specialize on fun as you can see here: Allow multicolumn transformations for AbstractDataFrame by bkamins · Pull Request #2461 · JuliaData/DataFrames.jl · GitHub.

However, once you call the function, e.g. here Allow multicolumn transformations for AbstractDataFrame by bkamins · Pull Request #2461 · JuliaData/DataFrames.jl · GitHub, this forms a barrier, and what the function does inside it is executed efficiently.

(At least this is my understanding of how things work.)
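Schematically, the barrier pattern is something like this (made-up names, not the actual DataFrames internals):

function outer(df)        # not specialized on the column types
    v = df[!, :a]         # concrete type of v is unknown to the compiler here
    return inner(v)       # dynamic dispatch here is the function barrier
end

inner(v) = sum(v)         # compiled and run efficiently for the concrete type of v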

In general, @nospecialize is only a hint to the compiler, though (so it might actually specialize the call even if you ask it not to). But it is beyond my knowledge to give you exact rules for what happens when.

@stevengj can perhaps comment in more detail, but in the meantime we should add benchmarks comparing functions inside transform to their counterparts outside of transform. My intuition is that they perform the same in this context.

This is what I mean, but this should also be possible with a dynamically generated function, the way you originally proposed to do it.

For me this is a case of the following scenario:

julia> f(fun, x) = fun[1](x)
f (generic function with 1 method)

julia> @code_warntype f(Any[sum], 1:10^9)
Variables
  #self#::Core.Compiler.Const(f, false)
  fun::Array{Any,1}
  x::UnitRange{Int64}

Body::Any
1 ─ %1 = Base.getindex(fun, 1)::Any
│   %2 = (%1)(x)::Any
└──      return %2

julia> @time f(Any[sum], 1:10^9)
  0.000004 seconds (3 allocations: 144 bytes)
500000000500000000

f is type unstable, but that does not mean it will be slow: once the dispatch to sum is done, things are fast (though, as @stevengj noted, the type instability will propagate; in general that would be a problem, but not for DataFrame, which is type unstable anyway).
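The propagation is easy to see by wrapping f in a caller; its body is inferred as Any as well (output abbreviated):

julia> g2(fun, x) = f(fun, x) + 1;

julia> @code_warntype g2(Any[sum], 1:10^9)
[...]
Body::Any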

And if you generate the function dynamically you pay the compilation cost every time:

julia> @time f(Any[x -> sum(x)], 1:10^9)
  0.004668 seconds (3.56 k allocations: 157.237 KiB)
500000000500000000

julia> @time f(Any[x -> sum(x)], 1:10^9)
  0.004361 seconds (3.56 k allocations: 157.019 KiB)
500000000500000000

even if the code is type stable, as in this case (note that the compilation cost here is actually higher than in the type-unstable case):

julia> @time f([x -> sum(x)], 1:10^9)
  0.017118 seconds (58.12 k allocations: 3.303 MiB)
500000000500000000

julia> @time f([x -> sum(x)], 1:10^9)
  0.017425 seconds (58.12 k allocations: 3.305 MiB)
500000000500000000

It would be great if you ran independent benchmarks on the PR I linked. I have run such benchmarks, and that is why I added @nospecialize there (but benchmarking is tricky, so it would be great to have some independent tests).

I just ran a similar benchmark, and it confirms the logic.

julia> using DataFrames, Statistics, BenchmarkTools

julia> df = DataFrame(a = rand(1_000_000), b = rand(1_000_000));

julia> a = df.a; b = df.b;

julia> function comp(a, b)
       ma = mean(a)
       mb = mean(b)
       
       c = cor(a, b)
       
       z = (a .- mb ./ std(a))
       
       return b .+ mb .- z .* c
       end;

julia> function make_fun_2(df)
       pn = propertynames(df)
       if :a in pn && :b in pn 
           [:a, :b] => function(a, b)       
              ma = mean(a)
              mb = mean(b)
              
              c = cor(a, b)
              
              z = (a .- mb ./ std(a))
              
              return b .+ mb .- z .* c
              end => :c
       else
           [:a, :b] => ((a, b) -> fill("hello", length(a))) => :c
       end
       end;

julia> @btime comp(a, b);
  5.794 ms (4 allocations: 15.26 MiB)

julia> @btime transform!(df, [:a, :b] => comp => :c);
  6.067 ms (88 allocations: 15.26 MiB)  # within a margin of error from the above

julia> @btime transform!(df, make_fun_2(df));
  8.109 ms (97 allocations: 30.52 MiB)

Putting aside performance, would that really be a good thing (for x to refer to a column of the data frame iff :x is in names(df))?

It sounds like code like tryparse(Float64, x) would work 99% of the time but fail inexplicably whenever a column is named Float64.
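For example (hypothetical, under the proposed rule):

df = DataFrame(Float64 = [1.0, 2.0], x = ["1.5", "2.5"])
# under the proposed rule, `Float64` below would refer to the column
# df.Float64 rather than to the type, so the broadcast would break:
@transform(df, y = tryparse.(Float64, x))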
