Using symbolic simplification in DataFramesMeta macros

In DataFramesMeta it is possible to evaluate algebraic combinations and calculations involving the column names of a DataFrame.
Is it possible (in DataFramesMeta, or with another package that can be used inside its macros) to simplify expressions to improve the execution time?

As an example I have the following data:

using DataFrames, DataFramesMeta
using Random

twister = MersenneTwister(1234)
ll = Vector{DataFrame}(undef, 10)
for g in 1:10
ll[g] = DataFrame(g = g, x = rand(twister, 10000), y = rand(twister, 10000));
end
dd = reduce(vcat, ll)
gd = @groupby(dd, :g);

The following all deliver the same result, but the first is the fastest. Is there a way for Julia to simplify the other two to the first?

result1 = @time @transform(gd, :result = :y)
#  0.006957 seconds (369 allocations: 4.599 MiB)
result2 = @time @transform(gd, :result = :y + :x - :x + exp.(log.(:y)) - :y)
#  0.203798 seconds (237.28 k allocations: 29.093 MiB, 5.66% gc time, 89.07% compilation time)
result3 = @time @transform(gd, :result = :y/3 + :y/3 + :y/3 + :x/2 + :x/2 +:y - (:x +:y))
# 0.456713 seconds (280.93 k allocations: 37.507 MiB, 3.88% gc time, 90.19% compilation time)

And if it is possible, can you define simplification rules involving arbitrary functions? In the case below, for example, you could tell Julia that k * ShiftedArrays.lag(x) == ShiftedArrays.lag(k * x), so it could simplify accordingly.

using ShiftedArrays
result4 = @time @transform(gd, :result = ShiftedArrays.lag(:y))
# 0.102200 seconds (73.83 k allocations: 13.625 MiB, 94.17% compilation time)
result5 = @time @transform(gd, :result = ShiftedArrays.lag(:y)/2 + ShiftedArrays.lag(0.5 * :y))
# 0.246605 seconds (259.86 k allocations: 29.362 MiB, 96.41% compilation time)

No, this is definitely not possible in DataFramesMeta.jl.

I don’t know this space at all, but I suppose you could use SymbolicUtils.jl to simplify an expression and then call the resulting function on the columns?
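A minimal sketch of that idea, assuming SymbolicUtils.jl is installed (the symbols x and y stand in for the columns, and the expression is a variant of result3 above):

```julia
using SymbolicUtils

# Symbolic stand-ins for the data frame columns.
@syms x::Real y::Real

# A variant of the result3 expression: `simplify` collects terms, so
# x/2 + x/2 - x and the extra y terms cancel, leaving just y.
expr = y + x/2 + x/2 + y - (x + y)
simplify(expr)
```

SymbolicUtils also supports user-defined rewrite rules via @rule (and @acrule for commutative operators), which is roughly how you would express something like k * lag(x) -> lag(k * x); though you would still have to turn the simplified expression back into a DataFramesMeta call yourself.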

More generally, your timing isn’t actually doing what you think it’s doing.

Your timing is entirely taken up by the work Julia does to compile a new anonymous function. Here’s what’s happening under the hood:

  1. DataFramesMeta sees the expression
:y + :x - :x + exp.(log.(:y)) - :y
  2. It then does a replacement, making a function
foo(x, y) = y + x - x + exp.(log.(y)) - y
  3. It then calls this function on the columns. This is the key part that takes time: because Julia has never seen this function before, it needs to be compiled first. Sometimes this can take a noticeable amount of time (on the order of 0.2 seconds).
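To see steps 2 and 3 concretely, here is a hand-written stand-in for the kind of function the macro generates (a sketch, not DataFramesMeta’s actual generated code):

```julia
# Column references :x and :y have become the arguments x and y.
foo(x, y) = y + x - x + exp.(log.(y)) - y

x = rand(1000)
y = rand(1000)

# For y in (0, 1), exp.(log.(y)) recovers y (up to floating point error),
# so the whole expression is just y again.
foo(x, y) ≈ y
```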

This compilation time scales with the complexity of the expression, but the time spent on the operations themselves is very small.

There are still differences in the execution time of the functions themselves, but these are a few orders of magnitude smaller than the time differences you observe:

julia> foo1(dd) = @transform dd :result = :y;

julia> foo2(dd) = @transform dd :result = :y + :x - :x + exp.(log.(:y)) - :y;

julia> foo3(dd) = @transform dd :result = :y/3 + :y/3 + :y/3 + :x/2 + :x/2 + :y - (:x +:y);

julia> @btime foo1($dd);
  172.513 μs (83 allocations: 3.82 MiB)

julia> @btime foo2($dd);
  2.888 ms (122 allocations: 6.87 MiB)

julia> @btime foo3($dd);
  994.355 μs (128 allocations: 9.16 MiB)

Put your Julia code in functions so that Julia can cache these anonymous functions created by DataFramesMeta.

Yes, it’s frustrating that DataFramesMeta.jl (and all similar packages) can feel sluggish at the REPL because of this compilation time. A considerable amount of effort has been put into reducing this time on the DataFramesMeta side of things, but it’s inevitable.


Thanks. Yeah, I was aware of the compilation time and tried to remove it by reporting the times from the second or third run of the lines of code. My computer is pretty ancient though, which probably explains my worse compilation times.

It seems like on your computer too, foo3 takes several times longer than the other options, which suggests a useful time saving is possible by simplifying the expressions.

This must be possible in the mapping from the first to the second expression in your answer. I guess something from SymbolicUtils could be used here to simplify the resulting expression.

To be clear, for DataFramesMeta, @time is going to be slow no matter which run it’s on. Because (I think) the anonymous function is created in global scope, Julia doesn’t cache it, so @time doesn’t help. @btime, however, does.

It is highly unlikely the time difference between foo2 and foo3 is going to matter in your code. You should ignore that kind of thing unless you have totally confirmed it’s not a bottleneck.


Fair enough. I did not appreciate the differences between @time and @btime.

I think in a lot of cases the difference would be meaningful. For instance, the second option below is twice as fast as the first, but I think most people would naturally write it the first way.

result1 = @btime @transform(gd, :result = 10 .* :x .^ 5 + :x .^ 4 .+ :x.^3 .+ 4 * :x .^ 2 .+ 3 * :x .+ 4)
#  2.433 ms (686 allocations: 13.59 MiB)
result2 = @btime @transform(gd, :result = 4 .+ :x .* ( 3 .+ :x .* (4 .+ :x .* (1 .+ :x .* (1 .+ 10 .* :x)  )   )))
# 1.157 ms (526 allocations: 9.01 MiB)
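For polynomials specifically there is a middle ground: Base Julia’s `evalpoly` already implements Horner’s method, so neither form needs to be nested by hand. A sketch, with the coefficients read off the polynomial above in ascending order:

```julia
# evalpoly(x, (c0, c1, ..., cn)) computes c0 + c1*x + ... + cn*x^n via Horner.
coeffs = (4, 3, 4, 1, 1, 10)  # 4 + 3x + 4x^2 + x^3 + x^4 + 10x^5
x = 0.7
evalpoly(x, coeffs) ≈ 4 + 3x + 4x^2 + x^3 + x^4 + 10x^5
```

Inside the macro that would be something like @transform(gd, :result = evalpoly.(:x, Ref(coeffs))).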

I guess I could just have a go with using something like SymbolicUtils before putting it into DataFramesMeta however.

There are some other issues.

Why are you performing an operation on a grouped data frame when you don’t take advantage of the grouping? I would write this as

@rtransform dd :result = 10 * :x + :x^2

That will speed up your code way more than using SymbolicUtils. You could also do @rtransform! to mutate the data frame in place and avoid copies.

I would strongly recommend writing things the easy way, no matter how verbose, and then stepping back at the end of your work to examine what is really slowing your code down.

Yeah my code does not look anything like this. I just tried to get a really simple example. I agree that the grouping is pretty redundant in this example.

If you are still frustrated with performance after putting things in functions, then try SymbolicUtils.

But if you are frustrated with performance at the REPL, then SymbolicUtils is going to make things much worse, simply because it will do more rewriting and function creation in global scope.