Hello everyone.
I want to showcase some performance characteristics and subtle problems with 'homemade closures'. By that I mean a struct that carries some data and has a call method that uses this data, something like Base.Fix1. In case you don't know what that is, here is a (simplified) version of its definition (from operators.jl):
struct Fix1{F,T} <: Function
    f::F
    x::T
end
(f::Fix1)(y) = f.f(f.x, y)
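In case a concrete example helps: Fix1(f, x) just returns a callable that fixes the first argument of f to x. Something like this (names made up for illustration):
add_two = Base.Fix1(+, 2)
add_two(3)                         # 5, same as (+)(2, 3)
map(Base.Fix1(*, 10), [1, 2, 3])   # [10, 20, 30]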
I got sent down this rabbit hole after starting to toy with the package PartialFunctions. This package lets you create new functions from existing ones by partially applying some of their arguments. The core idea behind it is an operator f $ x, which creates a data structure very similar in functionality to Fix1(f, x), but with more bells and whistles for chaining, like f $ x $ y $ z. After the initial excitement of rewriting some code with partial function application, I noticed various performance regressions. I ran a lot of benchmarks and performance experiments, and ended up rewriting the entire thing (twice).
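For those who haven't seen the package, usage looks roughly like this (quoting from memory, so treat the exact API as an approximation):
using PartialFunctions

# `$` partially applies arguments from the left, similar in spirit to Base.Fix1
simonsays = println $ "Simon says: "
simonsays("hello")   # prints "Simon says: hello"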
I managed to identify the problems in the end. Annoyingly, one of them is tied pretty closely to the compiler's innermost workings, and as far as I can see there is no solution for it. I wanted to share my findings with the community anyway; maybe someone will find them interesting.
What’s the problem?
Have a look at the following code:
using BenchmarkTools
"""
The point of this function is to be very expensive in its first argument and
cheap in its second argument.
"""
expensiveFirst(x, y) = x^1.5 + 2y
# some arbitrary data
data = rand(1:100, 1_000)
@btime p1 = mapreduce(x -> expensiveFirst(1, x), +, $data)
# -> ca 107ns on my machine
@btime p2 = mapreduce(Base.Fix1(expensiveFirst, 1), +, $data)
# -> ca 3.3µs on my machine
x -> expensiveFirst(1, x) and Fix1(expensiveFirst, 1) do pretty much exactly the same thing, yet the second one is roughly 30 times slower. What happened here? Obviously, in the first case, the compiler realized that the first argument will always be 1 and evaluated that part of the function at compile time, so the expensive part of the computation (the x^1.5 part) does not need to happen 1000 times at runtime.
In the second case this does not happen: when the code is compiled, the type Fix1{typeof(expensiveFirst), Int} does not contain the information that the first argument will always be 1, whereas the anonymous function is its own unique type with the literal 1 baked into its compiled specialization.
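One quick way to see the difference (the exact closure type name will vary):
# the fixed value is not recoverable from Fix1's type; only its *type* is stored
typeof(Base.Fix1(expensiveFirst, 1)) == typeof(Base.Fix1(expensiveFirst, 2))   # true

# the lambda, by contrast, is a unique closure type whose body contains the
# literal 1, so its compiled specialization can constant-fold the x^1.5 part
typeof(x -> expensiveFirst(1, x))   # something like var"#1#2"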
So can we fix this somehow? Well… kinda. Have a look at this slightly contrived type definition:
# the fixed value lives in the type parameter X, not in a field
struct ValueFix1{F, X}
    f::F
end
(f::ValueFix1{F, X})(y) where {F, X} = f.f(X, y)
# the constructor writes the *value* of x into the type signature
ValueFix1(f::F, x) where {F} = ValueFix1{F, x}(f)

@btime p3 = mapreduce(ValueFix1(expensiveFirst, 1), +, $data)
# -> ca 107ns on my machine - just as fast as the native lambda expression
Instead of saving the 1 in the struct, I wrote it directly into the type signature now. This allows for proper constant propagation again. Problem solved? Not quite.
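As an aside, this is essentially the same trick as Base.Val, the standard way of lifting a value into the type domain. A sketch of how that would look here, with a helper name I just made up and without benchmarking it for this post:
expensiveFirstVal(::Val{X}, y) where {X} = expensiveFirst(X, y)

# the fixed argument is now Val{1}(), so its value is part of Fix1's type again
# and the compiler can see it
mapreduce(Base.Fix1(expensiveFirstVal, Val(1)), +, data)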
Of course, the $ operator or Fix1 could be redefined to return something more like the ValueFix1 struct, which would allow for better constant propagation when values are known at compile time, but would bring everything to a screeching halt due to massive type instabilities when they are not. Not very helpful!
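To make the 'screeching halt' concrete, here is what happens with ValueFix1 when the value is only known at runtime (a sketch, no timings attached):
x = rand(1:100)                   # a value only known at runtime
vf = ValueFix1(expensiveFirst, x)
# the return type of the constructor depends on the *value* of x, so inference
# cannot know it: the construction is dynamically dispatched, and a fresh
# specialization of mapreduce has to be compiled for every new value of x
mapreduce(vf, +, data)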
The only thing a package like PartialFunctions could do is define two different operators, one for compile-time-known constants and one for everything else (sketched below), but it is a bit of a hard sell for a package that is supposed to beautify and enhance your syntax to also offload some of the compiler's work onto the programmer.
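Just to illustrate what that split could look like (purely hypothetical, not part of PartialFunctions, and using plain named helpers instead of operators):
fix_runtime(f, x) = Base.Fix1(f, x)   # type-stable, but no constant propagation
fix_static(f, x)  = ValueFix1(f, x)   # constant propagation, but type-unstable for runtime x

mapreduce(fix_static(expensiveFirst, 1), +, data)             # assuming the literal 1 constant-propagates through the helper
mapreduce(fix_runtime(expensiveFirst, rand(1:100)), +, data)  # safe path for values only known at runtime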
Can this somehow be fixed?
I see no viable solution to this. The only ways it might get fixed, as far as I can tell, are:
- The Julia compiler gets an upgrade to optimize much more aggressively around captured values.
- Operators, and functions in general, get a way to know whether an argument is a compile-time-known constant (maybe via dispatch on Core.Const?). This is probably antithetical to the entire optimization philosophy, though: compilers generally try to make only optimizations that change nothing except performance, and with this feature the mere act of attempting an optimization could change the output of the program.
None of these are on the horizon or a priority right now, at least afaik. I’m sure there is much more important stuff to do.
I also tried to force the compiler to inline the calls, but that did not seem to do anything either.
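For completeness, this is the kind of thing I tried (struct name made up here, so as not to touch Base.Fix1 directly):
struct InlineFix1{F, T} <: Function
    f::F
    x::T
end
@inline (f::InlineFix1)(y) = f.f(f.x, y)

@btime mapreduce(InlineFix1(expensiveFirst, 1), +, $data)
# still as slow as Base.Fix1; the inlining hint alone does not bring back the
# constant propagation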
What is affected?
Any type that acts as some kind of function/closure that saves some values needed in the functions evaluation.
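A few concrete examples of the pattern (the first three are from Base; Scaler is just a made-up stand-in for any hand-rolled callable struct):
double      = Base.Fix1(*, 2)     # fixes the first argument
is_answer   = Base.Fix2(==, 42)   # fixes the second argument
seen_before = isequal(3)          # curried isequal; a Base.Fix2 under the hood

struct Scaler{T}
    scale::T
end
(s::Scaler)(x) = s.scale * x
Scaler(3)(10)   # 30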
Where does that leave us?
So how bad is it? Probably not that bad. While my expensiveFirst example showed an almost 30-fold slowdown, it is pretty contrived, and any real-world application would likely not suffer nearly as large a performance penalty, if any. Still, I think for those who really want to optimize their code down to the last few instructions, this might be valuable information, so they might consider avoiding too heavy usage of such structs.
As mentioned, I do not think that a solution to this exists, so I do not expect one. But hopefully someone found this interesting, or entertaining, or learned something from it. And who knows, maybe someone out there turns out to be more creative than me.