Some notes on performance of callable structs

Hello everyone.

I want to showcase some performance characteristics of, and slight problems with, ‘homemade closures’. By that I mean a struct that carries some data and has a call method in which this data is used, something like Base.Fix1. In case you don’t know what that is, here is its (simplified) definition (from operators.jl):

struct Fix1{F,T} <: Function
    f::F
    x::T
end

(f::Fix1)(y) = f.f(f.x, y)
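
A quick usage illustration (nothing beyond the definition above; Base.Fix1 behaves the same way):

add_two = Base.Fix1(+, 2)   # stores the function + and the fixed first argument 2
add_two(40)                 # calls +(2, 40), i.e. returns 42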

I got sent down this rabbit hole after starting to toy with the package PartialFunctions. This package lets you create new functions from existing ones by partially applying some of their arguments. The core idea behind it is an operator f $ x, which creates a data structure very similar in functionality to Fix1(f, x), but with more bells and whistles for chaining, like f $ x $ y $ z. After the initial excitement of rewriting some code with partial function application, I noticed various performance regressions. I ran a lot of benchmarks and performance experiments, and ended up rewriting the entire thing (twice).

I managed to identify the problems in the end. Annoyingly, one of them is tied pretty closely to the innermost workings of the compiler, and as far as I can see there is no solution for it. I wanted to share my findings with the community anyway; maybe someone will find it interesting.

What’s the problem?

Have a look at the following code:

using BenchmarkTools


"""
the point of this function is to be very expensive in its first argument and
cheap in its second argument.
"""
expensiveFirst(x, y) = x^1.5 + 2y


# some arbitrary data
data = rand(1:100, 1_000)

@btime p1 = mapreduce(x -> expensiveFirst(1, x), +, $data)
# -> ca 107ns on my machine


@btime p2 = mapreduce(Base.Fix1(expensiveFirst, 1), +, $data)
# -> ca 3.3µs on my machine

x -> expensiveFirst(1, x) and Fix1(expensiveFirst, 1) do pretty much exactly the same thing, yet the second one is about 30 times slower. What happened here? Obviously, in the first case the compiler realized that the first argument will always be 1 and evaluated part of the function at compile time, so the expensive part of the computation (the x^1.5 part) does not need to happen 1000 times at runtime.

In the second case this does not happen: when the code is compiled, the type Fix1{typeof(expensiveFirst), Int} does not contain the information that the first argument will always be 1, whereas for the anonymous function the 1 is a literal baked right into the method body. So can we fix this somehow?
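
One way to see the difference, just inspecting the two objects from the benchmark above: the anonymous function carries no runtime data at all (the 1 is a literal in its body), while the Fix1 object stores the 1 in a field, so at compile time all that is known about it is ‘some Int64’.

lam = x -> expensiveFirst(1, x)
fix = Base.Fix1(expensiveFirst, 1)

Base.issingletontype(typeof(lam))   # true  - the type alone identifies the whole function
Base.issingletontype(typeof(fix))   # false - the fixed value lives in a field
fieldnames(typeof(fix))             # (:f, :x) - x is typed Int64, but its value is not in the type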

Well… kinda. Have a look at this slightly contrived type definition:

struct ValueFix1{F, X}
    f::F
end
(f::ValueFix1{F,X})(y) where {F,X} = f.f(X,y)
ValueFix1(f::F, x) where F = ValueFix1{F,x}(f)


@btime p2 = mapreduce(ValueFix1(expensiveFirst,1), +, $data)
# -> ca 107ns on my machine - just as fast as the native lambda expression

Instead of saving the 1 in a field of the struct, I now wrote it directly into the type parameters. This allows proper constant propagation again. Problem solved? Not quite.

Of course, the $ operator or Fix1 could be redefined to return something more like the ValueFix1 struct, which would allow better constant propagation when values are known at compile time, but it would bring everything to a screeching halt due to massive type instabilities when they are not. Not very helpful!
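
To make that downside concrete, here is a small sketch (make_fixer is just a hypothetical helper, not from any package): the concrete type of a ValueFix1 depends on the value it fixes, so if that value is only known at runtime, the result type cannot be inferred and each new value triggers fresh specializations downstream.

make_fixer(n) = ValueFix1(expensiveFirst, n)   # concrete return type depends on the *value* of n

make_fixer(1)   # ::ValueFix1{typeof(expensiveFirst), 1}
make_fixer(2)   # ::ValueFix1{typeof(expensiveFirst), 2} - a different concrete type!

# With a runtime value, inference only knows the abstract ValueFix1{typeof(expensiveFirst)}:
# @code_warntype make_fixer(rand(1:100))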

The only thing a package like PartialFunctions could do is define two different operators, one for compile-time-known constants and one for everything else, but it is a bit of a hard sell for a package that is supposed to beautify and enhance your syntax to simultaneously offload some of the compiler’s work onto the programmer.

Can this somehow be fixed?

I see no viable solution to this. The only ways this might get fixed, as far as I can see, are:

  1. The Julia compiler gets an upgrade to optimize much more aggressively around captured values.
  2. Operators, and functions in general, get a way to know whether their argument is a compiler-known constant (maybe via dispatch on Core.Const? This is probably antithetical to the whole optimization philosophy, though: compilers generally try to make only optimizations that change nothing except performance, and with this feature the mere act of attempting an optimization could change the output of the program).

None of these are on the horizon or a priority right now, at least afaik. I’m sure there is much more important stuff to do.

I also tried to force the compiler to inline the calls, but that did not seem to do anything either.

What is affected?

Any type that acts as some kind of function/closure and stores values needed in the function’s evaluation.

Where does that leave us?

So how bad is it? Probably not that bad. While my expensiveFirst example showed an almost 30-fold slowdown, it is pretty contrived, and any real-world application would likely not suffer nearly as large a performance penalty, if any. Still, I think for those who really want to optimize their code down to the last few instructions this might be valuable information, so they might consider avoiding too heavy a use of such structs.

As mentioned, I do not think that a solution to this exists, so I do not expect one. But hopefully someone found this interesting, or entertaining, or learned something from it. And who knows, maybe someone out there turns out to be more creative than me :wink:.

4 Likes

This isn’t capturing or closures; closures capture variables from local scopes, though they are implemented with callable instances of hidden structs that contain the captured values.

Probably so, if we still expect to compile per call signature instead of profiling and speculatively optimizing. In that case, 1 cannot be a compile-time constant if it’s not stored in the type of Base.Fix1; you need something like ValueFix1.

This wouldn’t be better with an anonymous function. In your example, x -> expensiveFirst(1, x) was only efficient because it inlines a constant literal 1. If that value actually varies, you’d need to either do x -> expensiveFirst(n, x), which at best is a closure capturing a local n that is implemented like Base.Fix1, or eval several separate functions that inline constants, which introduces recompilation and function type instability.

The type instability of the ValueFix1 constructor can be isolated with function barriers; some runtime type checks outside of the critical code won’t significantly hurt performance.
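
A rough sketch of that function-barrier idea (hot_kernel and driver are made-up names, reusing the ValueFix1 definition from the original post): do the value-dependent construction once, outside the hot code, and let a separate function be specialized for whatever concrete type comes out.

hot_kernel(f, data) = mapreduce(f, +, data)   # compiled once per concrete type of f

function driver(x, data)
    f = ValueFix1(expensiveFirst, x)   # type-unstable if x is only known at runtime
    return hot_kernel(f, data)         # a single dynamic dispatch here, then fast inside
end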

3 Likes

This is almost a repeat of this recent thread:

My answers there are somewhat relevant here, too.

I’d like to emphasize, though, that your perspective here seems wrong. There’s no “problem” here, nothing “bad”, and nothing to “fix”, although hypothetically there could of course be opportunities to improve some things. Julia is the leader in compile-time computation among programming languages, given that it allows moving values into the type domain, making Julia more powerful than even the great C++ (whose template meta-programming and other compile-time computation features apply strictly before run time).

Also, like in the linked thread, I think you may not realize that the limitations involved here are quite essential. We, as Julia programmers, want the Julia compiler to only get involved when we expect it to, which conflicts with your implicit expectation that the compiler would move values into the type domain of its own accord, without being prompted to do something by the programmer.

Taking a look at FastClosures.jl and its associated issues may give you some insight.

I’m not sure if the following solves the problem you highlighted but maybe the type instability issue is not actually too bad?

julia> function curry(f, x)
           function currier(f, x::Val{T}) where T
               let
                   (args...)->f(T, args...)
               end
           end
           if isbits(x)
               currier(f,Val(x))
           else
               let
                   (args...)->f(x, args...)
               end
           end
       end
curry (generic function with 1 method)

julia> const ⋌ = curry
curry (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime p1 = mapreduce(x -> expensiveFirst(1, x), +, $data)
  230.756 ns (0 allocations: 0 bytes)
100306.0

julia> @btime p3 = mapreduce(expensiveFirst ⋌ 1, +, $data)
  229.704 ns (0 allocations: 0 bytes)
100306.0

julia> (expensiveFirst ⋌ 1 ⋌ 2)()
5.0

help?>  ⋌
"⋌" can be typed by \rightthreetimes<tab>

julia> @btime p1 = mapreduce(x -> expensiveFirst(BigInt(1), x), +, $data)
  329.480 μs (11998 allocations: 546.77 KiB)
100306.0

julia> @btime p3 = mapreduce(expensiveFirst ⋌ BigInt(1), +, $data)
  262.448 μs (10000 allocations: 507.75 KiB)
100306.0

Type instabilities are unavoidable in some circumstances such as this, where you want to stay general while also taking on compile-time parameters. In this case, you want a specialized version of an arbitrary function to be compiled.

Type instabilities can be managed in Julia by dispatching at function barriers. It really depends on whether the extra compilation is worthwhile or not.

3 Likes

Maybe I should be a bit more concrete about the problem I was trying to solve. Julia has pretty good support for function pipelines or higher order functions that take other functions as arguments, like

arr2 = map(x -> foo(1, x), arr1)

or

# from my implementation of a circular buffer for an exercism exercise
Base.getindex(cb::CircularBuffer, i) =
   # wraps the index over the internal array
    checkedindex(cb,i) |>
    x -> getindex(cb.data, x)

These are already pretty nice, but there are a lot of attempts out there to make them even more ergonomic and concise, like Underscores.jl or the aforementioned PartialFunctions.jl, with which these could be rewritten as

arr2 = map(foo $ 1, arr1)

and

# from my implementation of a circular buffer for an exercism exercise
Base.getindex(cb::CircularBuffer, i) =
    checkedindex(cb,i) |>
    getindex $ cb.data

I personally really love this style of generating small new functions, and I am certain I am not alone in this, given the many attempts to retrofit such features onto Julia (afaik there has been a lengthy discussion about essentially making the __ syntax from Underscores.jl a native language feature). It is quite amazing that Julia allows you to just ‘add new syntax’ like this, and in many cases this can even happen completely without loss of performance, thanks to the way functions are integrated into the type system.

However, in cases where a constant literal is partially applied to a function, like in my map(x -> foo(1, x), arr1) example, the native arrow syntax produces compiled code in which the 1 is inlined (as mentioned by @Benny), while for map(foo $ 1, arr1) this does not happen, leading to reduced performance in some cases. I spent a bit of time thinking about how one could rewrite/improve the definition of the $ function to remove these last drawbacks of partial application compared to the native -> syntax, but so far I don’t think there is a way. By stuffing the 1 into a struct that acts as a function we ‘hide’ from the compiler the fact that it is really just a literal constant.

Yes, but that is the whole point. With a native anonymous function the compiler ‘sees’ which values are literals and which are variables, and from which scope they come. With these function structs every value is treated the same, no matter where it comes from. That this actually matters for performance can be seen from the benchmark I gave (you can replace Fix1(expensiveFirst, 1) with expensiveFirst $ 1; they are functionally the same).

@nsaiko: I skimmed over the linked thread. The topics are related, I think (constant propagation/folding; BTW, thanks to your comment I realized I should read up on these terms and understand the difference), but as I understood it its author did not get the answer he was looking for, which would reinforce my statement that no solution exists.
I read your explanations here, and I understand the difference between closed_function_without_run_time_dispatch and closed_function_with_run_time_dispatch; it is similar in concept to Fix1 vs ValueFix1. The reason this does not help me in my case is that when I define the function $(f, x), I can assume that the nature of f is known, but whether x is a compile-time-known literal (like in map(foo $ 1, arr1)) or a variable that can vary at runtime (like in checkedindex(cb,i) |> getindex $ cb.data) is not, so I must assume it is a variable and treat it accordingly. I have to admit I have never used the Base.@assume_effects macro. Not sure if that may help here.

I also want to emphasize that this thread is absolutely not a complaint. I would agree that the way the type system allows you to ‘talk’ to the compiler and make it evaluate stuff at compile time is amazing, and way ahead of any other programming language I have ever worked with (I have heard Zig has an interesting comptime mechanic?). I just happened to run across this problem of extending the syntax without increasing the runtime cost, could not solve it, and decided to share it.

1 Like

Yes, and my point is that there is nothing that can possibly be improved here, at least from a certain perspective:

Basically, you need to decide whether you’re:

  1. OK with the compiler getting involved at run time; in which case just put the relevant values into the type domain, which will cause a run-time dispatch; the compiler will get involved on demand, and the values that are now in the type domain become compile-time constants. This is the feature that separates Julia from competing programming languages, although it’s not something that one would want to use often, given that getting the compiler involved is expensive.

  2. Not OK with the compiler getting involved at run time; in which case just don’t do any run time dispatch (check with JET.jl if necessary). This way you’re basically limiting yourself to the power of a static language like C or Rust.

Arguably, you want a new syntax for partial function application, and accordingly a macro should do:

macro ($)(f, args...)
    :((moreargs...) -> $(esc(f))($(esc.(args)...), moreargs...))
end

Unfortunately, it needs some additional brackets, but it runs at the same speed as a native lambda:

julia> @btime mapreduce(x -> expensiveFirst(1, x), +, $data);
  1.569 μs (0 allocations: 0 bytes)

julia> @btime mapreduce((@$ expensiveFirst 1), +, $data);
  1.571 μs (0 allocations: 0 bytes)
3 Likes

Yeah, now I think I may have misinterpreted OP’s goals?

I think it is quite interesting that the ⋌ version yielded such a performance improvement when applied to the BigInt type :thinking:.
Your definition of curry becomes unstable if the value of x is unknown at compile time. As an example, check out these two equivalent and definitely not contrived functions:

#x -> 2^x % x 
# sorry, could not make up a better example.
foo1(itr ) = sum(itr) do x
        x |>
        ((^) ⋌ 2) |>
        flip(%) ⋌ x
    end

foo2(itr ) = sum(itr) do x
        x |>
        ((^) $ 2) |>
        flip(%) $ x
    end

@btime foo1($(rand(100) * 12))
@btime foo2($(rand(100) * 12))

The problem is that flip(%) ⋌ x is not type stable, since its type depends on the value of x, which is not known at compile time, while flip(%) $ x is type stable.

I had never heard of FastClosures.jl before. As I understand it, this used to be a workaround for earlier Julia versions and is no longer necessary?

New syntax for partial function application would be amazing imo, but this is a topic that has been discussed from all possible angles in other threads before, so I don’t necessarily want to start that discussion again.

Macros for this purpose already exist; my favorite implementation is from Underscores.jl. I personally prefer macro-less syntactic sugar, which is why I explored the PartialFunctions.jl package, since Underscores needs you to prefix everything with a macro annotation. Not the end of the world, but kind of inconvenient.

I think what I ultimately want is just to share my findings. I always thought that with clever enough method and type definitions I could essentially make the compiler generate code however I want, so when I stumbled upon this problem which I could not solve, I was a bit surprised. Apparently callable structs in which intermediate values are stored act as some form of ‘barrier’ for the compiler’s optimization routines, which is not what I would have expected.
This is also not really unique to PartialFunctions; that is just where I found this phenomenon and where it probably surfaces most often.

Rereading my own post, I have to agree that its purpose is kind of unclear. I personally do not really need absolute top-end performance; this whole excursion has been more of a hobby for me. I don’t really want an answer, but I spent quite a lot of time trying to fix this, and at the end of the day I wanted to at least document the results.
Maybe I should have flagged this as Offtopic instead of Performance?

See also this issue and discourse thread about the same observation:

1 Like

I may have missed something. Is there an implementation of $ available somewhere? I was not aware that $ was a valid binary operator symbol.

Also what is flip? Do you mean flip as in Haskell?

See the mentioned PartialFunctions.jl.

There’s no barrier, it’s just that in one case you have a constant and in the other case you don’t. In your initial post you seem to understand this distinction very well, so I’m not sure what’s still missing.

julia> fully_compiled(::Val{f}) where {f} = x -> f(x)
fully_compiled (generic function with 1 method)

julia> fully_compiled(f::F) where {F} = fully_compiled(Val(f))
fully_compiled (generic function with 2 methods)

julia> expensiveFirst(x, y) = x^1.5 + 2y
expensiveFirst (generic function with 1 method)

julia> fun(f::F, it) where {F} = mapreduce(f, +, it)
fun (generic function with 1 method)

julia> const f1 = x -> expensiveFirst(1, x)
#3 (generic function with 1 method)

julia> const f2 = Base.Fix1(expensiveFirst, 1)
(::Base.Fix1{typeof(expensiveFirst), Int64}) (generic function with 1 method)

julia> const f3 = fully_compiled(f2)
#1 (generic function with 1 method)

julia> const data = rand(1:100, 1_000);

julia> using BenchmarkTools

julia> @btime fun(f1, data) setup=(data = $data;)
  473.179 ns (0 allocations: 0 bytes)
100088.0

julia> @btime fun(f2, data) setup=(data = $data;)
  4.878 μs (0 allocations: 0 bytes)
100088.0

julia> @btime fun(f3, data) setup=(data = $data;)
  472.622 ns (0 allocations: 0 bytes)
100088.0

julia> Base.issingletontype.(typeof.((f1, f2, f3)))
(true, false, true)

Does this help? To clarify, this presents a way of making Base.Fix1(expensiveFirst, 1) equivalent to x -> expensiveFirst(1, x) from a performance perspective, by moving it into the type domain. The ramifications were already discussed quite a bit.

2 Likes

A big caveat is that not everything can be moved into the type domain.

julia> Val(zeros(5))
ERROR: TypeError: in Type, in parameter, expected Type, got a value of type Vector{Float64}
2 Likes

As thoroughly discussed in one of the linked topics.

A perk of a dynamic language is that you can compute const globals at runtime and embed them into method definitions, if the compiler won’t do it for you. You don’t even need a const global variable; you can interpolate the value directly into an eval/@eval-ed expression, though the const variable is a lot less messy if the expression for the value is a lot to type, and small expressions are often computed at compile time anyway.
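
A minimal sketch of that trick (lookup_table and weight are made-up names for illustration): compute the value once at runtime, then splice it into a method definition with @eval so it ends up as a literal constant in the compiled code.

const lookup_table = [x^1.5 for x in 1:100]    # computed at runtime, before the method below

# the value is evaluated now and spliced into the definition as a literal constant:
@eval weight(i) = $(sum(lookup_table)) / i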

This however doesn’t cover compile-time constants computed from argument types. In practice these are so cheap that the compiler can handle them, but if not, you’d need to contend with @generated functions and their many limitations.
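
And a tiny sketch of that last escape hatch (ncomponents is a hypothetical example, not from any package): inside a @generated function the body runs with only the argument types available, so a value derived from those types can be spliced into the returned expression as a constant.

@generated function ncomponents(x)
    n = fieldcount(x)    # here x is the argument's *type*, not its value
    return :($n)         # the count is baked into the compiled method as a literal
end

ncomponents((a = 1, b = 2.0))   # 2, computed at compile time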

1 Like