Immediately executing closure as function barrier

It’s common for code to do a bunch of allocations up front, that may or may not be type stable, and then call out to another function to do some actual work with those allocations. This ‘function barrier’ enables the kernel function to be type stable and ensures the compiler can optimise its work well.

However, as a matter of style, I quite dislike these kernel functions, especially when they are called only once and only by the parent function. Additionally, if there’s a bunch of state that needs to be passed to the kernel, this can get increasingly messy.

My goto at the moment is to create an immediately executing closure, for example:

function myfunc(args...)
    arr1 = allocatingcall()
    arr2 = ...

    # Create a closure and immediately execute
    (() -> begin
        for i in eachindex(arr1)
            for j in eachindex(arr2)
                # Some hotpath work
            end
        end
    end)()

    # Do any cleanup needed
    return result
end

I like this because I don’t have stray, one-time functions, and I don’t need to worry about explicitly passing all my state. As a downside, theres a small performance penalty which I assume (?) comes from closing over the function state?

My question: as a matter of style, what do people think of this? And is there already a macro or function which does this that might be a little more semantic?

1 Like

Why would you use that closure at all, in that example, instead of simply running the for loops?

If the construction of arr1 and arr2 is not type stable, that will propagate into the closure, since they are not passed as parameters. If you pass them as parameters, I would opt for a regular function.

1 Like

The closure in that function allows for a massive speedup - from about 90 seconds to ~3 seconds.

And it is indeed due to the type instability, as manually inserting type assertions to each of arr1, etc. also provides the same speedup (and allows for the loop to remain inside the function).

So I can’t speak to how closures are implemented in Julia, but in this case it definitely improves the speed.

Closures are implemented as instances of hidden global structs that hold the captured variables in their fields. The method part of the closure is a hidden global method belonging to and dispatching on that struct, something like function (_h::_Hidden)() ... _h.arr1 ... _h.arr2 .... If that seems odd, recall that a function is an instance of a singleton type, so besides the global name being attached to the function rather than its type, it’s the same mechanism for a type with callable instances. The documentation on Function-like objects describes this.

IIRC, and this would explain the type stability speedup you observed, the struct in your case would be parametric, something like struct _Hidden{T,S} arr1::T arr2::S end. So, different arr1 and arr2 types at runtime result in different closure types, and the hidden method is compiled specially for each type.

The downside of this approach is that the type inference fails if you reassign the captured variables, even once. It makes sense when you realize that it’s inherently infeasible to infer the type of a variable shared by 2 separate methods; even if the 1st called method does not change the type after the closure is instantiated, it cannot be compiled to assume the 2nd method does not reassign the variable a different type, unless the parser has proven reassignment is absent. Weirdly, the inference also fails even if the reassignment happens before the closure is instantiated. For example, try adding arr1 = arr1 before arr2; you’ll see Core.Box in the @code_warntype indicating the compiler gave up on inferring arr1. You can also see that the closure is 1 type parameter short, something like struct _Hidden{S} arr1::Any arr2::S end. The only straightforward way to fix this is explicit type annotations of the shared variables, much like how explicit types of callable instances would have annotated fields that cannot be reassigned different types.

My advice, pass arr1 and arr2 as arguments instead of captured variables; that way, reassignments don’t sabotage the barrier function’s inference. Don’t worry, such a nested method will work like a closure with 0 captured variables, a hidden global type with a hidden global method, not a repeatedly created function that requires compilation every time. A more minor suggestion is to use function () end syntax for multi-line anonymous functions because () -> begin end syntax isn’t shorter anymore.

2 Likes

I think you are in a uncertainty zone there, I don’t see from that code specifically why the closure would help. Maybe if you can manage to provide a MWE it would be clearer what is actually going on.

For example, here I tried to use the pattern you showed, and as you can see, it has no effect on the performance. It may be, then, that the example is not actually capturing what is going on in your actual code.

julia> function myfunc()
                  arr1 = Real[ rand() for _ in 1:1000 ]
                  arr2 = Real[ rand() for _ in 1:1000 ]
                  s = 0.0
                  (() -> begin
                      for i in eachindex(arr1)
                          for j in eachindex(arr2)
                              s += arr1[i]*arr2[j]
                          end
                      end
                   end)()
                   return s
               end
myfunc (generic function with 1 method)

julia> myfunc()
244056.41061885754

julia> @btime myfunc()
  24.490 ms (2002003 allocations: 30.56 MiB)
244639.28569109805

julia> function myfunc2()
                  arr1 = Real[ rand() for _ in 1:1000 ]
                  arr2 = Real[ rand() for _ in 1:1000 ]
                  s = 0.0
                  for i in eachindex(arr1)
                          for j in eachindex(arr2)
                              s += arr1[i]*arr2[j]
                          end
                    end
                   return s
               end
myfunc2 (generic function with 1 method)

julia> @btime myfunc2()
  24.965 ms (2002002 allocations: 30.56 MiB)
254795.52808703622

julia> function myfunc3()
                  arr1 = [ rand() for _ in 1:1000 ]
                  arr2 = [ rand() for _ in 1:1000 ]
                  s = 0.0
                  for i in eachindex(arr1)
                          for j in eachindex(arr2)
                              s += arr1[i]*arr2[j]
                          end
                    end
                   return s
               end
myfunc3 (generic function with 1 method)

julia> @btime myfunc3()
  873.934 μs (2 allocations: 15.88 KiB)
254500.32451752323

1 Like

arr1 and arr2 are type stable in both myfunc and myfunc2 as Vector{Real}, whereas s is not stable in either loop; this was an example where the function barrier could not add type stability at all, so it makes sense there’s not much change. I believe OP’s use case (and it would’ve helped if there was a runnable MWE to demonstrate a 30x speedup) involved arr1 and arr2 being unstable and having a cascading instability effect on the rest of the method, which would benefit from a function barrier dispatching on arr1 and arr2.

1 Like

Ok, sorry for the imprecision. It was not a good example. Here is a better example and, in this case, the closure does provide some speedup:

julia> function myfunc()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         s = zero(T)
                         (() -> begin
                             for i in eachindex(arr1)
                                 for j in eachindex(arr2)
                                     s += arr1[i]*arr2[j]
                                 end
                             end
                          end)()
                          return s
                      end
myfunc (generic function with 1 method)

julia> @btime myfunc()
  16.917 ms (2002009 allocations: 30.56 MiB)
250924.19f0

julia> function myfunc()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         s = zero(T)
                         
                             for i in eachindex(arr1)
                                 for j in eachindex(arr2)
                                     s += arr1[i]*arr2[j]
                                 end
                             end
                          
                         return s
                      end
myfunc (generic function with 1 method)

julia> @btime myfunc()
  54.250 ms (4980007 allocations: 76.00 MiB)
253851.98f0

it does not compare with an actual function barrier, though:

julia> function barrier(arr1, arr2)
           s = zero(eltype(arr1))
           for i in eachindex(arr1)
               for j in eachindex(arr2)
                   s += arr1[i]*arr2[j]
               end
           end
           return s
       end
barrier (generic function with 1 method)

julia> function myfunc_barrier()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         s = barrier(arr1, arr2) 
                         return s
                      end
myfunc_barrier (generic function with 1 method)

julia> @btime myfunc_barrier()
  999.826 μs (2008 allocations: 39.55 KiB)
260743.7f0

to be sincere, I don’t understand why the closure helps at all there.

1 Like

The difference is in the anonymous closure barrier version, s is captured and reassigned, so the compiler gives up on inferring it. In the global function barrier version, s is entirely local to the barrier function and can be inferred from the arguments arr1 and arr2. That would explain saving the 2x1000x1000 allocations, and I expect the remaining 2008 allocations occur in the random selection of T and creation of arr1 and arr2. I suppose you were simply so used to writing generic type stability into your code, you don’t even notice it anymore.

2 Likes

I think I have not expressed my doubt well. I understand why the “normal” function barrier works, because everything is type-stable within it (the barrier(...) function is completely type-stable).

What I don’t understand is what benefit the closure brings relative to just writing the loops directly into the type-unstable function, since in that pattern none of the type-unstable variables is passed to the closure as parameters, they are just captured, thus from what I see they continue to be type-unstable inside the closure.

Being more clear about what code I’m referring to:

With the closure:

julia> function myfunc()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         s = zero(T)
                         (() -> begin
                             for i in eachindex(arr1)
                                 for j in eachindex(arr2)
                                     s += arr1[i]*arr2[j]
                                 end
                             end
                          end)()
                          return s
                      end
myfunc (generic function with 1 method)

julia> @btime myfunc()
  16.917 ms (2002009 allocations: 30.56 MiB)
250924.19f0

Without the closure:

julia> function myfunc()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         s = zero(T)
                         
                             for i in eachindex(arr1)
                                 for j in eachindex(arr2)
                                     s += arr1[i]*arr2[j]
                                 end
                             end
                          
                         return s
                      end
myfunc (generic function with 1 method)

julia> @btime myfunc()
  54.250 ms (4980007 allocations: 76.00 MiB)
253851.98f0
1 Like

I didn’t summarize this explicitly, but your closure version and the barrier version are not equivalent. A closure equivalent to your barrier version is:

julia> function innerbarr()
                         T = rand([Float64, Float32])
                         arr1 = [ rand(T) for _ in 1:1000 ]
                         arr2 = [ rand(T) for _ in 1:1000 ]
                         #####
                         (() -> begin
                             s = zero(eltype(arr1)) #####
                             for i in eachindex(arr1)
                                 for j in eachindex(arr2)
                                     s += arr1[i]*arr2[j]
                                 end
                             end
                             return s ## ##
                          end)()
                          ## ##
                      end

I expect that when you run that on your machine, you get about the same performance as your barrier version. If you return the closure instead of calling it, you’ll find that it’s type stable (@code_warntype innerbarr_returned()() ) despite capturing the unstable arr1 and arr2, which seems to be your remaining doubt. As I’ve commented before, it’s simply that the closure is implemented by a callable instance of a hidden parametric type with a hidden method, one parameter for each captured variable that the parser could prove to have been assigned only once. You’re right that external arr1, arr2, and the anonymous closure are not inferrable, but like any other function barrier, the hidden method is dynamically dispatched by the runtime type of the closure to efficient specializations, something like (_h::_Hidden{Vector{Float64}, Vector{Float64}} )() or (_h::_Hidden{Vector{Float32}, Vector{Float32}} )() in the case of innerbarr.

As for why moving s fully into the closure or barrier function is so crucial, it’s because the parser can prove that s is reassigned, so the compiler cannot infer a captured s in a closure, which will lack a parameter for it.

3 Likes

Wow, thanks for such thorough replies and test cases that appeared overnight. I’ve certainly learnt a lot about how closures are implemented in Julia, thanks especially to @Benny for their clear explanations.

So I guess the takeaway is: from a performance standpoint, a closure can work just as effectively as a function barrier for handling type unstable code, so long as consideration is paid to variable reassignment.

As for whether this is good style is an entirely different question. And at least in the particular instance that drove this discussion in the first place, I was able to provide (verbose) type assertions on the type unstable calls and leave the hot loops local to the function.

4 Likes

Also of note is this Julia issue that might be interesting regarding the performance of captured variables in closures: performance of captured variables in closures · Issue #15276 · JuliaLang/julia (github.com)

1 Like

I also don’t like the kernel function pattern much. However I prefer it over potentially running into the captured variable bug. While that bug is rare in practice, you always have to keep it in mind, so it just adds needless friction to the dev process.

I prefer to put barrier functions in the global scope because I might reuse them in other methods, but I do see the readability of nesting it if I know I won’t reuse it. To avoid uninferrability when variables are unannotated and reassigned, I would still limit captured variables, whether by 1) making some fully local to the nested method, 2) passing some in as arguments, or 3) passing some out as returned values.

2 Likes

Thank you very much @Benny for your insights.

Yet, there is something that goes against what I was used to see and to be suggested. The examples above seem to imply that the closure correctly infers the types of the arrays, even if they are not parameters of the closure.

This does not happen, for example, in this simple case, where the closure is written in global scope:

julia> solver(f, x) = f(x)
solver (generic function with 1 method)

julia> f(x,a,b) = (a' * b) * x
f (generic function with 1 method)

julia> g(x) = f(x,a,b) # the closure
g (generic function with 1 method)

julia> a = rand(3); b = rand(3); # non-constant global parameters

julia> @btime solver($g, 1.0) # this allocates
  31.289 ns (2 allocations: 32 bytes)
0.3231943848334305

julia> function solver_barrier(f, x, a, b)
           solver(x -> f(x,a,b), x)
       end
solver_barrier (generic function with 1 method)

julia> @btime solver_barrier($f, 1.0, $a, $b) # this does not allocate
  10.356 ns (0 allocations: 0 bytes)
0.3231943848334305

julia> @code_warntype solver(g, 1.0)
MethodInstance for solver(::typeof(g), ::Float64)
  from solver(f, x) in Main at REPL[1]:1
Arguments
  #self#::Core.Const(solver)
  f::Core.Const(g)
  x::Float64
Body::Any
1 ─ %1 = (f)(x)::Any
└──      return %1

In our examples, arr1 and arr2 where intrinsically type-unstable inside the scope of the function where the closure was defined. So what is the difference here? Why the compiler does not do the same for this globally scoped closure with non-constant parameters?

I was with the impression that the closures where implemented somewhat like this:

julia> struct Closure{T1,T2}
           a::T1
           b::T2
       end

julia> (c::Closure)(x) = (c.a' * c.b) * x

julia> const c = Closure(a,b)
Closure{Vector{Float64}, Vector{Float64}}([0.010136292594841279, 0.3915207261659518, 0.2766994600429509], [0.42788944246939575, 0.646503420278015, 0.2375779253782898])

julia> @btime solver($c, 1.0) 
  10.061 ns (0 allocations: 0 bytes)
0.3231943848334305

AFAIK, this can’t be the case, since that would imply a semantic difference relative to g(x) = f(x,a,b), since in this last case a and b can change, where in the callable struct it can’t. So how a closure is handled is actually dependent on the scope and context where it is defined.

Actually I understand now, I think, what was going on in the previous examples: Even if arr1 and arr2 are type-unstable, the compiler can guarantee that their types do not change while a particular instance the closure is being executed, and thus it can construct it in the form above for each set of types of arr1 and arr2, while the compiler cannot have the same guarantees if the closures is defined in global scope.

At the same time, the fact that it does not infer s seems to me just an inference bug, or failure, as the guarantees of its type are clearly deducible from the types of arr1 and arr2.

In summary, I find the use of closures still quite tricky when critical performance is on the table, and my use of them are very restricted, particularly for cases like that of the solver_barrier case, where we are sure that everything is strictly type stable in the scope where the closure is defined.

Looks similar to Outlining of expression bodies into own function · Issue #21925 · JuliaLang/julia · GitHub.

3 Likes

Because the compiler can prove if a local variable is never reassigned within its local scopes. The same cannot be said in the global scope because you can interactively reassign a non-const global variable at will. If you const the variable so it cannot be reassigned, then it’s inferred just as well.

In a sense, but it’s a difficult failure to fix (issue #15276). Currently, any method is compiled upon a function call that provides the runtime input types for inference. Since methods are called one at a time, they cannot be compiled together. So, a variable shared by >=2 methods is almost impossible to infer; when you compile a method, you cannot assume that the shared variable’s type is unchanged by the other not-yet-compiled methods. The best the compiler can do is check if the shared variable is not reassigned among the methods or if it had been explicitly annotated (just like how an explicit function-like object would be type-stable). arr1 and arr2 were not reassigned anywhere, so the closure’s hidden fields for them could be a fixed type determined at runtime. However, s was reassigned in the closure. The closure’s hidden instance has to be instantiated before its hidden method is called and inferred, so s is put in a Core.Box field of the instance, which then prevents s from being inferred in the hidden method when it is called.

The way I see it, we really want the compiler to automatically put in function barriers so we don’t have to micromanage unstable or reassigned variables ourselves; recall how it took a few tries to put the function barrier in the right place to not divide a reassigned s across two methods. Thing is, the more unstable and reassigned variables like s, the less clear where and how many function barriers should be placed. Adding a barrier for each variable would really complicate and bloat compilation. Manual redesigns of type stability and function barriers is crucial for simplicity for now.