Performance of closures

performance

#1

Hi

As closures are interesting constructs to build functions with state, I am trying to use them to build automatically finite-state machines. I compared the performance of 10000 calls to a closure with a hand-made version of the same code (based on the explanation in the docs: “A closure is simply a callable object with field names corresponding to captured variables.”).
Using the @time macro (Julia v0.6-dev) I get the following results:
Closure: 0.004185 seconds (20.01 k allocations: 312.750 KiB)
Hand-made version: 0.000055 seconds (5 allocations: 192 bytes)
Code is available in gist.
I was wondering what causes this big difference, so I tried code_warntype and I found out that the generated code is far more complicate than the hand-made version and this can explain the performance difference.
I have also made a not yet published package containing a macro that can generate automatically the hand-made code (I think the name of the package has to change to make it more meaningful;) ).
I prefer however to use the builtin closure functionality but the performance (runtime and memory usage) worries me. If a user macro can generate this more optimal code, the compiler can do the same (perhaps I am underestimating the number of corner cases that have to be handled but I have not yet found anything that can not be done with the macro).


#2

I believe you’re encountering this issue: https://github.com/JuliaLang/julia/issues/15276

I can reproduce the issue, but the @label and @gotos in your code made it difficult for me to follow, and actually obscure a minor bug in that your _state is repeatedly set to 0xff and then immediately set back to 0x01. A shorter example that reproduces the answer and the performance issue is:

function fibonnaci2(a::Float64=0.0, b::Float64=1.0) 
  _state = 0
  function fib()::Float64
    if _state == 0
      _state = 1
      return a
    elseif _state == 1
      a, b = b, a + b
      return a
    end
  end
end

Looking at code_warntype shows that _state, a, and b are being boxed, consistent with the issue I linked above:

Variables:
  #self#::#fibonnaci2
  a@_2::Float64
  b@_3::Float64
  _state::CORE.BOX
  fib::#fib#1
  a@_6::ANY
  b@_7::ANY

Body:
  begin
      b@_7::ANY = b@_3::Float64
      a@_6::ANY = a@_2::Float64
      a@_6::ANY = $(Expr(:new, :(Core.Box), :(a@_6::Float64)))
      b@_7::ANY = $(Expr(:new, :(Core.Box), :(b@_7::Float64)))
      _state::CORE.BOX = $(Expr(:new, :(Core.Box)))
      (Core.setfield!)(_state::CORE.BOX,:contents,0)::Int64 # line 4:
      fib::#fib#1 = $(Expr(:new, :(Main.#fib#1), :(a@_6::Core.Box), :(b@_7::Core.Box), :(_state)))
      return fib::#fib#1
  end::#fib#1
nothing

Your Fibonacci type is a good workaround, although again I’d suggest losing the @gotos and fixing the unnecessary 0xff state.


#3

Thanks for the answer. I think you’re right about the issue causing the performance degradation.
The use of @goto and @label is however deliberate. The code is created automatically by the macro @stateful you can find in the following package. A small example:

@stateful function fibonnaci()
  a = 0.0
  b = 1.0
  while true
    @yield return a
    a, b = b, a+b
  end
end

fib = fibonnaci()
for i in 1:10
  println(fib())
end

The “stateful” function fib resumes its execution the next time it is called after the @yield return statement. This is possible by transforming the function in a finite-state machine and doing a jump depending on the _state variable to the next line of the last called @yield return statement. _state==0xff means that a normal return has occurred and the “stateful” function will no longer yield new values (an Exception will be thrown, if called again). It is also possible to send values from the caller to the callee…
These are the functionalities I need to replace in SimJulia.jl the Task switching functions produce and consume which are deprecated in julia v0.6 and the Channel construct to synchronise Tasks is too slow for the problem I am trying to solve.
When the issue is solved, I really want to use a closure to implement the same behaviour. This would reduce the amount of “low level” macro code I have to maintain and limits the dependencies on specific julia versions (e.g. lambda_info vs code_info).
If there is enough interest, I will publish the package. The actual name has to change because it does not capture very well what It does. Perhaps “StatefulFunctions” is a better one.
Does someone have an idea when the issue https://github.com/JuliaLang/julia/issues/15276 will be solved?


#4

You can declare the types of your variables to address most of the performance issue you are seeing. In terms of naming, I think I’ve usually seen this transform described as an implementation of stackless coroutines.

Since you have control over the produce/consume parts of the API, have you tested the performance of using the internal Task API directly (basically copying the code in base, but without needing to care about the edge cases)? I can’t promise anything about stability of that API, although it generally has evolved very slowly. The following snippet implement a basic produce/consume (which, unlike base, will not deliver the return value of the function, since that’s where it gets most of it’s complication from).

function fibonnaci()
  lasttask = current_task()
  t = @task begin
    a = 0.0
    b = 1.0
    while true
      lasttask = yieldto(lasttask, a)
      a, b = b, a + b
    end
   end
   yieldto(t) # initialize the task
   return Producer(t)
end
immutable Producer
  t::Task
end
(p::Producer)() = yieldto(p.t, current_task())
# I probably should be using start/next/done here
# instead of callable, but #yolo

You should also be able to substitute the yieldto for notify and schedule(task); wait(cond) to achieve the identical result.


#5

Thanks for the feedback!
I started testing different ways to use Tasks after the deprecation of consume and produce. See gist. Results on Travis:
Julia v0.5:
yieldto
0.005688 seconds (10.01 k allocations: 157.828 KB)
schedule_and_wait
0.006251 seconds (10.01 k allocations: 158.016 KB)
produce
0.009413 seconds (20.01 k allocations: 314.719 KB)
statemachine
0.000669 seconds (5 allocations: 192 bytes)
Julia v0.6-dev:
yieldto
0.005238 seconds (10.01 k allocations: 157.875 KiB)
schedule_and_wait
0.005645 seconds (10.01 k allocations: 157.953 KiB)
channel
0.068332 seconds (60.03 k allocations: 1.835 MiB)
statemachine
0.000067 seconds (5 allocations: 192 bytes)

The difference in performance can be explained by the overhead of saving and restoring of the stack when doing a yieldto. The advantage of this approach is that you get “transparent” symmetric coroutines but it is quite slow and consumes a lot of memory.
What I have implemented is called a (stack-less) semi-coroutine. The @stateful function returns always to the caller and is less generic than a symmetric coroutine. The macro is also less permissive (e.g. we can not jump into a try - catch clause; I am already working on a work-around:) but performance-wise it is a big win.
My use-case is the creation of processes in an event-driven framework, so a semi-coroutine suffices and I can profit from the performance when running complex simulations. The documentation of SimJulia will be updated to detail the limitations.
In an ideal world I could use a closure to simplify the macro and do less of low-level stuff but for now I will monitor the boxing issue. I will also publish a package “StatefulFunctions” with a polished version of the code next week. All comments are welcome (names, api, code enhancements, …)


#6

You should be able to verify this by commenting out #define COPY_STACKS in options.h


#7

I compiled the latest julia v0.6-dev with and without #define COPY_STACKS and I got following results.
without #define COPY_STACKS:
yieldto
0.001367 seconds (10.11 k allocations: 1.168 MiB)
schedule_and_wait
0.001486 seconds (10.01 k allocations: 1.161 MiB)
channel
0.053305 seconds (60.03 k allocations: 2.841 MiB)
statemachine
0.000055 seconds (5 allocations: 192 bytes)

with #define COPY_STACKS:
yieldto
0.002546 seconds (10.11 k allocations: 164.750 KiB)
schedule_and_wait
0.002505 seconds (10.01 k allocations: 157.781 KiB)
channel
0.059846 seconds (60.03 k allocations: 1.835 MiB)
statemachine
0.000056 seconds (5 allocations: 192 bytes)

We can observe a trade-off between runtime and memory usage. But the results are still very different from the finite-state machine implementation. I suppose that there is no such thing as a free lunch: symmetric coroutines have a certain cost … and Channels with no buffer (to have a complete synchronisation between the Tasks) are very expensive.