Performance Affected Significantly More in a Child Task

Dear all,

While experimenting with reproducing the behavior of the standard Base64 library in several ways - see also
Allocation Behavior Depending on Code Size
I noticed another rather unusual performance behavior. A certain code modification, which has only a slight effect on performance when called in the usual way, slows down the computation substantially when used in a child task produced by a call to Channel().

Below are the main functions doing the job - decoding, packing and saving.

"""
Decodes a single Base64 char to a sextet
"""
decode_char(char::UInt8) =
    0x41 <= char <= 0x5a ? char - 0x41 :
    0x61 <= char <= 0x7a ? char - 0x47 :
    0x30 <= char <= 0x39 ? char + 0x04 :
    char == 0x2b ? 0x3e :
    char == 0x2f ? 0x3f :
    error("Sorry, wrong Base64 character $char .")

"""
Packs a sequence of sextets into a sequence of octets.
For a function g, input(g) passes the sequence to g by calling it repeatedly -
one input token by one. Similarly, the resulting octets are passed to the
output function one by one.
"""
function sextets2octets(input::Function, output::Function)
    # References are used in order to prevent possible heap allocation.
    captured = Ref(0 % UInt16)  # captured bits aligned left, first captured left-most
    bits = Ref(0)               # number of bits captured
    input() do sextet::UInt8
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
    end
    return nothing
end

"""
Suggested by @sgaure, thank you once again!
"""
function savethem(size)
  v = UInt8[]
  sizehint!(v, size)
  x -> isnothing(x) ? v : push!(v, x)
end
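Before benchmarking, a quick sanity check may help confirm the pipeline works as intended. The snippet below repeats the three definitions so it runs standalone, and decodes the 4 Base64 characters of "TWFu" into the 3 octets of "Man" (a hypothetical example, not part of the benchmark):

```julia
# Definitions repeated from above so this check runs standalone.
decode_char(char::UInt8) =
    0x41 <= char <= 0x5a ? char - 0x41 :
    0x61 <= char <= 0x7a ? char - 0x47 :
    0x30 <= char <= 0x39 ? char + 0x04 :
    char == 0x2b ? 0x3e :
    char == 0x2f ? 0x3f :
    error("Sorry, wrong Base64 character $char .")

function sextets2octets(input::Function, output::Function)
    captured = Ref(0 % UInt16)  # captured bits aligned left
    bits = Ref(0)               # number of bits captured
    input() do sextet::UInt8
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
    end
    return nothing
end

function savethem(size)
    v = UInt8[]
    sizehint!(v, size)
    x -> isnothing(x) ? v : push!(v, x)
end

# Decode "TWFu" -> "Man"
chars = codeunits("TWFu")
out = savethem(3)
sextets2octets(f -> foreach(c -> f(decode_char(c)), chars), out)
decoded = String(copy(out(nothing)))  # "Man"
```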

Now load the data.

using Downloads
using BenchmarkTools

io1 = IOBuffer()
# A sequence of 65536 randomly generated Base64 characters
Downloads.download("https://github.com/Martin-exp-z2/2025_09_22-Allocation_test/raw/refs/heads/main/chars1.txt", io1)
seekstart(io1)
chars1::Vector{UInt8} = read(io1)
close(io1)

size1_octets = 3*length(chars1)÷4

chars1_input(output::Function) = foreach(output, chars1)

This seems to be the most straightforward way to do the job using the above functions.

function decode_directly1(input::Function, output::Function)
    sextets2octets(output) do middle::Function
        input(middle ∘ decode_char)
    end
    return output(nothing)
end

It takes quite reasonable time:

julia> octets1 = decode_directly1(chars1_input, savethem(size1_octets));

julia> @btime decode_directly1(chars1_input, savethem(size1_octets));
  179.709 μs (6 allocations: 64.11 KiB)

Now use an anonymous function instead of the composition operator:

function decode_directly2(input::Function, output::Function)
    sextets2octets(output) do middle::Function
        input() do char::UInt8
            middle(decode_char(char))
        end
    end
    return output(nothing)
end

The effect of this change on performance is almost negligible.

julia> octets1_new = decode_directly2(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_directly2(chars1_input, savethem(size1_octets));
  181.041 μs (6 allocations: 64.11 KiB)

Now make the implementation more complicated, wrapping the calculation in a task
and using a channel to control the running of the task. Notice that only one token
is sent through the channel.

function decode_channel1(input::Function, output::Function)
    intermediate = Ref{Function}()
    ch = Channel{Nothing}() do c
        put!(c, nothing)  # blocks until intermediate[] is assigned
        input(intermediate[] ∘ decode_char)
    end
    sextets2octets(output) do middle::Function
      intermediate[] = middle
      for _ in ch; end
    end
    return output(nothing)
end

The cost of this overhead is also quite reasonable.

julia> octets1_new = decode_channel1(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_channel1(chars1_input, savethem(size1_octets));
  292.125 μs (28 allocations: 65.23 KiB)

Now replace the composition operator by an anonymous function - just as above.

function decode_channel2(input::Function, output::Function)
    intermediate = Ref{Function}()
    ch = Channel{Nothing}() do c
        put!(c, nothing)  # blocks until intermediate[] is assigned
        input() do char::UInt8
            intermediate[](decode_char(char))
        end
    end
    sextets2octets(output) do middle::Function
      intermediate[] = middle
      for _ in ch; end
    end
    return output(nothing)
end

Surprisingly, this change now slows down the computation by more than three times!

julia> octets1_new = decode_channel2(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_channel2(chars1_input, savethem(size1_octets));
  937.583 μs (27 allocations: 65.20 KiB)

I speculate that this may be due to a kind of competition between different
compiler optimization strategies, but there might also be a deeper reason,
one that might indicate room for improvement. Or have I missed something?

Thank you very much for your time. I will be glad if someone finds this
example useful!

This probably has to do with closure capturing & the type instability that often occurs with it. You can check with Cthulhu.jl or JET.jl to see if there is a type instability there.

I think the relevant issue is this one: unnecessary boxing of variables that are not reassigned within a closure ¡ Issue #56561 ¡ JuliaLang/julia ¡ GitHub

You can find many more examples of this occurring if you search for “closure capture bug” or “closure performance”.

It doesn’t look like the longstanding difficulty of inferring captured variables because mraic1 is deliberately avoiding reassigning variables. What stands out to me between the first half and the second half of the examples is intermediate = Ref{Function}(). Accessing that can only be inferred by the compiler as a value of an abstract type Function, in other words it’s not type stable. If you’re really swapping a bunch of different Julia functions there, that’s actually a reasonable thing to deal with; it’s a dynamically typed language after all. A possible improvement would be having specific input and output types so you can wrap those functions in FunctionWrapper, but I don’t want to make assumptions before addressing the problem at hand.

In the composition example, you access intermediate once right before composition with decode_char. Despite the type instability, input is dispatched at runtime over the value of that composition, so its own call gets to be type stable; this technique for method-wise JIT compilation is called a function barrier.
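As a rough, self-contained sketch of that function barrier (hypothetical names, not the thread's code): dereferencing the Ref{Function} once outside a helper lets the loop inside compile for the concrete function, while dereferencing it inside the loop pays dynamic dispatch on every iteration.

```julia
boxed = Ref{Function}(x -> x + 1)

# The helper receives a concretely typed function, so this loop is fast.
function run_loop(f::F, n) where {F}
    s = 0
    for i in 1:n
        s += f(i)
    end
    return s
end

barrier(r, n) = run_loop(r[], n)   # one dynamic lookup, then a fast loop

function no_barrier(r, n)          # dynamic dispatch on every iteration
    s = 0
    for i in 1:n
        s += r[](i)
    end
    return s
end

barrier(boxed, 10) == no_barrier(boxed, 10)  # true; only the speed differs
```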

In the anonymous function / do block example, input receives an anonymous function that accesses intermediate after every time decode_char is called. In other words, that runtime dispatch overhead was moved deeper and is happening many more times. It pays to not repeat the same work in a program, sometimes the compiler can’t get rid of it for you.

I agree with @Benny. In the slow example, input is called with a function which looks inside the Ref{Function} and does a dynamic dispatch on that every time it’s called. In the faster example, the lookup in the Ref{Function} is done only once, and input is called with its content.

Btw, it is a bit overkill to use the Channel only for waiting until things are ready. This is what Base.Semaphore is for.
Something like:

sem = Base.Semaphore(1)
Base.acquire(sem)  # take it immediately
Threads.@spawn begin
    Base.acquire(sem)  # wait for semaphore to be released
    # use intermediate[]
end
...
   intermediate[] = something
   Base.release(sem) 
...

Alternatively, an Event can be used:

ready = Threads.Event()
Threads.@spawn begin
    wait(ready)
    # use intermediate[]
end
...
   intermediate[] = something
   notify(ready)
...
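For completeness, here is a runnable version of the Event variant, with hypothetical stand-ins for the elided parts:

```julia
# The spawned task waits on the Event until intermediate[] is assigned.
# Names and the example function are hypothetical.
intermediate = Ref{Function}()
ready = Threads.Event()
result = Ref(0)
t = Threads.@spawn begin
    wait(ready)                    # blocks until notify(ready)
    result[] = intermediate[](20)
end
intermediate[] = x -> 2x + 2
notify(ready)
wait(t)
result[]  # 42
```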

Didn’t look into this deeply, but here’s one difference:

This dereferences intermediate before composition, so before the composed function is executed.

This only dereferences intermediate in the body of the closure.

Profiling, Cthulhu, etc. could tell you more.

Thank you for your prompt reply! First, great to learn about FunctionWrappers.jl. :smiley: Next, thank you for the explanation about accessing intermediate. I was not aware that f(g ∘ h) works differently from f() do x; g(h(x)); end, as the first call looks up the current value of g ∘ h, forwarding it to f as the argument, while the second call looks up g and h at each call of the function which has been given to f as the argument. Thus, the function composition operator is not just syntactic sugar. Please correct me if I am mistaken.

Clearly, optimization is possible when it is known that middle (respectively, intermediate[]) remains constant within the call of input(). However, this fact is much more difficult for the compiler to establish in the second case.

Thank you for the lesson!

Yes indeed, this is probably the most important part of @Benny’s more detailed explanation!

Well yes and no. It is a normal infix operator which means that g ∘ h is the same as ∘(g, h) where ∘ is just an ordinary name of an ordinary function.
So there is some level of syntactic sugar at play here (namely the use of an infix operator), but no special treatment that affects composition ∘ in particular.

I think it’s worth defining these things more precisely. Syntactic sugar is a feature of the core language to write a more convenient and equivalent form of another expression, e.g. A[i] means Base.getindex(A, i), and f() do x; x; end means f(x -> x). You can use Meta.@lower to see the equivalent lowered code of these pairs of expressions, but it’s worth noting that lowered code, compiled code, and what functions like Base.getindex or f do at execution are irrelevant to high-level syntax (high level with respect to abstraction). The language specifies syntactic sugar, and it currently doesn’t specify ∘ as syntactic sugar for anything. I doubt it ever will, because ∘ is a public and exported name in Base assigned to a generic function we can add methods to.
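A quick check of both claims, using only Base:

```julia
# ∘ is an ordinary function call returning a ComposedFunction value.
h = sin ∘ cos
h isa Base.ComposedFunction      # true
h(0.0) == sin(cos(0.0))          # true

# The do-block is sugar for an anonymous-function argument; compare:
#   Meta.@lower map(x -> x, [1])
#   Meta.@lower map([1]) do x; x; end
```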

You’ve grasped the reason for the performance discrepancy in your original example, but you’re misapplying it to a simplified one and saying some incorrect things.

  • f(g ∘ h) is 2 calls in 1 expression. The 1st call is g ∘ h, which takes the values assigned to g and h as inputs and returns a composite function. As you correctly stated, the 2nd call f(...) takes the composite function as an input.
  • The function defined by do x; g(h(x)); end does semantically access the values assigned to g and h at the time of each call, unlike the composite function, but whether that involves the actual work of a lookup at each call is uncertain. If g and h are const names that can’t be reassigned, then that lookup can be done at compile-time, letting the runtime call quickly access cached values. Obviously that optimization does not occur in your original example where you had a type-unstable intermediate[] call in place of g. A type-stable intermediate[] call would generally optimize away method lookup as well, though the different functions in that case would be very similar.
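To sketch that contrast concretely (hypothetical names): a Ref with a concrete element type lets the compiler resolve the call, while Ref{Function} forces runtime dispatch even though both produce the same result.

```julia
const stable   = Ref{typeof(sin)}(sin)  # concrete element type
const unstable = Ref{Function}(sin)     # abstract element type

call_stable(x) = stable[](x)      # return type inferred as Float64
call_unstable(x) = unstable[](x)  # inferred as Any, dispatched at runtime

call_stable(0.0) == call_unstable(0.0)  # true; only performance differs
```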

It’s worth noting that the latter approach isn’t wrong. If you need to change the value assigned to intermediate dynamically and execute it for each char, then it’d actually be the only correct one. It doesn’t appear that you’re doing that so far, but I don’t know what you intend.

Yes, this is what I actually intended to do. I only came across the function composition operator when I was trying to extract the issue that seemed to slow down my code. Noting the apparently strange relationship between the performance of the four ways to decode the given sequence, I considered my initial intention unimportant.

More importantly, did I understand correctly that FunctionWrappers.jl is the way to make intermediate[] type-stable? Keeping in mind that each function has its own type, which is a direct subtype of Function, right?

By the way, your detailed answers and criticism helped me not only to better understand how Julia works, but (implicitly) also to redesign the code: instead of accepting input and running it as a whole, a better design of sextets2octets would probably be to accept (in addition to output) a single sextet plus a mutable struct containing captured and bits. This allows the user to call it with a different output for each sextet, there is no need to use references, channels, events or semaphores (though, @sgaure, thank you for pointing out the latter two!), and performance should not be drastically affected.
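Roughly, the redesign I have in mind could look like this (a sketch with hypothetical names, mirroring the bit manipulation of sextets2octets above):

```julia
# Mutable state replaces the captured Refs; output can differ per call.
mutable struct SextetState
    captured::UInt16   # captured bits aligned left, first captured left-most
    bits::Int          # number of bits captured
end
SextetState() = SextetState(0x0000, 0)

function push_sextet!(st::SextetState, sextet::UInt8, output::Function)
    st.bits += 6
    st.captured |= UInt16(sextet) << (16 - st.bits)
    if st.bits >= 8
        output(UInt8(st.captured >> 8))
        st.captured <<= 8
        st.bits -= 8
    end
    return st
end

# Usage: the sextets 19, 22, 5, 46 ("TWFu") yield the octets of "Man".
st = SextetState()
v = UInt8[]
for s in (0x13, 0x16, 0x05, 0x2e)
    push_sextet!(st, s, x -> push!(v, x))
end
decoded = String(copy(v))  # "Man"
```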

Using the function composition operator prior to calling input won’t allow for dynamic changes to the component functions, you have to do it inside the do char::UInt8 block of the 4th and last example, based on the same logic as the timing of accessing intermediate[]. Hypothetically there could be a mutable function composition, but there wouldn’t be a nice way to do that mutation, and it’s not compatible with the approach of closures capturing intermediate.

To clarify some basics, it is true that named or anonymous functions will automatically get an internal type that subtypes Function. However, callable objects don’t always have types that subtype Function, and the types that do may instantiate multiple functions. In fact, closures are usually the functions that don’t have singleton types because you can instantiate multiple ones that captured different variables. From this angle, type stability would let the compiler recognize that the functions have the same structure and use the same set of methods; it’s not much of a stretch from handling a singleton function.
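A small demonstration of that point (hypothetical names):

```julia
# Two closures from the same definition share a type but are not singletons,
# while a named function like sin has a singleton type.
make_adder(n) = x -> x + n
a, b = make_adder(1), make_adder(2)

typeof(a) == typeof(b)             # true: one closure type, two instances
Base.issingletontype(typeof(a))    # false: the captured n differs
Base.issingletontype(typeof(sin))  # true
```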

Thing is, it’s not that common to apply a group of functions sharing a method set to various calls with often wildly different input/output types. It’s more common to apply a group of entirely unrelated functions to a call with particular input/output types. The third-party FunctionWrappers.jl has some serious limitations and isn’t actively developed, but it’s currently the primary tool for this. Its intent is to convert the inputs/outputs to the specified input/output types (which may not actually happen at runtime if the compiler spots redundancy) with an intermediate cached method call, dodging the typical method dispatch implementation. It has some overhead compared to static dispatch (especially with inlining optimizations), but it’s a lot faster than runtime dispatch, and the specified output type helps maintain type stability in the following code.
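A minimal sketch of what that could look like here, assuming the third-party FunctionWrappers.jl package is installed and that UInt8-in, UInt8-out is the intended signature:

```julia
using FunctionWrappers: FunctionWrapper

# A concretely typed slot: any UInt8 -> UInt8 function fits, and calls
# through it have a known UInt8 return type.
const IntermediateSlot = FunctionWrapper{UInt8, Tuple{UInt8}}
intermediate = Ref{IntermediateSlot}()

intermediate[] = IntermediateSlot(x -> x + 0x01)
intermediate[](0x02)  # 0x03, with an inferrable return type
```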

In your case, it doesn’t appear that you have any particular structure for the intermediate functions in mind, hence your choice of Ref{Function}(). However, intermediate[] does take decode_char(char::UInt8)::UInt8 as input and values are eventually saved to a Vector{UInt8}, so there’s an implication you intend intermediate[] calls to have known input and output types suited for FunctionWrappers.
