Performance Affected Significantly More in a Child Task

Dear all,

While experimenting with several ways of reproducing the behavior of the standard Base64 library (see also
Allocation Behavior Depending on Code Size),
I noticed another rather unusual performance phenomenon. A certain modification of the code, which has only a slight effect on performance when called in the usual way, slows the computation down substantially when used in a child task produced by a call to Channel().

Below are the main functions doing the job - decoding, packing and saving.

"""
Decodes a single Base64 char to a sextet
"""
decode_char(char::UInt8) =
    0x41 <= char <= 0x5a ? char - 0x41 :
    0x61 <= char <= 0x7a ? char - 0x47 :
    0x30 <= char <= 0x39 ? char + 0x04 :
    char == 0x2b ? 0x3e :
    char == 0x2f ? 0x3f :
    error("Sorry, wrong Base64 character $char .")

"""
Packs a sequence of sextets into a sequence of octets.
For a function g, input(g) passes the input sequence to g by calling it
repeatedly, one token at a time. Similarly, the resulting octets are passed
to the output function one at a time.
"""
function sextets2octets(input::Function, output::Function)
    # References are used in order to prevent possible heap allocation.
    captured = Ref(0 % UInt16)  # captured bits aligned left, first captured left-most
    bits = Ref(0)               # number of bits captured
    input() do sextet::UInt8
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
    end
    return nothing
end
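
As a small hand-checked illustration (my own example: the sextets 0x13, 0x16, 0x05, 0x2e are the decoded characters of "TWFu", the Base64 encoding of "Man"):

julia> out = UInt8[];

julia> sextets2octets(f -> foreach(f, UInt8[0x13, 0x16, 0x05, 0x2e]), x -> push!(out, x))

julia> String(out)
"Man"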

"""
Suggested by @sgaure , thank you once again!
"""
function savethem(size)
  v = UInt8[]
  sizehint!(v, size)
  x -> isnothing(x) ? v : push!(v, x)
end
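
A quick illustration of mine of how the closure behaves:

julia> store = savethem(3);

julia> foreach(store, b"Man");

julia> store(nothing)
3-element Vector{UInt8}:
 0x4d
 0x61
 0x6e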

Now load the data.

using Downloads
using BenchmarkTools

io1 = IOBuffer()
# A sequence of 65536 randomly generated Base64 characters
Downloads.download("https://github.com/Martin-exp-z2/2025_09_22-Allocation_test/raw/refs/heads/main/chars1.txt", io1)
seekstart(io1)
chars1::Vector{UInt8} = read(io1)
close(io1)

size1_octets = 3*length(chars1)÷4

chars1_input(output::Function) = foreach(output, chars1)

This seems to be the most straightforward way to do the job using the above functions.

function decode_directly1(input::Function, output::Function)
    # The do block becomes the first (input) argument of sextets2octets:
    # it receives the per-sextet consumer `middle` and feeds it each
    # decoded character.
    sextets2octets(output) do middle::Function
        input(middle ∘ decode_char)
    end
    return output(nothing)
end

It takes quite a reasonable time:

julia> octets1 = decode_directly1(chars1_input, savethem(size1_octets));

julia> @btime decode_directly1(chars1_input, savethem(size1_octets));
  179.709 μs (6 allocations: 64.11 KiB)

Now use an anonymous function instead of the composition operator:

function decode_directly2(input::Function, output::Function)
    sextets2octets(output) do middle::Function
        input() do char::UInt8
            middle(decode_char(char))
        end
    end
    return output(nothing)
end

The effect of this change on performance is almost negligible.

julia> octets1_new = decode_directly2(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_directly2(chars1_input, savethem(size1_octets));
  181.041 μs (6 allocations: 64.11 KiB)

Now make the implementation more complicated by wrapping the calculation in a task
and using a channel to control when the task runs. Notice that only one token
is ever sent through the channel.

function decode_channel1(input::Function, output::Function)
    intermediate = Ref{Function}()
    ch = Channel{Nothing}() do c
        put!(c, nothing)  # blocks until intermediate[] is assigned
        input(intermediate[] ∘ decode_char)
    end
    sextets2octets(output) do middle::Function
        intermediate[] = middle
        for _ in ch; end  # take the single token, then wait until the channel closes
    end
    return output(nothing)
end

The overhead this adds is also quite reasonable.

julia> octets1_new = decode_channel1(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_channel1(chars1_input, savethem(size1_octets));
  292.125 μs (28 allocations: 65.23 KiB)

Now replace the composition operator with an anonymous function, just as above.

function decode_channel2(input::Function, output::Function)
    intermediate = Ref{Function}()
    ch = Channel{Nothing}() do c
        put!(c, nothing)  # blocks until intermediate[] is assigned
        input() do char::UInt8
            intermediate[](decode_char(char))
        end
    end
    sextets2octets(output) do middle::Function
        intermediate[] = middle
        for _ in ch; end  # take the single token, then wait until the channel closes
    end
    return output(nothing)
end

Surprisingly, the same change now slows the computation down by more than a factor of three!

julia> octets1_new = decode_channel2(chars1_input, savethem(size1_octets));

julia> @assert octets1_new == octets1

julia> @btime decode_channel2(chars1_input, savethem(size1_octets));
  937.583 μs (27 allocations: 65.20 KiB)

I speculate that this could be due to some kind of competition between different
compiler optimization strategies, but there might also be a deeper reason,
one that might indicate room for improvement. Or have I missed something?

Thank you very much for your time. I will be glad if someone finds this
example useful!

This probably has to do with closure capture and the type instability that often comes with it. You can check with Cthulhu.jl or JET.jl to see whether there is a type instability there.
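
For example, JET's @report_opt reports runtime-dispatch sites (the exact output depends on the Julia and JET versions):

julia> using JET

julia> @report_opt decode_channel2(chars1_input, savethem(size1_octets))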

I think the relevant issue is this one: unnecessary boxing of variables that are not reassigned within a closure · Issue #56561 · JuliaLang/julia · GitHub

You can find many more examples of this occurring if you search for “closure capture bug” or “closure performance”.

It doesn’t look like the longstanding difficulty of inferring captured variables, because mraic1 is deliberately avoiding reassigning variables. What stands out to me between the first half and the second half of the examples is intermediate = Ref{Function}(). Accessing that can only be inferred by the compiler as a value of the abstract type Function; in other words, it’s not type stable. If you’re really swapping a bunch of different Julia functions in there, that’s actually a reasonable thing to deal with; it’s a dynamically typed language after all. A possible improvement would be having specific input and output types so you can wrap those functions in FunctionWrapper, but I don’t want to make assumptions before addressing the problem at hand.
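
A minimal sketch of the FunctionWrapper idea, under the (hypothetical) assumption that the intermediate function always takes a UInt8 and returns nothing:

using FunctionWrappers: FunctionWrapper

const Sink = FunctionWrapper{Nothing,Tuple{UInt8}}  # concrete signature UInt8 -> Nothing

intermediate = Ref{Sink}()            # concretely typed, unlike Ref{Function}()
intermediate[] = Sink(x -> (@show x; nothing))
intermediate[](0x41)                  # the call site no longer needs dynamic dispatch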

In the composition example, you access intermediate once, right before composing with decode_char. Despite the type instability, input is dispatched at runtime on the value of that composition, so its own call gets to be type stable; this technique for method-wise JIT compilation is called a function barrier.

In the anonymous function / do block example, input receives an anonymous function that accesses intermediate every time a character is decoded. In other words, the runtime-dispatch overhead was moved deeper and now happens many more times. It pays not to repeat the same work in a program; sometimes the compiler can’t get rid of it for you.
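
To make the contrast concrete, here is a toy illustration of mine of the same pattern (not the original code):

unstable = Ref{Function}(sin)

slow(xs) = map(x -> unstable[](x), xs)   # dereference + dynamic dispatch on every element
fast(xs) = (f = unstable[]; map(f, xs))  # one dereference; map then specializes on typeof(f)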

I agree with @Benny. In the slow example, input is called with a function that looks inside the Ref{Function} and does a dynamic dispatch on its contents every time it’s called. In the faster example, the lookup in the Ref{Function} is done only once, and input is called with its contents.

By the way, it is a bit overkill to use a Channel only for waiting until things are ready. This is what Base.Semaphore is for.
Something like:

sem = Base.Semaphore(1)
Base.acquire(sem)  # take it immediately
Threads.@spawn begin
    Base.acquire(sem)  # wait for the semaphore to be released
    # use intermediate[]
end
...
intermediate[] = something
Base.release(sem)
...
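
Spelled out for the example in this thread, an untested sketch might look like this:

function decode_semaphore(input::Function, output::Function)
    intermediate = Ref{Function}()
    sem = Base.Semaphore(1)
    Base.acquire(sem)              # take the semaphore immediately
    task = Threads.@spawn begin
        Base.acquire(sem)          # wait until intermediate[] is assigned
        input(intermediate[] ∘ decode_char)
    end
    sextets2octets(output) do middle::Function
        intermediate[] = middle
        Base.release(sem)          # let the spawned task proceed
        wait(task)                 # and wait for it to finish (rethrows on error)
    end
    return output(nothing)
end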

I didn’t look into this deeply, but here’s one difference:

decode_channel1 dereferences intermediate before the composition, so before the composed function is ever executed.

decode_channel2 only dereferences intermediate in the body of the closure, i.e. on every call.

Profiling, Cthulhu, etc. could tell you more.