Dear all,
While experimenting with reproducing the behavior of the standard Base64 library in several ways - see also
Allocation Behavior Depending on Code Size
- I noticed another rather unusual performance phenomenon. A certain modification of the code, which has only a slight effect on performance when called in the usual way, slows the computation down substantially when used in a child task produced by a call to Channel().
Below are the main functions doing the job - decoding, packing and saving.
"""
Decodes a single Base64 char to a sextet
"""
decode_char(char::UInt8) =
0x41 <= char <= 0x5a ? char - 0x41 :
0x61 <= char <= 0x7a ? char - 0x47 :
0x30 <= char <= 0x39 ? char + 0x04 :
char == 0x2b ? 0x3e :
char == 0x2f ? 0x3f :
error("Sorry, wrong Base64 character $char.")
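As a quick sanity check, the mapping sends 'A'–'Z' to 0x00–0x19, 'a'–'z' to 0x1a–0x33, '0'–'9' to 0x34–0x3d, and '+', '/' to 0x3e, 0x3f:
julia> decode_char(UInt8('A')), decode_char(UInt8('z')), decode_char(UInt8('/'))
(0x00, 0x33, 0x3f)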
"""
Packs a sequence of sextets into a sequence of octets.
For a function g, input(g) passes the sequence to g by calling it repeatedly,
one input token at a time. Similarly, the resulting octets are passed to the
output function one by one.
"""
function sextets2octets(input::Function, output::Function)
# Refs keep the captured state concretely typed; plain local variables
# mutated from the inner closure would be boxed (Core.Box) and could
# cause extra heap allocation.
captured = Ref(0 % UInt16) # captured bits aligned left, first captured left-most
bits = Ref(0) # number of bits captured
input() do sextet::UInt8
bits[] += 6
captured[] |= (UInt16(sextet) << (16 - bits[]))
if bits[] >= 8
output(UInt8(captured[] >> 8))
captured[] <<= 8
bits[] -= 8
end
end
return nothing
end
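As a minimal illustration of the calling convention, here is the classic example "TWFu" (the Base64 encoding of "Man"), with throwaway input and output closures:
julia> sextets = decode_char.(codeunits("TWFu"));

julia> out = UInt8[];

julia> sextets2octets(g -> foreach(g, sextets), x -> push!(out, x))

julia> String(out)
"Man"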
"""
Suggested by @sgaure, thank you once again!
"""
function savethem(size)
v = UInt8[]
sizehint!(v, size)
x -> isnothing(x) ? v : push!(v, x)
end
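Calling the returned closure with an octet appends it; calling it with nothing yields the accumulated vector:
julia> acc = savethem(3);

julia> acc(0x4d); acc(0x61); acc(0x6e);

julia> acc(nothing)
3-element Vector{UInt8}:
 0x4d
 0x61
 0x6e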
Now load the data.
using Downloads
using BenchmarkTools
io1 = IOBuffer()
# A sequence of 65536 randomly generated Base64 characters
Downloads.download("https://github.com/Martin-exp-z2/2025_09_22-Allocation_test/raw/refs/heads/main/chars1.txt", io1)
seekstart(io1)
chars1::Vector{UInt8} = read(io1)
close(io1)
size1_octets = 3*length(chars1)÷4
chars1_input(output::Function) = foreach(output, chars1)
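Since the data contains no padding characters, a quick sanity check that the character count is indeed a multiple of four:
@assert length(chars1) % 4 == 0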
This seems to be the most straightforward way to do the job using the above functions.
function decode_directly1(input::Function, output::Function)
sextets2octets(output) do middle::Function
input(middle ∘ decode_char)
end
return output(nothing)
end
Its running time is quite reasonable:
julia> octets1 = decode_directly1(chars1_input, savethem(size1_octets));
julia> @btime decode_directly1(chars1_input, savethem(size1_octets));
179.709 μs (6 allocations: 64.11 KiB)
Now use an anonymous function instead of the composition operator:
function decode_directly2(input::Function, output::Function)
sextets2octets(output) do middle::Function
input() do char::UInt8
middle(decode_char(char))
end
end
return output(nothing)
end
The effect of this change on performance is almost negligible.
julia> octets1_new = decode_directly2(chars1_input, savethem(size1_octets));
julia> @assert octets1_new == octets1
julia> @btime decode_directly2(chars1_input, savethem(size1_octets));
181.041 μs (6 allocations: 64.11 KiB)
Now make the implementation more complicated by wrapping the calculation in a task
and using a channel to control when the task runs. Notice that only one token
is sent through the channel.
function decode_channel1(input::Function, output::Function)
intermediate = Ref{Function}()
ch = Channel{Nothing}() do c
put!(c, nothing) # blocks until the parent takes the token, i.e. after intermediate[] is assigned
input(intermediate[] ∘ decode_char)
end
sextets2octets(output) do middle::Function
intermediate[] = middle
for _ in ch; end
end
return output(nothing)
end
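For readers unfamiliar with this trick: the channel is unbuffered, so put! in the child task blocks until the for loop performs the matching take!, which happens only after intermediate[] has been assigned; the loop then keeps iterating until the child finishes and the channel closes. A stripped-down sketch of just this handshake:
function handshake_demo()
    ch = Channel{Nothing}() do c
        put!(c, nothing)  # blocks until the parent takes the token
        println("child resumed")
    end                   # channel closes when the child task returns
    for _ in ch; end      # take the token, then wait for the close
    return nothing
end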
The cost of this extra machinery is also quite reasonable.
julia> octets1_new = decode_channel1(chars1_input, savethem(size1_octets));
julia> @assert octets1_new == octets1
julia> @btime decode_channel1(chars1_input, savethem(size1_octets));
292.125 μs (28 allocations: 65.23 KiB)
Now replace the composition operator by an anonymous function - just as above.
function decode_channel2(input::Function, output::Function)
intermediate = Ref{Function}()
ch = Channel{Nothing}() do c
put!(c, nothing) # blocks until the parent takes the token, i.e. after intermediate[] is assigned
input() do char::UInt8
intermediate[](decode_char(char))
end
end
sextets2octets(output) do middle::Function
intermediate[] = middle
for _ in ch; end
end
return output(nothing)
end
Surprisingly, this change now slows the computation down by more than a factor of three!
julia> octets1_new = decode_channel2(chars1_input, savethem(size1_octets));
julia> @assert octets1_new == octets1
julia> @btime decode_channel2(chars1_input, savethem(size1_octets));
937.583 μs (27 allocations: 65.20 KiB)
I speculate that this may be due to some kind of competition between different
compiler optimization strategies, but there might also be a deeper reason,
one that might indicate room for improvement. Or have I missed something?
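A possible next step might be to compare type inference for the two variants and look for Any annotations hinting at dynamic dispatch in the hot path:
using InteractiveUtils
@code_warntype decode_channel1(chars1_input, savethem(size1_octets))
@code_warntype decode_channel2(chars1_input, savethem(size1_octets))
(Note that the closures handed to the channel task are compiled separately, so they would need their own inspection.)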
Thank you very much for your time. I will be glad if someone finds this
example useful!