Dear all,
I am experimenting a bit with stream processing. For this purpose, Julia has a beautifully designed system of channels and tasks. However, tasks appear rather slow, and very slow when the channels are not comfortably buffered. Therefore, I am looking for alternatives. Following the Performance Tips, I tried to write code that allocates only a fixed amount of memory (independent of the size of the input stream). That should also let it run faster. Unfortunately, I was not entirely successful. To my surprise, however, I noticed that the mere size (not the complexity) of the code can decide whether extensive allocation happens or not.
When we have just one input stream and one output stream with a simple chain of processing steps in between, one possible solution is continuation-passing style. A processing function accepts two arguments, its input and its output. The output is a function accepting a single argument, the output token. The input is also a function; it accepts a function that in turn accepts a single argument, the input token. Whenever the input function f, called with function g as its argument, produces a token, it calls g with that token.
This might not be the best solution, but it will help demonstrate the problem I want to expose here.
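To make the convention concrete, here is a minimal toy sketch of such a chain. The names `numbers`, `doubled` and `collected` are made up for illustration and are not part of the Base64 code in this post.

```julia
# A source in CPS: it takes a continuation k and calls it once per token.
numbers(k) = foreach(k, 1:5)

# A processing step in CPS: `input` is a function expecting a continuation,
# `output` is a function expecting a single token.
function doubled(input, output)
    input(x -> output(2x))
    return nothing
end

# A sink that collects the tokens so the result can be inspected.
collected = Int[]
doubled(numbers, x -> push!(collected, x))
```

After the call, `collected` holds the doubled tokens in the order in which the source produced them. Nothing is buffered between the steps: each token travels through the whole chain before the next one is produced.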
As an example, I wanted to reproduce Base64 decoding (which is otherwise part of the standard library). Below are the main functions, which are a bit more involved than necessary. This keeps the compiler busier and causes undesirable memory allocation:
const char_A::UInt8 = UInt8('A')
const char_Z::UInt8 = UInt8('Z')
const char_a::UInt8 = UInt8('a')
const char_z::UInt8 = UInt8('z')
const char_0::UInt8 = UInt8('0')
const char_9::UInt8 = UInt8('9')
const char_plus::UInt8 = UInt8('+')
const char_slash::UInt8 = UInt8('/')
"decode a Base64 character to a sextet of bits"
decode_char(char::UInt8)::UInt8 =
    char_A <= char <= char_Z ? char - char_A :
    char_a <= char <= char_z ? char - char_a + char_Z - char_A + 0x01 :
    char_0 <= char <= char_9 ? char - char_0 + char_z - char_a + char_Z - char_A + 0x02 :
    char == char_plus ? 0x3e :
    char == char_slash ? 0x3f :
    0x00 # error("Wrong character `$(Char(char))'!")
"""
Pack a sequence of sextets of bits into a sequence of octets in a
continuation-passing style (CPS): each octet is given as an argument to
the output function. Only complete octets are forwarded.
The input is also implemented in CPS: it is a function which is given a
continuation function as the only argument.
"""
function sextets2octets(input::Function, output::Function)::Nothing
    captured::UInt16 = UInt16(0) # captured bits aligned left, first captured left-most
    bits::UInt8 = 0 # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end
function chars_stateful2octets(chars, decoder::Function, output::Function)
    for char::UInt8 in Iterators.Stateful(chars)
        output(decoder(char))
    end
end
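As a quick sanity check of the packing logic, the following self-contained sketch decodes a short unpadded string and compares the result with the standard library. It is a condensed restatement, not the code measured below: `sextet_demo`, `pack_demo` and `decode_demo` are hypothetical names, and the sink collects the octets instead of discarding them.

```julia
using Base64  # standard library, used here only as a reference decoder

# Condensed copy of decode_char with the offsets written out (26 and 52).
sextet_demo(c::UInt8)::UInt8 =
    UInt8('A') <= c <= UInt8('Z') ? c - UInt8('A') :
    UInt8('a') <= c <= UInt8('z') ? c - UInt8('a') + 0x1a :
    UInt8('0') <= c <= UInt8('9') ? c - UInt8('0') + 0x34 :
    c == UInt8('+') ? 0x3e :
    c == UInt8('/') ? 0x3f : 0x00

# Same structure as sextets2octets: the state lives in captured variables.
function pack_demo(input::Function, output::Function)::Nothing
    captured::UInt16 = 0  # captured bits, aligned left
    bits::UInt8 = 0       # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits += 6
        captured |= UInt16(sextet) << (16 - bits)
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end

# Decode an unpadded Base64 string into a vector of octets.
function decode_demo(text::String)
    octets = UInt8[]
    source(k) = foreach(c -> k(sextet_demo(c)), codeunits(text))
    pack_demo(source, o -> push!(octets, o))
    return octets
end

decode_demo("TWFu") == base64decode("TWFu")
```

The sketch does not handle `=` padding, so it is only meant for inputs whose length is a multiple of four characters.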
Now take 64k of base64 characters and combine both main functions, finally discarding the output:
using Downloads
io1 = IOBuffer()
Downloads.download("https://github.com/Martin-exp-z2/2025_09_22-Allocation_test/raw/refs/heads/main/chars1.txt", io1)
seekstart(io1)
chars1::Vector{UInt8} = read(io1)
close(io1)
chars1_decoded_stateful2octets(con) =
    chars_stateful2octets(chars1, decode_char, con)
discard(::UInt8) = nothing
As mentioned above, this gives rise to memory allocation:
julia> @time sextets2octets(chars1_decoded_stateful2octets, discard)
0.001060 seconds (92.28 k allocations: 1.408 MiB)
julia> @time sextets2octets(chars1_decoded_stateful2octets, discard)
0.001051 seconds (92.28 k allocations: 1.408 MiB)
Now replace decode_char(char::UInt8) with a function that is merely shorter (and, incidentally, does not decode correctly, though that does not matter here):
decode_char_wrong(char::UInt8)::UInt8 =
    char_0 <= char <= char_z ? char - char_A :
    char == char_plus ? 0x3e :
    char == char_slash ? 0x3f :
    0x00
chars1_decoded_wrong_stateful2octets(con) =
    chars_stateful2octets(chars1, decode_char_wrong, con)
Now the whole process no longer allocates and is about 20 times faster:
julia> @time sextets2octets(chars1_decoded_wrong_stateful2octets, discard)
0.000054 seconds
julia> @time sextets2octets(chars1_decoded_wrong_stateful2octets, discard)
0.000053 seconds
Could anybody explain what is going on? Have I overlooked something? I am running Julia 1.11.7 on macOS Sequoia 15.6.1 on a MacBook Pro with an Apple M4 Max.
Thanks a lot in advance!