Allocation Behavior Depending on Code Size

Dear all,

I am experimenting a bit with stream processing. For this, Julia has a beautifully designed system of channels and tasks. However, these appear rather slow, even very slow when the channels are not comfortably buffered. Therefore, I am looking for alternatives. Following the Performance Tips, I tried to write code which only allocates a fixed amount of space (independent of the size of the input stream); this should also allow it to run faster. Unfortunately, I was not entirely successful. However, to my surprise, I noticed that just the size (not the complexity) of the code can decide whether extensive allocation will happen or not.

In the case where we have just one input and one output stream with a simple chain of processes in between, one possible solution is continuation-passing style (CPS). A processing function accepts two arguments, its input and its output. The output is implemented as a function accepting a single argument, the output token. The input is also a function: it accepts a function which itself accepts a single argument, the input token. Whenever the input function f, called with function g as its argument, produces a token, it runs g on it.

This might not be the best solution, but it will help demonstrate the problem I want to present here.
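To make the protocol concrete, here is a minimal sketch with made-up names (produce_123, double, and print_token are not part of the example below); it wires one input, one processing stage, and one output together in this style:

# Minimal sketch of the CPS protocol described above (names are made up).
produce_123(consume::Function) = foreach(consume, (0x01, 0x02, 0x03))  # input: calls `consume` once per token

function double(input::Function, output::Function)   # processing stage: input in, output out
    input(token -> output(0x02 * token))              # for every input token, emit one output token
    return nothing
end

print_token(token::UInt8) = println(token)            # output: receives one token per call

double(produce_123, print_token)                      # prints 2, 4, 6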

As an example, I wanted to reproduce the decoding of base64 (which is otherwise a part of the Standard Library). Below are the main functions, which are a bit more involved than necessary. This keeps the compiler busier and causes undesirable memory allocation:

const char_A::UInt8 = UInt8('A')
const char_Z::UInt8 = UInt8('Z')
const char_a::UInt8 = UInt8('a')
const char_z::UInt8 = UInt8('z')
const char_0::UInt8 = UInt8('0')
const char_9::UInt8 = UInt8('9')
const char_plus::UInt8 = UInt8('+')
const char_slash::UInt8 = UInt8('/')

"decode a Base64 character to a sextet of bits"
decode_char(char::UInt8)::UInt8 =
    char_A <= char <= char_Z ? char - char_A :
    char_a <= char <= char_z ? char - char_a + char_Z - char_A + 0x01 :
    char_0 <= char <= char_9 ? char - char_0 + char_z - char_a + char_Z - char_A + 0x02 :
    char == char_plus ? 0x3e :
    char == char_slash ? 0x3f :
    0x00  # error("Wrong character `$(Char(char))'!")

"""
Pack a sequence of sextets of bits into a sequence of octets in a
continuation-passing style (CPS): each octet is given as an argument to
the output function. Only complete octets are forwarded.

The input is also implemented in CPS: it is a function which is given a
continuation function as the only argument.
"""
function sextets2octets(input::Function, output::Function)::Nothing
    captured::UInt16 = UInt16(0)  # captured bits aligned left, first captured left-most
    bits::UInt8 = 0               # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end

function chars_stateful2octets(chars, decoder::Function, output::Function)
    for char::UInt8 in Iterators.Stateful(chars)
        output(decoder(char))
    end
end

Now take 64k of base64 characters and combine both main functions, eventually discarding the output:

using Downloads

io1 = IOBuffer()
Downloads.download("https://github.com/Martin-exp-z2/2025_09_22-Allocation_test/raw/refs/heads/main/chars1.txt", io1)
seekstart(io1)
chars1::Vector{UInt8} = read(io1)
close(io1)

chars1_decoded_stateful2octets(con) =
    chars_stateful2octets(chars1, decode_char, con)

discard(::UInt8) = nothing

As mentioned above, this gives rise to memory allocation:

julia> @time sextets2octets(chars1_decoded_stateful2octets, discard)
  0.001060 seconds (92.28 k allocations: 1.408 MiB)

julia> @time sextets2octets(chars1_decoded_stateful2octets, discard)
  0.001051 seconds (92.28 k allocations: 1.408 MiB)

Now replace decode_char(char::UInt8) with a function which is just shorter (and, by the way, does not do the job properly, though this does not matter):

decode_char_wrong(char::UInt8)::UInt8 =
    char_0 <= char <= char_z ? char - char_A :
    char == char_plus ? 0x3e :
    char == char_slash ? 0x3f :
    0x00

chars1_decoded_wrong_stateful2octets(con) =
    chars_stateful2octets(chars1, decode_char_wrong, con)

Now the whole process no longer allocates and is about 20 times faster:

julia> @time sextets2octets(chars1_decoded_wrong_stateful2octets, discard)
  0.000054 seconds

julia> @time sextets2octets(chars1_decoded_wrong_stateful2octets, discard)
  0.000053 seconds

Could anybody explain what is going on? Have I overlooked something? I am running Julia 1.11.7 on macOS Sequoia 15.6.1 on a MacBook Pro with an Apple M4 Max.

Thanks a lot in advance!

Either one of these two brittle changes eliminates those allocations:

  1. @inline input(pack) and @inline chars_stateful2octets(chars1, decode_char, con) (a sketch follows this list). I don’t know if it’s on the Julia side or the LLVM side, but the overall compiler has an internal cost model for inlining small, statically dispatched function calls. @inline is a hint to the compiler to relax that cost model, though it’s not a guarantee the call gets inlined, even if it’s statically known. Inlining was able to put a lot of code in the same place for the compiler to optimize all at once. Code size making a significant runtime impact usually hints at this.

  2. Don’t do Iterators.Stateful, which wasn’t necessary to iterate chars in one go anyway. It does allocate as an external stateful instance, but it doesn’t normally allocate per iteration. This is likely another inlining tipping point.
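For concreteness, here is a sketch of change 1: the bodies stay exactly as in the original post, and only the two call sites gain the hint (call-site @inline requires Julia ≥ 1.8, and it remains only a hint):

function sextets2octets(input::Function, output::Function)::Nothing
    captured::UInt16 = UInt16(0)  # captured bits aligned left, first captured left-most
    bits::UInt8 = 0               # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
        return nothing
    end
    @inline input(pack)    # call-site hint; no guarantee the call actually gets inlined
    return nothing
end

chars1_decoded_stateful2octets(con) =
    @inline chars_stateful2octets(chars1, decode_char, con)   # same hint at the other call site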

Among other things, inlining helped prove that pack does not escape the function and allowed its memory to be optimized. If that doesn’t happen, captured and bits are local variables captured and reassigned by pack, which generally requires heap allocation and makes Julia’s call-wise type inference impossible (::Core.Box in @code_warntype). The initial 2 heap allocations of the Core.Boxes are actually the least of it; the rest of the allocations occur for the poorly inferred reads and writes on every iteration. Your type annotations don’t help those in the current Julia implementation of captured variables, though they do help infer the code around them.
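If you want to see the boxing yourself, it shows up in the reflection output (assuming the definitions from the original post); boxed captures are printed as ::Core.Box:

using InteractiveUtils  # provides @code_warntype outside the REPL (loaded automatically in the REPL)
@code_warntype sextets2octets(chars1_decoded_stateful2octets, discard)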

To evade Core.Box, the simplest refactor is to mutate RefValues with concrete types instead of reassigning variables annotated with the same types. You would at most allocate the 2 RefValues up front, then the reads and writes won’t allocate and will be inferred well, even without helping annotations. In this example actually, those 2 allocations are also optimized as if inlining occurred, which further reduced read and write overheads. To see the typical overheads with no allocations, just return pack; for reference, BenchmarkTools.@btime on my machine reports 77.300 μs for inlined/escaped conditions, 446.300 μs for 2 allocated RefValues, and 1.525 ms for the 92280 allocations in your example.

function sextets2octets(input::Function, output::Function)::Nothing
    captured = Ref(UInt16(0))  # captured bits aligned left, first captured left-most
    bits = Ref(UInt8(0))       # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end

When output is discard, the line output(UInt8(captured >> 8)) just computes a bitshifted value from captured::UInt16, converts it to UInt8, and finally maps it to nothing at a dead end. The value assigned to captured is completely unaffected, and this line is probably eliminated by the compiler in pack. No allocations here, but it could be a separate problem.

3 Likes

Just a couple of comments, in addition to the most important points about captured and bits and the Stateful iterator.

In general, there is little to gain from annotating variables with a type. The main effect of annotating things like:

captured::UInt16 = ...

is that every assignment

captured = expression

is replaced by

captured = convert(UInt16, expression)

If expression is inferred to be a UInt16 anyway, the convert will be optimized away. The same goes for annotating a function with a return type: every return expr is just replaced by return convert(UInt16, expr).

There is another effect: the compiler will know it’s a UInt16. But in a program written for performance it can infer this anyway, so if the annotation speeds things up, there are almost certainly other problems with the code (as there are in your case, with the boxing of captured and bits).

Typically, it will be better to annotate the right-hand side of expressions when in doubt:

captured = expression::UInt16

This will insert a type check, and you will get a runtime error if expression is not a UInt16.
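A minimal illustration of the two behaviors (converted and asserted are made-up names):

# Left-hand annotation converts; right-hand annotation asserts.
function converted()
    x::UInt16 = 1 + 1        # the Int 2 is converted: x == 0x0002
    return x
end

function asserted()
    x = (1 + 1)::UInt16      # throws a TypeError: 1 + 1 is an Int, not a UInt16
    return x
end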

I’m also a bit curious why you think channels (and tasks) are slow. They’re usually quite fast, but there is some minor overhead.

3 Likes

Thank you very much for your detailed reply! It makes sense that inlining is the reason for the tipping point. References can also be a solution, though I am not very happy with that: in my opinion, the compiler should optimize this case itself. However, I might be contaminated by Python. :grinning_face: Years ago, when C was virtually the only useful language, references (pointers) had to be used heavily and I appreciated them, but nowadays they might not be the best programming style. Well, if they solve the problem, it is of course good to have them available, but I presume that there are usually other, nicer solutions.

Of course, Iterators.Stateful is obviously redundant here. Its use arises from a somewhat more involved case where single steps of iteration are desired. A banal reason to introduce Iterators.Stateful would be that one does not need to distinguish the first step of the iteration from the subsequent ones. However, I thought that Iterators.Stateful has a more important advantage: from the source code, it is visible that it detects the type of the state. It seemed to me that this could create a function barrier, as iterate() could specialize accordingly. I was probably mistaken; I still have a lot to learn about how the compiler works.

All the best!

RefValue is not at all like the pointers or references to variables in C and other languages with a similar variable concept. Julia’s variable-instance concept is like Python’s, and its references to instances are like Python’s: e.g. a Python list is an array of references to which you assign other objects (the list itself is a reference; in fact, CPython classes are all reference types). RefValue is a reference to a single value, and the original example, where a nested function reassigns a variable from an enclosing function, basically demands a referenced value. The compiler did in fact make a reference for you; that’s what Core.Box is. Your wish that the compiler optimize this is shared by everyone who has encountered this issue; it’s possible in theory, but the implementation is complicated.

Julia’s and CPython’s core is implemented in C, so there are internal pointers and there are unsafe ways to interact with them if you need to. Most people should avoid this.

1 Like

Thank you for pointing this out: I was only partially aware of it. After wondering how this difference is visible from the outside, I came up with the following example:

julia> x = 3
3

julia> x_r = Ref(x)
Base.RefValue{Int64}(3)

julia> x = 43046721
43046721

julia> x_r[]
3

In contrast, the C code:

int x = 3;
int *x_r = &x;
x = 43046721;
*x_r;

would yield 43046721. It might be a good idea to add this example to the documentation. Or have I missed something? :grinning_face:

@inline affects the Julia inliner; you can tell LLVM to alwaysinline, but @inline doesn’t do that.

1 Like

It seems that we both thought that the compiler should take advantage of the type annotations. However, the profile of the following code (which is more or less the same as the initial one) contradicts it. Though references seem to be a solution, I would prefer that our initial thoughts were right. In other words, if I understand the issue correctly, the title Avoid untyped global variables in the Performance Tips should be changed to Avoid assignments to outer variables - use references instead. Am I right?

The behaviors of variable assignment and Ref(x) mutation are already documented, and they look and behave differently from syntax-level pointers in other languages. The only reason you’re comparing the two is the use of the word “reference”, and different contexts use it to mean different things.

No, these are completely different issues that work completely differently. Typed global variables don’t have the Core.Box read/write issues of typed captured variables.
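A small sketch of that distinction, with hypothetical names (typed globals require Julia ≥ 1.8):

counter::UInt16 = 0x0000                               # typed global: reads infer as UInt16, no Box
bump_global() = (global counter += 0x0001; nothing)    # assignment just converts to UInt16

function make_bump()
    n::UInt16 = 0x0000                 # typed local, captured and reassigned by the closure below:
    bump() = (n += 0x0001; nothing)    # n still becomes a Core.Box despite the annotation
    return bump
end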

1 Like

The documentation of Core.Ref does indeed read:

  • An object that safely references data of type T.
  • Creation of a Ref to a value x of type T …
  • Ref{T}() creates a reference to a value of type T …

For a professional programmer, this probably suffices to fully understand at first sight how exactly references in Julia work. Unfortunately, I am not of this kind: I needed a more detailed explanation, which you kindly provided. :slight_smile: Have I missed such an explanation somewhere else in the documentation? If so, it would be great to insert a link.

Have I understood correctly that this is another difference between local and global variables? Or is there something else behind it?

By the way, when learning Julia, I at first did not see the point in distinguishing between level 0 (global variables) and deeper levels (local variables). However, good reasons to do so are given in the documentation. It was indeed helpful to read the explanation; :slight_smile: if I remember correctly, it was added later.

Thank you once again for your time!

Consider the following modification of the original example where sextets of bits, which are first stored in a vector, are packed into octets and then discarded. Generate the data and define the discarding function:

sextets1::Vector{UInt8} = rand(0x00:0x3f, 65536)

discard(::UInt8) = nothing

Firstly, pass the sextets to a channel, read the channel, pack and eventually discard:

function sextets2octets_from_channel(input::Channel{UInt8}, output::Function)
    captured::UInt16 = UInt16(0)  # captured bits aligned left, first captured left-most
    bits::UInt8 = 0               # number of bits captured
    for sextet::UInt8 in input
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
    end
end

channel1(size::Int) = Channel{UInt8}(size) do ch
    for sextet::UInt8 in sextets1
        put!(ch, sextet)
    end
end

discard_sextets_via_channel(size::Int) =
    sextets2octets_from_channel(channel1(size), discard)

Unbuffered channels are really slow:

julia> @time discard_sextets_via_channel(0)
  1.620629 seconds (29 allocations: 1.781 KiB)

A buffer of size 1 is not helpful either:

julia> @time discard_sextets_via_channel(1)
  1.621503 seconds (29 allocations: 1.781 KiB)

However, comfortable buffering improves the performance significantly:

julia> @time discard_sextets_via_channel(1024)
  0.005253 seconds (33 allocations: 4.766 KiB)

julia> @time discard_sextets_via_channel(65536)
  0.003530 seconds (43 allocations: 267.422 KiB)

Now replace the channel by a function accepting a continuation function:

function sextets2octets_from_generator(input::Function, output::Function)
    captured::UInt16 = UInt16(0)  # captured bits aligned left, first captured left-most
    bits::UInt8 = 0               # number of bits captured
    input() do sextet::UInt8
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8
        end
    end
end

generator1(con::Function) = foreach(con, sextets1)

discard_sextets_via_generator() =
    sextets2octets_from_generator(generator1, discard)

The whole procedure now runs much faster:

julia> @time discard_sextets_via_generator()
  0.000060 seconds

I am aware that the comparison might not be entirely fair, as the second approach does not allow for step-by-step iteration. However, the input and output tokens still need not be one-to-one. In other words, the functionality is similar to the one provided by the packages FGenerators and Continuables, which allow for foreach(), but not for the classical for loop.

When step-by-step iteration is needed, the package ResumableFunctions could be a solution: it is decently fast. However, this package uses more sophisticated tools: if I understand it correctly, it actually transforms the code and, perhaps more importantly, only works under some significant restrictions.

At this point, I cannot help being a bit emotional. From my point of view, Julia is an extremely beautifully and thoughtfully designed language. Its system of tasks and channels is one of the key features demonstrating this: it is very powerful and very easy to learn. When I studied Python’s async features, I got stuck; when I looked up Julia’s tasks and channels, everything was almost obvious. I was absolutely fascinated: for a while, I even thought that channels could generally implement iteration, replacing iterate().

However, when I attempted to reproduce the Base64 package to test my knowledge, I was bitterly disappointed when it took ages to decode a single JPEG image (my channels were unbuffered at that point). Base64 is fast, but its code looks rather sophisticated. Channels allow for elegant and modular code, but unfortunately, the performance cost turned out to be quite significant.

More generally, Julia is commonly advertised as fast. My impression is different: I primarily see it as a beautifully designed language that lets the programmer write elegant and readable code. I really enjoy programming in Julia. However, considerable effort may be needed to make the code fast, unfortunately sometimes at the cost of elegance and readability: just consider ResumableFunctions.jl. I can imagine that this can deter many programmers. That would be a great pity: in my opinion, the thoughtful design of Julia allows it to become one of the leading programming languages, widely used and taught at schools, much as Python managed some years ago. I would be extremely happy if this happens!

Generally, it is not easy to combine speed and code elegance. However, I believe that with some additional improvements, Julia can indeed make it!

Ok, yes, in this case channels are a lot slower than direct looping. But beware that the discard function is a bit dangerous. It’s inlined, and the entire conversion is skipped.

Here’s my attempt:

using BenchmarkTools
# create function for saving octets. Call with nothing to obtain result
function savethem()
    v = UInt8[]
    sizehint!(v, 65536)
    x -> isnothing(x) ? v : push!(v, x)
end

discard(_) = nothing

function sextets2octets(input, output)
    captured = 0 % UInt16
    bits = 0 % UInt8
    for sextet in input
        bits += 6 % UInt8
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output(UInt8(captured >> 8))
            captured <<= 8
            bits -= 8 % UInt8
        end
    end
end

function convert_sextets(sextets, result, size)
    ch = Channel{UInt8}(size) do ch
        foreach(x -> put!(ch, x), sextets)
    end
    sextets2octets(ch, result)
    return result(nothing)
end


function convert_sextets(sextets, result)
    sextets2octets(sextets, result)
    return result(nothing)
end

const sextets = rand(0x00:0x3f, 65536)

rc = @btime convert_sextets(sextets, savethem(), Inf);
rg = @btime convert_sextets(sextets, savethem());
@assert rc == rg

@btime convert_sextets(sextets, discard, Inf);
@btime convert_sextets(sextets, discard);

@code_native dump_module=false convert_sextets(sextets, discard)

Conversion and saving takes 3.2 ms with the channel and 97 µs with the direct conversion, a factor of 33. It’s probably overhead from the atomics involved in reading and writing the channel (channels are thread-safe). There are no lock conflicts, so it’s resolved with the atomics, without event handling.

The discarding takes 2.7 ms with the channel and 1.9 ns with the direct conversion. The native code of the direct discard is essentially nothing; everything has been optimized away:

        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        pop     rbp
        ret

The overhead in the channel can be alleviated if several sextets are available at once. A minor change to sextets2octets allows the channel to contain either scalars or iterables. Note the total lack of type annotations:

function sextets2octets(input, output)
    captured = 0
    bits = 0
    for chunk in input, sextet in chunk
        bits += 6
        captured |= (UInt16(sextet) << (16 - bits))
        if bits >= 8
            output((captured >> 8) % UInt8)
            captured <<= 8
            bits -= 8
        end
    end
end

Now we can create various feeders. We can keep the old one, which creates a Channel of single UInt8s, but we can also chunk them, with the same sextets2octets. The keyword argument cs is the chunk size:

function convert_sextets_chunks(sextets, result, size; cs=100)
    T = typeof(@view(sextets[begin:end]))
    ch = Channel{T}(size) do ch
        for chunk in (@view(sextets[begin+i*cs : min(begin-1+(i+1)*cs, end)]) 
                      for i in 0:cld(length(sextets),cs))
            put!(ch, chunk)
        end
    end
    sextets2octets(ch, result)
    return result(nothing)
end
@btime convert_sextets_chunks(sextets, savethem(), Inf; cs=1000)

We’re now about 20% slower than the direct loop. Similarly, the channel can contain any type of iterator: we can e.g. let the Channel contain iterators which each read a single block from a file or from some download socket, still with the same sextets2octets.
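For instance, a hypothetical feeder along those lines (convert_sextets_from_file is a made-up name, and the sketch assumes the file’s bytes are already sextet values), still reusing the chunk-aware sextets2octets:

function convert_sextets_from_file(path, result; blocksize=4096, size=4)
    ch = Channel{Vector{UInt8}}(size) do ch
        open(path) do io
            while !eof(io)
                put!(ch, read(io, blocksize))   # each channel element is one block of bytes
            end
        end
    end
    sextets2octets(ch, result)   # the same chunk-aware sextets2octets as above
    return result(nothing)
end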

1 Like

Julia’s speed has only ever meant that it has an optimizing compiler and semantics that help it. Countless other languages are “fast” in the same sense, with their specific strengths and tradeoffs, and there is no compiler in existence that can omit the work you wrote your code to do. More specifically, asynchronous task switches have overheads that a serial loop doesn’t (this includes ResumableFunctions, which doesn’t mix iterations among other concurrent tasks), and shorter channel buffers multiply that overhead. Your foreach benchmark omitted an input step corresponding to how you had to fill a Channel{UInt8} from sextets1 (likely a proxy for a more practical stream), but I’d expect that adding a step to copy values from sextets1 to a sextets1_2 wouldn’t change the trend of more work taking more time. However magical Julia may seem at times, it is not capable of miracles like doing more work in equal or less time; it just has its own philosophy on how users program within the same practical constraints everyone else has.

2 Likes

Yes indeed, it is Julia’s philosophy that fits me incredibly well! :smiley: Thank you for your lessons, I have learned a lot!

At a point where you’re still learning Julia’s semantics and have already been frustrated by both implementation shortcomings (type annotations failing to help Core.Box reads and writes) and practical constraints (task-switching overheads of asynchronous Channels), I’d suggest keeping an open mind about the strengths and limitations of Julia or any other language, as much as you enjoy Julia. The unidiomatic type annotations and the few misassumptions here actually suggest you currently think more in terms of a statically typed language.

1 Like

Yes of course, grouping the input into chunks can improve the speed dramatically. :smiley: Thank you for your suggestion, especially for @view!

Well, there were extensive type annotations because I wanted to stay on the safe side; I was aware that most of them were actually unnecessary, but was not entirely sure which ones. :slight_smile: However, you are partially right: I do prefer static types if the problem can be reasonably solved this way.

Having read all your answers, I think I am now more realistic, but my attitude to Julia remains highly positive. :smiley: I will, however, be delighted if my frustrations happen to help the Julia community in any way.