Either one of these two brittle changes eliminates those allocations:

- `@inline input(pack)` and `@inline chars_stateful2octets(chars1, decode_char, con)`. I don't know whether it's on the Julia side or the LLVM side, but the compiler has an internal cost model for inlining small, statically dispatched function calls. `@inline` is a hint to relax that cost model, though it's not a guarantee the call gets inlined, even if the callee is statically known. Inlining was able to put a lot of code in one place for the compiler to optimize all at once; code size making a significant runtime difference usually hints at this.
- Don't use `Iterators.Stateful`, which wasn't necessary to iterate `chars` in one pass anyway. It does allocate one external stateful instance, but it doesn't normally allocate per iteration. This is likely another inlining tipping point.
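To illustrate the second point, here is a minimal sketch (hypothetical `count_*` helpers, not from the original code) of the same loop with and without `Iterators.Stateful`. The `Stateful` wrapper is a mutable struct allocated on the heap once; plain iteration over the string keeps its state in locals.

```julia
# Hypothetical sketch: counting characters with and without Iterators.Stateful.
function count_stateful(s)
    n = 0
    for _ in Iterators.Stateful(s)  # the wrapper itself allocates once
        n += 1
    end
    return n
end

function count_plain(s)
    n = 0
    for _ in s                      # no wrapper, no extra allocation
        n += 1
    end
    return n
end

count_stateful("TWFu") == count_plain("TWFu")  # both 4
```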
Among other things, inlining helped prove that `pack` does not escape the function, which let the compiler optimize its memory. If that doesn't happen, `captured` and `bits` are local variables captured and reassigned by `pack`, which generally requires heap allocation and defeats Julia's call-wise type inference (shown as `::Core.Box` in `@code_warntype`). The initial 2 heap allocations of the `Core.Box`es are actually the least of it; the rest of the allocations occur for the poorly inferred reads and writes on every iteration. Your type annotations don't help those in the current Julia implementation of captured variables, though they do help infer the code around them.
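A minimal, self-contained demonstration of the boxing problem (a hypothetical `boxed_sum`, not your code): the captured variable is reassigned inside the closure, so Julia boxes it, and `@code_warntype` flags it.

```julia
# `n` is captured AND reassigned by the closure, so it becomes a Core.Box;
# every read and write of `n` is then poorly inferred.
function boxed_sum(xs)
    n = 0
    add!(x) = (n += x)  # reassigns the captured variable
    foreach(add!, xs)
    return n            # @code_warntype boxed_sum(1:4) shows n::Core.Box
end

boxed_sum(1:4)  # 10, computed through boxed, uninferred accesses
```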
To evade `Core.Box`, the simplest refactor is to mutate `Ref`s holding concrete types instead of reassigning variables annotated with the same types. You would at most allocate the 2 `Ref`s up front; the reads and writes then won't allocate and will be inferred well, even without helping annotations. In this example, those 2 allocations are in fact also optimized away as if inlining occurred, which further reduced the read and write overheads. To see the typical overheads with no allocations, just return `pack`; for reference, `BenchmarkTools.@btime` on my machine reports 77.300 μs for the inlined/non-escaping versions, 446.300 μs for 2 allocated `Ref`s, and 1.525 ms for the 92280 allocations in your example.
```julia
function sextets2octets(input::Function, output::Function)::Nothing
    captured = Ref(UInt16(0)) # captured bits aligned left, first captured left-most
    bits = Ref(UInt8(0))      # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end
```
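As a sanity check, here is a hypothetical driver decoding the classic base64 example `"TWFu"` to `"Man"`. The `Ref`-based definition is repeated so the snippet runs standalone; `ALPHABET` and `decode_char` are my assumptions, not part of your original code.

```julia
# Ref-based version from above, repeated so this snippet is self-contained.
function sextets2octets(input::Function, output::Function)::Nothing
    captured = Ref(UInt16(0)) # captured bits aligned left, first captured left-most
    bits = Ref(UInt8(0))      # number of bits captured
    function pack(sextet::UInt8)::Nothing
        bits[] += 6
        captured[] |= (UInt16(sextet) << (16 - bits[]))
        if bits[] >= 8
            output(UInt8(captured[] >> 8))
            captured[] <<= 8
            bits[] -= 8
        end
        return nothing
    end
    input(pack)
    return nothing
end

# Hypothetical base64 lookup for the driver.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
decode_char(c::Char) = UInt8(findfirst(==(c), ALPHABET) - 1)

out = UInt8[]
sextets2octets(pack -> foreach(c -> pack(decode_char(c)), "TWFu"),
               octet -> push!(out, octet))
decoded = String(out)  # "Man"
```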
This just computes a bit-shifted value from `captured::UInt16`, converts it to `UInt8`, and finally maps it to `nothing` at a dead end. The value assigned to `captured` is completely unaffected, and that line is probably eliminated by the compiler in `pack`. No allocations here, but it could be a separate problem.