Alright, I’ll walk through the process of creating the PR.
We start with the original function:
function stupidloop(num_iterations)
    str_list::Vector{String} = [] # allocate a new empty Vector
    for i in 0:num_iterations-1
        push!(str_list, "str_list, test_$(i + 3)_somethingsomething") # push dynamic strings (allocates)
    end
    total_length::Int = 0
    for line in str_list
        total_length += length(line) # accumulate length into a variable (no allocation)
    end
    total_length
end
I’ve marked places that could allocate in comments. We can see from @code_warntype that this function is type stable, so we don’t gain anything here for now. As for allocations, the only places that could allocate are the creation of the vector, the creation of the strings to be pushed, and pushing itself (since that may have to resize the vector to fit the new element).
We can’t get rid of the allocation for creating the vector unless we preallocate it and pass it in as an argument to the function, so we can ignore that one. Since we also don’t want to use the size known beforehand (though that would often be the best idea), we can ignore possible allocations from push! as well. The only place that’s left for allocations is the creation of the strings themselves - and since they rely on dynamic data, we should count at least one allocation per string (ideally it would really be exactly one, since the static parts of the string could in theory live on the stack, but that’s a story for another time).
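Before reaching for a full benchmark, a quick way to get a raw allocation count without any packages (a sketch; exact numbers vary between Julia versions and runs) is Base’s @allocated macro:

```julia
# The original function again, repeated here so this snippet is self-contained:
function stupidloop(num_iterations)
    str_list::Vector{String} = []
    for i in 0:num_iterations-1
        push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
    end
    total_length::Int = 0
    for line in str_list
        total_length += length(line)
    end
    total_length
end

stupidloop(10)                           # warm up first, so compilation isn't counted
println(@allocated stupidloop(1234567))  # exact count varies between Julia versions
```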
If we run a benchmark, we’ll see that stupidloop(1234567) reports 7406915 allocations. One of those is the allocation of str_list, 1234567 should be for creating the strings, and at most ceil(log2(1234567)) = 21 should be for growing the vector (since Julia’s vectors double in size when necessary). That only adds up to 1234589 though, so we’re still shy 6172326 allocations somewhere. Let’s eliminate array growing from the allocations by creating an array of the correct size:
function stupidloop2(num_iterations)
    str_list = Vector{String}(undef, num_iterations) # allocate an uninitialized Vector of the correct size
    for i in 0:num_iterations-1
        str_list[i+1] = "str_list, test_$(i + 3)_somethingsomething" # write dynamic strings (allocates)
    end
    total_length::Int = 0
    for line in str_list
        total_length += length(line) # accumulate length into a variable (no allocation)
    end
    total_length
end
We now get 7406895 allocations - that’s 20 fewer than before (which is about log2(1234567), so our reasoning about push! allocating was correct!). The only place left for allocations is now the creation of the strings, which means we allocate about 6 times per created string! Let’s investigate further:
julia> @benchmark "str_list, test_$(i + 3)_somethingsomething" setup=(i=rand(0:123567))
BenchmarkTools.Trial:
  memory estimate:  304 bytes
  allocs estimate:  5
  --------------
  minimum time:     211.937 ns (0.00% GC)
  median time:      226.419 ns (0.00% GC)
  mean time:        259.758 ns (8.22% GC)
  maximum time:     10.828 μs (94.62% GC)
  --------------
  samples:          10000
  evals/sample:     511
So we can confirm that those allocations are from creating the strings. How are those strings created?
julia> i = rand(0:1234566); @code_warntype "str_list, test_$(i + 3)_somethingsomething"
MethodInstance for string(::String, ::Int64, ::String)
  from string(xs...) in Base at strings/io.jl:187
Arguments
  #self#::Core.Const(string)
  xs::Tuple{String, Int64, String}
Body::String
1 ─ %1 = Core._apply_iterate(Base.iterate, Base.print_to_string, xs)::String
└──      return %1
@code_warntype tells us that the method in question is string(::String, ::Int64, ::String) (the first line of output), located in Base at strings/io.jl:187. That line is string(xs...) = print_to_string(xs...), which is also what the output of @code_warntype above shows (that’s the Core._apply_iterate(Base.iterate, ...) part, semantically iterating over the splatted arguments). So our method call looks like this:
string("str_list, test_", i+3, "_somethingsomething")
which is passed through to print_to_string. Let’s go one level deeper:
julia> @code_warntype Base.print_to_string("str_list, test_", i + 3, "_somethingsomething")
MethodInstance for Base.print_to_string(::String, ::Int64, ::String)
  from print_to_string(xs...) in Base at strings/io.jl:135
Arguments
  #self#::Core.Const(Base.print_to_string)
  xs::Tuple{String, Int64, String}
Locals
  @_3::Union{Nothing, Tuple{Union{Int64, String}, Int64}}
  @_4::Union{Nothing, Tuple{Union{Int64, String}, Int64}}
  s::IOBuffer
  siz::Int64
  x@_7::Union{Int64, String}
  x@_8::Union{Int64, String}
Body::String
1 ─ Core.NewvarNode(:(@_3))
│ Core.NewvarNode(:(@_4))
│ Core.NewvarNode(:(s))
│ Core.NewvarNode(:(siz))
│ %5 = Base.isempty(xs)::Core.Const(false)
└── goto #3 if not %5
2 ─ Core.Const(:(return ""))
3 ┄ %8 = Base.convert(Base.Int, 0)::Core.Const(0)
│ (siz = Core.typeassert(%8, Base.Int))
│ %10 = xs::Tuple{String, Int64, String}
│ (@_4 = Base.iterate(%10))
│ %12 = (@_4::Core.PartialStruct(Tuple{String, Int64}, Any[String, Core.Const(2)]) === nothing)::Core.Const(false)
│ %13 = Base.not_int(%12)::Core.Const(true)
└── goto #6 if not %13
4 ┄ %15 = @_4::Union{Tuple{Int64, Int64}, Tuple{String, Int64}}
│ (x@_7 = Core.getfield(%15, 1))
│ %17 = Core.getfield(%15, 2)::Int64
│ %18 = siz::Int64
│ %19 = Base._str_sizehint(x@_7)::Int64
│ %20 = (%18 + %19)::Int64
│ %21 = Base.convert(Base.Int, %20)::Int64
│ (siz = Core.typeassert(%21, Base.Int))
│ (@_4 = Base.iterate(%10, %17))
│ %24 = (@_4 === nothing)::Bool
│ %25 = Base.not_int(%24)::Bool
└── goto #6 if not %25
5 ─ goto #4
6 ┄ %28 = (:sizehint,)::Core.Const((:sizehint,))
│ %29 = Core.apply_type(Core.NamedTuple, %28)::Core.Const(NamedTuple{(:sizehint,)})
│ %30 = Core.tuple(siz)::Tuple{Int64}
│ %31 = (%29)(%30)::NamedTuple{(:sizehint,), Tuple{Int64}}
│ %32 = Core.kwfunc(Base.IOBuffer)::Core.Const(Core.var"#Type##kw"())
│ (s = (%32)(%31, Base.IOBuffer))
│ %34 = xs::Tuple{String, Int64, String}
│ (@_3 = Base.iterate(%34))
│ %36 = (@_3::Core.PartialStruct(Tuple{String, Int64}, Any[String, Core.Const(2)]) === nothing)::Core.Const(false)
│ %37 = Base.not_int(%36)::Core.Const(true)
└── goto #9 if not %37
7 ┄ %39 = @_3::Union{Tuple{Int64, Int64}, Tuple{String, Int64}}
│ (x@_8 = Core.getfield(%39, 1))
│ %41 = Core.getfield(%39, 2)::Int64
│ Base.print(s, x@_8)
│ (@_3 = Base.iterate(%34, %41))
│ %44 = (@_3 === nothing)::Bool
│ %45 = Base.not_int(%44)::Bool
└── goto #9 if not %45
8 ─ goto #7
9 ┄ %48 = Base.getproperty(s, :data)::Vector{UInt8}
│ %49 = Base.getproperty(s, :size)::Int64
│ %50 = Base.resize!(%48, %49)::Vector{UInt8}
│ %51 = Base.String(%50)::String
└── return %51
There’s some red in there! So print_to_string is unstable for the arguments we’re giving it. We can also see that print_to_string(xs...) is still Julia code in Base at strings/io.jl:135, so let’s check out the definition:
julia> @edit Base.print_to_string("str_list, test_", i + 3, "_somethingsomething")
which opens the default editor configured (nano in my case) at the function definition that would be called. It’s this function:
function print_to_string(xs...)
    if isempty(xs)
        return "" # our xs is not empty, so we can ignore this
    end
    siz::Int = 0
    for x in xs
        siz += _str_sizehint(x)
    end
    # specialized for performance reasons # original comment
    s = IOBuffer(sizehint=siz) # allocates an IOBuffer with an internal buffer
    for x in xs
        print(s, x) # prints to the IOBuffer
    end
    String(resize!(s.data, s.size)) # allocates a new string
end
This function iterates over its arguments (and twice at that!). Since we have a mix of types in there, that loop is necessarily type unstable: each iteration over xs might give an x of a different type - there’s not much we can do about that, unfortunately.
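A tiny sketch of that instability - sum_lengths below is a made-up stand-in for the first loop of print_to_string. Since xs mixes Strings and an Int64, the loop variable x can only be inferred as a Union of those types:

```julia
# Mimics the first loop of print_to_string with a hypothetical size measure:
# inside the loop, `x` is a String on some iterations and an Int64 on others,
# so its inferred type is Union{Int64, String} - type unstable by construction.
function sum_lengths(xs...)
    total = 0
    for x in xs
        total += x isa Integer ? ndigits(x) : sizeof(x)
    end
    total
end

sum_lengths("str_list, test_", 1234569, "_somethingsomething")  # mixed-type tuple
```

Running @code_warntype on a call like this shows the same kind of Union-typed loop variable we saw above.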
Judging from the comment, it seems like performance was already a consideration when this code was written - so there must be something tricky going on. We can still do the same kind of analysis on this code as we did with our original stupidloop though - after all, this is just regular ol’ Julia code, like any we would write ourselves!
The first loop gets some sort of size estimate for the IOBuffer (just like we did earlier when preallocating the output vector). It’s just adding numbers, so there are no allocations here. The IOBuffer should thus already be created with the correct size (one allocation again). We’re passing in a String, an Int64 and another String - how are those size heuristics determined for those types? Let’s check @edit Base._str_sizehint(1) and find out:
function _str_sizehint(x)
    if x isa Float64
        return 20
    elseif x isa Float32
        return 12
    elseif x isa String || x isa SubString{String}
        return sizeof(x)
    elseif x isa Char
        return ncodeunits(x)
    else
        return 8
    end
end
Ok, so this function takes anything (x is untyped, so Any), checks its type and returns a size estimate based on that. For String, that’s just the size of the string - already the perfect value, since that’s precisely how much space it takes. For Int64 though, we fall back to the else return 8 clause - but surely the number of base-10 digits can exceed 8 for large numbers, right? ceil(Int, log10(typemax(Int64))) = 19, plus one for a potential sign character, so we may need up to 20 characters for any given Int64. If we underestimate the size here, the IOBuffer in print_to_string that called this method would be too small and would have to grow its internal buffer by allocating. This affects numbers larger than 99,999,999, i.e. numbers greater than ~2^26.5 (the vast majority of Int64 values, since they go up to 2^63). To prevent that, I started a PR that simply returns 20 (the maximum size necessary) in the case of UInt64 or Int64, which is the PR linked above.
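The change itself is small. Here’s a sketch of what the amended heuristic could look like (the actual PR may differ in its details):

```julia
# Sketch of an amended size heuristic: returning the worst case of 20
# characters for 64-bit integers avoids having to grow the IOBuffer later.
function _str_sizehint(x)
    if x isa Float64
        return 20
    elseif x isa Float32
        return 12
    elseif x isa String || x isa SubString{String}
        return sizeof(x)
    elseif x isa Char
        return ncodeunits(x)
    elseif x isa UInt64 || x isa Int64
        return 20  # up to 19 digits, plus one for a potential sign
    else
        return 8
    end
end
```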
This doesn’t actually help our case though, since 1234569 (the maximum value that would be passed to string from our loop) is far smaller than 2^26.5 - so there has to be something else going on. Going back to the print_to_string we had above:
function print_to_string(xs...)
    if isempty(xs)
        return "" # our xs is not empty, so we can ignore this
    end
    siz::Int = 0
    for x in xs
        siz += _str_sizehint(x)
    end
    # specialized for performance reasons # original comment
    s = IOBuffer(sizehint=siz) # allocates an IOBuffer with an internal buffer
    for x in xs
        print(s, x) # prints to the IOBuffer
    end
    String(resize!(s.data, s.size)) # allocates a new string
end
There’s not much left to check here, so let’s write a small test function that mimics this behaviour:
function test_io(io_buffer, xs...)
    for x in xs
        print(io_buffer, x)
    end
end
We do nothing more than print to the IOBuffer, since that’s the only place where we haven’t spotted an allocation yet. String(resize!(s.data, s.size)) only allocates once, and IOBuffer(sizehint=siz) allocates once (that’s the internal buffer; the allocation of the IOBuffer itself can be elided since it doesn’t escape print_to_string - sadly that isn’t quite possible for the internal Vectors…), so we’re still missing 3 allocations that can’t otherwise be explained. Sure enough, the function allocates exactly those:
julia> @btime test_io(io, vals...) setup=(io=IOBuffer(sizehint=200);vals=["str_list, test_", rand(3:1234569), "_somethingsomething"]) evals=1
1.600 μs (3 allocations: 112 bytes)
At this point, we’ve nailed it down to print allocating when writing into an IOBuffer (which it shouldn’t! There’s already an issue for that though, and fixing that one is indeed more complicated). We’ve done nothing more than check the source code, repeatedly think “what could possibly allocate?”, test that hypothesis, and start again one level deeper. To me, that’s much more accessible than having to guess, especially since these are the exact same tools I’d use to debug my own code. Sometimes that debugging just happens to lead to finding inefficiencies in Base, and to me that’s a good thing, because it means this stuff is accessible to “regular” developers - there’s no special knowledge needed to find it. Of course, that doesn’t mean we can fix it ourselves every time, like in the case of _str_sizehint, but reporting these kinds of fundamental issues is often a gateway to improvements that help in a bunch of other cases hitting the same code paths.
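To see the culprit in isolation, we can measure a single print of an Int64 into a preallocated IOBuffer (a sketch; whether this still allocates depends on the Julia version):

```julia
# Isolate the suspect: one print of an Int64 into an already-sized IOBuffer.
io = IOBuffer(sizehint=200)             # preallocated, as in print_to_string
print(io, 1234567)                      # warm up first, so compilation isn't counted
println(@allocated print(io, 1234567))  # ideally zero; non-zero on the version used here
```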
I don’t know how the Rust version works in detail (mostly because I don’t know Rust), but judging from the way it’s written (the interpolated argument gets passed as its own argument), the following should be much closer to what Rust is doing:
using Printf # stdlib

function stupidloop(num_iterations)
    str_list::Vector{String} = [] # allocate a new empty Vector
    for i in 0:num_iterations-1
        push!(str_list, @sprintf "str_list, test_%i_somethingsomething" (i+3)) # push dynamic strings (allocates)
    end
    total_length::Int = 0
    for line in str_list
        total_length += length(line) # accumulate length into a variable (no allocation)
    end
    total_length
end
and indeed, that’s 0.392s on my machine. I hope the “regular” string interpolation can catch up to this! Until then, I’ll just use the Printf stdlib for performance-critical string work.
Well, that was a deep dive! I hope this was insightful to you.