Help with binary trees benchmark games example

This is from the version with str_list::Array{String} = [].

julia> typeof(convert(Array{String}, []))
Vector{String} (alias for Array{String, 1})

You can define a struct with a field like ::Array{String}, and this struct will be capable of holding an Array of any dimension with element type String. But you can’t actually construct an array without a dimension, so when you convert it, the original dimension of [] (a Vector{Any}) gets preserved.

Thus the converted value is actually concretely typed, and the compiler can figure out the type, making it stable.
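This is quick to check at the REPL (a small sketch; the key point is that Array{String} without a dimension is abstract, while the converted value is concrete):

```julia
# Array{String} leaves the dimension parameter free, so it is abstract:
@assert !isconcretetype(Array{String})

# Converting [] (a Vector{Any}) preserves its dimension,
# so the result is the concrete Vector{String}:
v = convert(Array{String}, [])
@assert typeof(v) == Vector{String}
@assert isconcretetype(typeof(v))
```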

8 Likes

Just for the record, this version, which concatenates the strings rather than calling string, seems to be as fast as using string:

push!(str_list, "str_list, test_"*"$(i + 3)"*"_somethingsomething")

This probably narrows down a little further where to search for optimizations.
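For reference, the two spellings produce identical strings, so comparing them is purely a performance question (a small check; i is just a sample value):

```julia
i = 5
a = "str_list, test_$(i + 3)_somethingsomething"             # interpolation
b = "str_list, test_" * "$(i + 3)" * "_somethingsomething"   # concatenation
@assert a == b
```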

1 Like

I am sorry if it sounded like that - it was certainly not my intention. To clarify:

What I have seen here on the Julia Discourse as well as on other platforms is that people are very fond of and very excited about Julia - they love the language. Which is a great thing! But to me it feels like a few people (a minority) are very protective about potential shortcomings of the language and are quick to explain away issues others have, finding workarounds or dismissing them as non-relevant. It feels like this is partly due to how much they love the language but also, as @jzr speculated, due to very different domains, goals and needs colliding:

Julia is a language which many people seem to use mainly for their own very specific use cases (often academic research it seems), so that argument makes sense to me.

And I am saying and discussing that not because I “want people to stop sitting on their hands” (again I am sorry if that is what it sounded like - it certainly was not my intention) but because I think it’s important to acknowledge these issues, even though they seem non-issues to others.
Because from what I have seen so far I really like Julia as a language - it has a lot of great features. But I am (yes, I am exaggerating here a lot) scared of it going down the “Matlab loop route” because people start to accept certain things - e.g. string operations being slow. However, this discussion and also the issues linked to me indicate that there is a LOT going on behind the scenes I did not realize before and that is really good to see.

As said previously, I want to determine if I want to use Julia for my project, which, yes, will include a lot of string operations (some interpolation, much searching, much parsing) but also a lot of creating and deleting of objects. Right now it’s a lot of preprocessing of data, a very dumb algorithm to make sense of that data and a small microservice. But it will probably also involve deep learning and metaheuristic experimenting in the future. From what I have read on the Julia main page, Julia should be the perfect fit for that - and Julia already has a great ecosystem for machine learning. Which is why I am so critical about those benchmarks and string processing issues I found while playing around - not because I dislike the language but because right now it would be my number one choice, so I need to check for potential roadblocks ahead.

5 Likes

I think the best reason to choose, or not choose, one language over another is whether there is a package important to your work that is available in that language and cannot be easily, or efficiently, used in other environments. Something that you are sure you will not go down the road of implementing yourself.

Small snippets like this are not a good reason to choose or not to choose Julia, because anything can be made fast with some experience, advice, etc. If in the process of developing your code you had found this was a bottleneck, a lot of focus would have been put on this part. Julia more or less guarantees that you will have the tools within the language to write any code as fast as possible, as Fortran, C, C++, etc., do. That does not mean that achieving the best performance in a part of the code will be easy or beginner-friendly, but that is not the case in any language. Writing very fast code is hard in any language.

Thus, I would first look at the package ecosystem. Does it look suited to your goals? If it does, the optimization of snippets of code is something that will, in the end, be possible, even if you have to code some specific function which you originally thought would be ready as-is. Very commonly, with knowledge of the structure of your problems, what you write is faster, yet less general, than what is available in libraries, and being able to do that in the same language is nice.

3 Likes

There are people who gave up on using Julia because it was too hard to tune the performance of their real code to their needs. Snippets like this are sometimes good predictors (and sometimes bad predictors) of the larger-scale properties of what it’s like to work in a language.

3 Likes

I agree that there’s a lot to be said for the “pit of success” model where what one thinks of writing off the bat is usually good enough performance/correctness wise.

That said, I feel like every thread that has gone off the rails before has done so because it focused on whether Julia should be faster at X rather than on what could be done/is already in progress (and importantly, what people/contributions/other resources are needed) to make X faster. Because when the ratio of julia perf/comparison language perf < 1, the answer to “should it” is almost always “yes”. The stumbling point is the how, and as we’re seeing, that’s starting to be explored now that this thread has put string operations into the spotlight.

Edit: the flipside to this is that those of us who don’t have the ability to contribute should calibrate our expectations accordingly. Many threads (not saying this one is, but they’re not hard to find) have devolved into “why haven’t you done this already, don’t you care/know how important X is” and then “Julia is a dead (on arrival) language because of this”. Hopefully that can be avoided here.

7 Likes

Now I have a very specific question on this:

How to find the function that is being called when one does a string interpolation? Because when I try

@edit "a $(1) b"

I am taken to a function that seems to be just the one that prints the result in the REPL.

1 Like

I usually try one of @macroexpand or Meta.@lower to find that. In this case AIUI the latter works because string interpolation is a syntactic construct:

julia> Meta.@lower "a $(1) b"
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.string("a ", 1, " b")
└──      return %1
))))
4 Likes

It is the same function, and the correct one. I was just being dumb. :slight_smile:

Now this is perhaps more on topic, and who knows, we can learn something from someone here. The function that is called when we do “a$(1)b” is print_to_string(xs...) from Base. It is this function:

julia> function print_to_string(xs...)
           if isempty(xs)
               return ""
           end
           siz::Int = 0
           for x in xs
               siz += Base._str_sizehint(x)
           end
           # specialized for performance reasons
           s = IOBuffer(sizehint=siz)
           for x in xs
               print(s, x)
           end
           String(resize!(s.data, s.size))
       end

It looks very sophisticated :slight_smile: . Yet, I don’t understand in what sense it is any better than:

julia> function mystring(xs...)
           s = ""
           for x in xs 
             if x isa String
               s *= x
             else
               s *= "$x"  # string(x) here is the same for performance
             end
           end
           s
       end

Given that the last one is quite a bit faster: the stupidloop function runs in 887 ms with the default one, and in 583 ms with the second, simpler one.
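A quick sanity check that the simpler version produces identical output (restating mystring here so the snippet is self-contained; the ternary is equivalent to the if/else above):

```julia
function mystring(xs...)
    s = ""
    for x in xs
        s *= x isa String ? x : string(x)   # concatenate, converting non-strings
    end
    s
end

# Same result as the Base string(...) call that interpolation lowers to:
@assert mystring("str_list, test_", 8, "_somethingsomething") ==
        string("str_list, test_", 8, "_somethingsomething")
```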

Alright, I’ll walk through the process of creating the PR.

We start with the original function:

function stupidloop(num_iterations)
      str_list::Vector{String} = []   # allocate a new empty Vector
      for i in 0:num_iterations-1
           push!(str_list, "str_list, test_$(i + 3)_somethingsomething") # push dynamic strings (allocates)
      end

      total_length::Int = 0

      for line in str_list
          total_length += length(line) # accumulate length into a variable (no allocation)
      end

      total_length
end

I’ve marked places that could allocate in comments. We can see from @code_warntype that this function is type stable, so we don’t gain anything here for now. As for allocations, the only places that could allocate are the creation of the vector, creation of the strings to be pushed and pushing itself (since that may have to resize the vector to fit the new element).

We can’t get rid of the allocation from creating the vector unless we preallocate it and pass it in as an argument to the function, so we can ignore that. Since we also don’t want to use the size known beforehand (though that would often be the best idea), we can also ignore possible allocations from push! for this analysis. The only place that’s left for allocations is the creation of the strings themselves - and since they rely on dynamic data, we should count at least one allocation per string (ideally it would really be exactly one, since the static parts of the string could in theory live on the stack, but that’s a story for another time).

If we run some benchmark, we’ll see that for stupidloop(1234567) we get 7406915 allocations reported. One of those is the allocation of str_list, 1234567 should be for creating the strings, and at most log2(1234567) ≈ 21 should be for growing (since julia’s vectors double in size when necessary). That’s only 1234589 though, so we’re still shy of 6172326 allocations somewhere. Let’s eliminate array growing from the allocations by creating an array of the correct size:

function stupidloop2(num_iterations)
      str_list = Vector{String}(undef, num_iterations)   # allocate a Vector of the final size up front
      for i in 0:num_iterations-1
           str_list[i+1] = "str_list, test_$(i + 3)_somethingsomething" # write dynamic strings (allocates)
      end

      total_length::Int = 0

      for line in str_list
          total_length += length(line) # accumulate length into a variable (no allocation)
      end

      total_length
end

We now get 7406895 allocations, that’s 20 less than before (which is about log2(1234567), so our knowledge about pushing allocating was correct!). The only place that’s left for allocation is now the creation of the strings, which means we allocate about 6 times per created string! Let’s investigate further:

julia> @benchmark "str_list, test_$(i + 3)_somethingsomething" setup=(i=rand(0:123567))
BenchmarkTools.Trial:                                                                  
  memory estimate:  304 bytes                                                          
  allocs estimate:  5                                                                  
  --------------                                                                       
  minimum time:     211.937 ns (0.00% GC)                                              
  median time:      226.419 ns (0.00% GC)                                              
  mean time:        259.758 ns (8.22% GC)                                              
  maximum time:     10.828 μs (94.62% GC)                                              
  --------------                                                                       
  samples:          10000                                                              
  evals/sample:     511                                                                

So we can confirm that those allocations are from creating the strings. How are those strings created?

julia> i = rand(0:1234566); @code_warntype "str_list, test_$(i + 3)_somethingsomething"
MethodInstance for string(::String, ::Int64, ::String)                                 
  from string(xs...) in Base at strings/io.jl:187                                      
Arguments                                                                              
  #self#::Core.Const(string)                                                           
  xs::Tuple{String, Int64, String}                                                     
Body::String                                                                           
1 ─ %1 = Core._apply_iterate(Base.iterate, Base.print_to_string, xs)::String           
└──      return %1                                                                     

@code_warntype tells us that the method in question is string(::String, ::Int64, ::String) (the first line of output), located in Base at strings/io.jl:187. That line is string(xs...) = print_to_string(xs...), which is also what the output of @code_warntype from above shows (that’s the Core._apply_iterate(Base.iterate, ...) part, semantically iterating over splatted arguments). So our method call looks like this:

string("str_list, test_", i+3, "_somethingsomething")

which is passed through to print_to_string. Let’s go one level deeper:

julia> @code_warntype Base.print_to_string("str_list, test_", i + 3, "_somethingsomething")                         
MethodInstance for Base.print_to_string(::String, ::Int64, ::String)                                                
  from print_to_string(xs...) in Base at strings/io.jl:135                                                          
Arguments                                                                                                           
  #self#::Core.Const(Base.print_to_string)                                                                          
  xs::Tuple{String, Int64, String}                                                                                  
Locals                                                                                                              
  @_3::Union{Nothing, Tuple{Union{Int64, String}, Int64}}                                                           
  @_4::Union{Nothing, Tuple{Union{Int64, String}, Int64}}                                                           
  s::IOBuffer                                                                                                       
  siz::Int64                                                                                                        
  x@_7::Union{Int64, String}                                                                                        
  x@_8::Union{Int64, String}                                                                                        
Body::String                                                                                                        
1 ─       Core.NewvarNode(:(@_3))                                                                                   
│         Core.NewvarNode(:(@_4))                                                                                   
│         Core.NewvarNode(:(s))                                                                                     
│         Core.NewvarNode(:(siz))                                                                                   
│   %5  = Base.isempty(xs)::Core.Const(false)                                                                       
└──       goto #3 if not %5                                                                                         
2 ─       Core.Const(:(return ""))                                                                                  
3 ┄ %8  = Base.convert(Base.Int, 0)::Core.Const(0)                                                                  
│         (siz = Core.typeassert(%8, Base.Int))                                                                     
│   %10 = xs::Tuple{String, Int64, String}                                                                          
│         (@_4 = Base.iterate(%10))                                                                                 
│   %12 = (@_4::Core.PartialStruct(Tuple{String, Int64}, Any[String, Core.Const(2)]) === nothing)::Core.Const(false)
│   %13 = Base.not_int(%12)::Core.Const(true)                                                                       
└──       goto #6 if not %13                                                                                        
4 ┄ %15 = @_4::Union{Tuple{Int64, Int64}, Tuple{String, Int64}}                                                     
│         (x@_7 = Core.getfield(%15, 1))                                                                            
│   %17 = Core.getfield(%15, 2)::Int64                                                                              
│   %18 = siz::Int64                                                                                                
│   %19 = Base._str_sizehint(x@_7)::Int64                                                                           
│   %20 = (%18 + %19)::Int64                                                                                        
│   %21 = Base.convert(Base.Int, %20)::Int64                                                                        
│         (siz = Core.typeassert(%21, Base.Int))                                                                    
│         (@_4 = Base.iterate(%10, %17))                                                                            
│   %24 = (@_4 === nothing)::Bool                                                                                   
│   %25 = Base.not_int(%24)::Bool                                                                                   
└──       goto #6 if not %25                                                                                        
5 ─       goto #4                                                                                                   
6 ┄ %28 = (:sizehint,)::Core.Const((:sizehint,))                                                                    
│   %29 = Core.apply_type(Core.NamedTuple, %28)::Core.Const(NamedTuple{(:sizehint,)})                               
│   %30 = Core.tuple(siz)::Tuple{Int64}                                                                             
│   %31 = (%29)(%30)::NamedTuple{(:sizehint,), Tuple{Int64}}                                                        
│   %32 = Core.kwfunc(Base.IOBuffer)::Core.Const(Core.var"#Type##kw"())                                             
│         (s = (%32)(%31, Base.IOBuffer))                                                                           
│   %34 = xs::Tuple{String, Int64, String}                                                                          
│         (@_3 = Base.iterate(%34))                                                                                 
│   %36 = (@_3::Core.PartialStruct(Tuple{String, Int64}, Any[String, Core.Const(2)]) === nothing)::Core.Const(false)
│   %37 = Base.not_int(%36)::Core.Const(true)                                                                       
└──       goto #9 if not %37                                                                                        
7 ┄ %39 = @_3::Union{Tuple{Int64, Int64}, Tuple{String, Int64}}                                                     
│         (x@_8 = Core.getfield(%39, 1))                                                                            
│   %41 = Core.getfield(%39, 2)::Int64                                                                              
│         Base.print(s, x@_8)                                                                                       
│         (@_3 = Base.iterate(%34, %41))                                                                            
│   %44 = (@_3 === nothing)::Bool                                                                                   
│   %45 = Base.not_int(%44)::Bool                                                                                   
└──       goto #9 if not %45                                                                                        
8 ─       goto #7                                                                                                   
9 ┄ %48 = Base.getproperty(s, :data)::Vector{UInt8}                                                                 
│   %49 = Base.getproperty(s, :size)::Int64                                                                         
│   %50 = Base.resize!(%48, %49)::Vector{UInt8}                                                                     
│   %51 = Base.String(%50)::String                                                                                  
└──       return %51                                                                                                

There’s some red in there! So print_to_string is unstable for the arguments we’re given. We can also see that print_to_string(xs...) is still julia code in Base at strings/io.jl:135, so let’s check out the definition:

julia> @edit Base.print_to_string("str_list, test_", i + 3, "_somethingsomething")

which opens the default editor configured (nano in my case) at the function definition that would be called. It’s this function:

function print_to_string(xs...)          
    if isempty(xs)                       
        return ""    # our xs is not empty, so we can ignore this                    
    end                                  
    siz::Int = 0                         
    for x in xs                          
        siz += _str_sizehint(x)          
    end                                  
    # specialized for performance reasons     # original comment
    s = IOBuffer(sizehint=siz)   # allocates an IOBuffer with an internal buffer
    for x in xs                          
        print(s, x)   # prints to the IOBuffer                   
    end                                  
    String(resize!(s.data, s.size))  # allocates a new string
end                                      

This function iterates over its arguments (twice, at that!). Since we have a mix of types in there, that loop is necessarily type unstable: each iteration over xs might yield an x of a different type - there’s not much we can do about that, unfortunately.

Judging from the comment, it seems like performance considerations were already made when this code was written - so there must be something tricky going on. We can still do the same kind of analysis of this code as we could with our original stupidloop though - after all, this is just regular ol’ julia code, like any we would write ourselves!

The first loop seems to get some sort of size estimate for the IOBuffer (just like we did earlier with our preallocation of the output size). It’s just adding numbers, so there are no allocations here. The IOBuffer should thus already be created with the correct size (which should be one allocation again). We’re passing in a String, an Int64 and another String. How are those size heuristics determined for those types? Let’s check @edit Base._str_sizehint(1) and find out:

function _str_sizehint(x)                         
    if x isa Float64                              
        return 20                                 
    elseif x isa Float32                          
        return 12                                 
    elseif x isa String || x isa SubString{String}
        return sizeof(x)                          
    elseif x isa Char                             
        return ncodeunits(x)                      
    else                                          
        return 8                                  
    end                                           
end                                               

Ok, so this function takes anything (the x is untyped, so Any), checks its type and returns some value based on that. For String, that’s just the size of the string. This is already the perfect value, since that’s precisely how much space it takes. For Int64 though, we fall back to the else return 8 clause - but surely our number of base-10 digits can be larger than 8 for large numbers, right? ceil(Int, log10(typemax(Int64))) = 19, plus one for a potential sign character, so we may need up to 20 characters for any given Int64. If we underestimate the size here, the IOBuffer in print_to_string that called this method would be too small and thus would have to grow its internal buffer by allocating. This affects numbers larger than 99,999,999, i.e. numbers greater than ~2^26.5 (the majority of Int64 values, since they go up to 2^63). To prevent that, I started a PR by simply returning 20 (the maximum size necessary) in the case of UInt64 or Int64, which is the PR linked above.
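A sketch of what that fix might look like (the name _str_sizehint_fixed is made up for illustration; the actual PR modifies Base._str_sizehint itself):

```julia
# Sketch of the sizehint fix: 64-bit integers can need up to
# 19 decimal digits plus a sign, so reserve 20 bytes for them.
function _str_sizehint_fixed(x)
    if x isa Float64
        return 20
    elseif x isa Float32
        return 12
    elseif x isa String || x isa SubString{String}
        return sizeof(x)
    elseif x isa Char
        return ncodeunits(x)
    elseif x isa Int64 || x isa UInt64
        return 20   # ceil(Int, log10(typemax(Int64))) = 19, plus a sign
    else
        return 8
    end
end

# The estimate now covers the worst case for Int64:
@assert _str_sizehint_fixed(typemax(Int64)) >= ncodeunits(string(typemax(Int64)))
```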

This doesn’t actually help our case though, since 1234569 (the maximum value that would be passed to string from our loop) is far smaller than 2^26.5 - so there has to be something else going on. Going back to the print_to_string we had above:

Click for the function again
function print_to_string(xs...)          
    if isempty(xs)                       
        return ""    # our xs is not empty, so we can ignore this                    
    end                                  
    siz::Int = 0                         
    for x in xs                          
        siz += _str_sizehint(x)          
    end                                  
    # specialized for performance reasons     # original comment
    s = IOBuffer(sizehint=siz)   # allocates an IOBuffer with an internal buffer
    for x in xs                          
        print(s, x)   # prints to the IOBuffer                   
    end                                  
    String(resize!(s.data, s.size))  # allocates a new string
end  

There’s not much left to check here, so let’s write a small test function that mimics this behaviour:

function test_io(io_buffer, xs...)
   for x in xs
       print(io_buffer, x)
   end
end

We do nothing more than print to the IOBuffer, since that’s the only place where we haven’t spotted an allocation yet. String(resize!(s.data, s.size)) only allocates once, IOBuffer(sizehint=siz) allocates once (that’s the internal buffer; the allocation of the IOBuffer itself can be elided since it doesn’t escape print_to_string. This sadly isn’t quite possible for the internal Vectors…), so we’re still missing 3 allocations that can’t otherwise be explained. Sure enough, the function allocates those:

julia> @btime test_io(io, vals...) setup=(io=IOBuffer(sizehint=200);vals=["str_list, test_", rand(3:1234569), "_somethingsomething"]) evals=1
  1.600 μs (3 allocations: 112 bytes)                                                                                                                                                                                          

At this point, we’ve nailed it down to print allocating when writing into an IOBuffer (which it shouldn’t! There’s already an issue for that though, and fixing that one is indeed more complicated.). We’ve done nothing more than checking the source code, repeatedly asking “what could possibly allocate?”, testing that hypothesis, and starting again one level deeper. To me, that’s much more accessible than having to guess, especially since all the tools used are the exact same tools I’d use to debug my own code. Sometimes that debugging just happens to lead to finding inefficiencies in Base, and to me that’s a good thing, because it means that this stuff is accessible to “regular” developers of code. There’s no special knowledge needed to find this. Of course, that doesn’t mean that we can fix this ourselves every time like in the case of _str_sizehint, but reporting those kinds of fundamental issues is often a gateway to improvements that can help in a bunch of other cases that hit those code paths as well.
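The same hypothesis-testing loop works on any snippet; Base’s @allocated macro is handy for it (a sketch only - exact byte counts depend on the Julia version, so we just check that the string creation allocates at all):

```julia
# Hypothetical helper wrapping the interpolation from the loop:
function interp(i)
    "str_list, test_$(i + 3)_somethingsomething"
end

interp(1)                          # warm up so compilation isn't measured
bytes = @allocated interp(1234566) # bytes allocated by one call
@assert bytes > 0                  # creating the string must allocate
```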


I don’t know how the Rust version works in detail (mostly because I don’t know Rust), but judging from the way it’s written (the interpolated argument gets passed as its own argument), the following should be much closer to what Rust is doing:

using Printf # stdlib

function stupidloop(num_iterations)
      str_list::Vector{String} = []   # allocate a new empty Vector
      for i in 0:num_iterations-1
           push!(str_list, @sprintf "str_list, test_%i_somethingsomething" (i+3)) # push dynamic strings (allocates)
      end

      total_length::Int = 0

      for line in str_list
          total_length += length(line) # accumulate length into a variable (no allocation)
      end

      total_length
end

and indeed, that’s 0.392s on my machine. I hope the “regular” string interpolation can catch up to this! Until then, I’ll just use the Printf stdlib for performance critical string work.

Well, that was a deep dive! I hope this was insightful to you. :slight_smile:

30 Likes

Yes it was helpful, thanks for all that info! I’ll have to review it to make sure I follow everything but already quite interesting. Cheers.

First, thank you very much for the nice class there!

But why isn’t the “regular” interpolation something simpler, like what I posted above, which is already faster?

(In similar situations I have often found that the “regular” implementation is faster for most cases, just not for the very particular one used in the test being discussed - usually small toy problems.

Now this is something I will test: is string interpolation still slower than the alternatives if many values and strings are interpolated at once?)

Add-on:

For example, if one adds the interpolation of something that is not an integer, the Julia implementation is way faster than the Python one:

julia> function stupidloop(num_iterations)
           str_list = String[]
           for i in 0:num_iterations-1
               push!(str_list, "str_list $(3.14)")
           end
           total_length = 0
           for line in str_list
               total_length += length(line)
           end
           total_length
       end
stupidloop (generic function with 1 method)

julia> stupidloop(1234)
16042

julia> @benchmark stupidloop(1234)
BenchmarkTools.Trial: 
  memory estimate:  765.14 KiB
  allocs estimate:  6181
  --------------
  minimum time:     249.096 μs (0.00% GC)
  median time:      264.541 μs (0.00% GC)
  mean time:        291.903 μs (7.54% GC)
  maximum time:     2.453 ms (86.09% GC)
  --------------
  samples:          10000
  evals/sample:     1


In [37]: def stupidloop(num_iterations):
    ...:     str_list = []
    ...:     for i in range(num_iterations):
    ...:       str_list.append(f"str_list {3.14}")
    ...: 
    ...:     total_length = 0
    ...: 
    ...:     for line in str_list:
    ...:         total_length += len(line)
    ...: 
    ...: 
    ...:     return total_length
    ...: 

In [38]: stupidloop(1234)
Out[38]: 16042

In [39]: %timeit stupidloop(1234)
511 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The string IO/processing code is quite old at this point (stuff from 2015 mixed with a small improvement in 2018), but it also has a different characteristic in mind: Large numbers of interpolations. Your version is indeed faster for the small case, but I think it’s going to be slower for large numbers of arguments that aren’t already strings, since those will have to allocate a string representation first, followed by an allocation for the result of the concatenation. That’s going to be a performance bottleneck. You can find some more discussion about the original reason this code was written the way it was here. The goal seems to have been to speed it up for mixed arguments (which is reasonable, considering this gets called for interpolation).
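The allocation argument can be sketched like this (illustrative only; the function names are made up - this is not the actual Base implementation):

```julia
# Repeated concatenation allocates an intermediate string per step:
function concat_all(xs...)
    s = ""
    for x in xs
        s *= string(x)       # new string allocated every iteration
    end
    s
end

# An IOBuffer collects everything and allocates the result once:
function buffer_all(xs...)
    io = IOBuffer()
    for x in xs
        print(io, x)         # appends into one growing buffer
    end
    String(take!(io))        # single final string allocation
end

# Both produce the same result; for many mixed arguments the
# buffer version avoids the chain of intermediate strings.
@assert concat_all("a ", 1, " b ", 2.5) == buffer_all("a ", 1, " b ", 2.5)
```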

2 Likes

Indeed. The Python implementation is, at the same time, fast for integers but quite slow for floats. I think there are always trade-offs when implementing these things while envisaging all the possible applications they may have in the future.