Help with binary trees benchmark games example

you mean String, and yes, I think people recognize String related stuff isn’t one of Julia’s strongest area. Makes sense, given the limited resources the community have, String is not likely to be the bottleneck of “real-world” scientific computing work load anyway.

3 Likes

I can reproduce @Manuel_Bergmann’s benchmark results in Python 3.7 and Julia 1.5 on Ryzen 9. I also tried removing the print calls but it didn’t change the results.

1 Like

Imo it would be unfortunate if Julia didn’t expand beyond scientific computing. With some effort in various areas it could be appealing to a much broader community, like game dev, GUI, etc.

1 Like

String handling is performance sensitive in many cases. For example people have put a lot of effort into fast parsing (as in simdjson).

4 Likes

Databases are another good example of performance critical string processing.

3 Likes

Adding type annotations only helps a little.

function stupidloop(num_iterations)
           str_list::Array{String} = []
           for i in 0:num_iterations-1
               push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
           end

           total_length::Int = 0

           for line in str_list
               total_length += length(line)
           end

       total_length
       end


julia> @benchmark stupidloop(1234567)
BenchmarkTools.Trial: 
  memory estimate:  374.92 MiB
  allocs estimate:  7406915
  --------------
  minimum time:     925.693 ms (33.29% GC)
  median time:      1.008 s (43.79% GC)
  mean time:        1.002 s (41.64% GC)
  maximum time:     1.094 s (46.58% GC)
  --------------
  samples:          5
  evals/sample:     1

each String element doesn’t have fixed number of bytes so this is somewhat expected

With a type declaration on str_list to placate @code_warntype and converting the number to a string concatenation before interpolation:

function stupidloop(num_iterations)
    str_list::Vector{String} = []
    for i in 0:num_iterations-1
        push!(str_list, "str_list, test_$(string(i + 3))_somethingsomething")
    end

    total_length = 0

    for line in str_list
        total_length += length(line)
    end

    println(total_length)
end

Results in:


BenchmarkTools.Trial: 
  memory estimate:  205.38 MiB
  allocs estimate:  3703731
  --------------
  minimum time:     153.273 ms (0.00% GC)
  median time:      260.305 ms (50.85% GC)
  mean time:        238.720 ms (41.12% GC)
  maximum time:     299.565 ms (55.47% GC)
  --------------
  samples:          21
  evals/sample:     1

Compared to a mean of ~300 ms for the Python version.

Looking at the difference between the generated IR in both versions, the int interpolation appears to invoke

Base.print_to_string("str_list, test_"::String, %17::Vararg{Any, N} where N, "_somethingsomething")::String

(defined here), while the one with a pre-conversion to string invokes

Base.string("str_list, test_"::String, %19::String, "_somethingsomething"::Vararg{String, N} where N)::String

(defined here). It would appear that going through the IO machinery is a good 2-3x slower than hitting the string-only concat path. I should also note that using * also lowers to string and performs comparably well (likely because you’re forced to convert i + 1 to use it).

This doesn’t make up the deficit with the Rust implementation, but at least it consistently beats the Python one. A quick glance at the LLVM IR didn’t have anything jump out, so I’ll defer to someone more knowledgeable about this.

3 Likes

I wonder why string interpolation goes through I/O machinery.

It doesn’t always, as evidenced by the string-only interpolation being converted into the optimized string(::Union{Char, String, SubString{String}}...) overload instead of the default one. You can see this via Meta.@lower:

julia> Meta.@lower "str_list, test_$(i + 3)_somethingsomething"
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = i + 3
│   %2 = Base.string("str_list, test_", %1, "_somethingsomething")
└──      return %2
))))

julia> Meta.@lower "str_list, test_$(string(i + 3))_somethingsomething"
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = i + 3
│   %2 = string(%1)
│   %3 = Base.string("str_list, test_", %2, "_somethingsomething")
└──      return %3
))))

It’s perhaps illustrative that Python does a similar thing (convert the number, then do a string concat):

>>> def f(i):
...   return f"str_list, test_{i + 3}_somethingsomething"
... 
>>> f(1)
'str_list, test_4_somethingsomething'
>>> import dis
>>> dis.dis(f)
  2           0 LOAD_CONST               1 ('str_list, test_')
              2 LOAD_FAST                0 (i)
              4 LOAD_CONST               2 (3)
              6 BINARY_ADD
              8 FORMAT_VALUE             0
             10 LOAD_CONST               3 ('_somethingsomething')
             12 BUILD_STRING             3
             14 RETURN_VALUE

For completeness, here’s a version with Printf.@sprintf:

julia> function stupidloop_int(num_iterations)
           str_list::Vector{String} = []
           for i in 0:num_iterations-1
               push!(str_list, @sprintf "str_list, test_%d_somethingsomething" i + 3)
           end

           total_length = 0

           for line in str_list
               total_length += length(line)
           end

           println(total_length)
       end
stupidloop_int (generic function with 1 method)

julia> @benchmark stupidloop_int(1234567)
...
BenchmarkTools.Trial: 
  memory estimate:  167.71 MiB
  allocs estimate:  2469164
  --------------
  minimum time:     241.034 ms (28.22% GC)
  median time:      275.573 ms (34.95% GC)
  mean time:        287.272 ms (35.18% GC)
  maximum time:     336.174 ms (41.12% GC)
  --------------
  samples:          18
  evals/sample:     1
1 Like

If all arguments are strings or substrings the total length of the resulting string can easily be calculated and allocated upfront. That’s what the code in julia/substring.jl at 6aaedecc447e3d8226d5027fb13d0c3cbfbfea2a · JuliaLang/julia · GitHub does. If there are other types in the string constructor the string is built piece by piece (with an IOBuffer).

The perf difference that sometimes occurs because of this is captured in Surprising performance difference between string(...) and *(...) for concatenation · Issue #29550 · JuliaLang/julia · GitHub.

As I said in that issue, it might be possible to be smarter here.

I think many people will disagree with you on that. Strings are important and Julia should be fast at them.

7 Likes

This is almost never the correct way to initialize an array in Julia because it has element type Any, which is slow because every operation needs runtime dispatch.
Explicit type annotation of str_list fixes this, as pointed out before.
An alternative is to define

str_list = String[]

As shown by @ToucheSir this change already gives better performance than Python.

2 Likes

Hm… that’s a weird argument - following that train of thought you could also claim that all other benchmarks don’t test Julia

Sorry, I may not have written that very well :wink: I wanted to emphasize 2 things:

  1. This benchmark has been quoted before with the takeaway “oh, julia is bad at calculating something like mandelbrot”, which isn’t the conclusion to draw from this benchmark, since its benchmarking a brainfuck implementation. So if there’s a take away, it should be that Julia is bad at writing brainfuck interpreters

  2. There is a Julia brainfuck implementation which is faster by an order of magnitude, making that benchmark pretty meaningless when talking about Julias performance compared to other languages

1 Like

I am frankly amazed by the very productive and positive discussion my post triggered! :slight_smile:

I disagree for the reasons people have already discussed - String performance is an issue for many applications.

I second this. At least it should not be surprisingly slow, when you don’t do xyz - xyz in this case being converting the int to str in the string interpolation before the interpolation happens. This is what I was trying to point out with my comment.

No it does not, as tested by @jzr and also by me - the main performance improvement @ToucheSir achieved stems from:

Which is not something you would normally do, want to do or even (should) know/have to worry about.

Yes, but that’s not the issue. The issue is: Why is this piece of code so slow, when there is seemingly nothing wrong with it - at least nothing the people here could point to (or did I misread the comments before)? Point is: Can this be completely rewritten to be “magnitudes faster”? Yes. Should the original code be that slow? Probably not, and people here were confused why it is. And since nobody seems to know why it is that slow, it will be hard to impossible to avoid this kind of slow code in reality without lots of experimentation to find out what is why slow and what is not - that’s the point I am making. :slightly_smiling_face:

1 Like

Yes, you are right. There is a significant performance improvement from specifying the array type (from 716 to 643 ms on my machine), but the effect of string conversion in interpolation is the larger one.

1 Like

I think some performance loss is due to push!/pop! always does a ccall · Issue #24909 · JuliaLang/julia · GitHub. This implementation does a large number of calls to push!, pop! etc which changes the array length and currently this calls into the Julia runtime which has a bit of overhead. In the other languages, they implement this themselves e.g.

Just hacking in something like that I got the code from 2.3s → 1.8s.

3 Likes

Though I appreciate the discussion, I’m not sure this is quite a fair characterization of the issue at hand. Breaking things down:

Both Kristoffer and I pointed out that this kind of numeric interpolation hits a slower path in the string processing code. If you look at the issue he linked, someone has already put in place a PR to address part of it. The remaining performance gap is discussed in this PR comment.

I think part of the reason you’re getting pushback on this kind of claim is because it gives off a bit of a “well it should be this way, so why are you all oblivious to it/sitting on your hands?” tone. There are a fair few topics on Discourse about not-so-great performance in the IO functions, so the real question is what would be required from a resourcing/implementation/design perspective to get in there and optimize them.

1 Like

The conclusion I draw from examples like these is broader:

  1. currently, writing straightforward code can have lower performance in Julia than pure CPython on some workloads;
  2. it can be hard for Julia novices to guess what causes their slowness;
  3. experts are sometimes available to help diagnose performance problems;
  4. once the cause is diagnosed, it is sometimes easy to make the code much faster with a small change like adding string().

It’s good that people have found a productive response to identifying issues as you point out. However, I think @Manuel_Bergmann’s comment is responding to a different aspect of community interactions, in which users from domain X push back against requests from domain Y who have a different computational workload. E.g. numeric work vs text work, batch jobs vs real-time jobs, etc. It makes sense to ensure performance experts’ precious time is focused on optimizing the code paths that people actually use, and to push back against optimizing for irrelevant benchmarks. Doing that is a good thing.

But of course nobody can be familiar with every domain or workload, so it can be helpful to explicitly communicate “this is a toy version of a real problem faced by my domain and I would like Julia to support it” in order to reach across the difference in perspectives. That’s what I think @Manuel_Bergmann has been working toward in his messages.

3 Likes

Hello, author of that PR here :slight_smile:

First of all, that PR really covers less than you seem to interpret here - it only affects integers larger than ~2^26.5 and definitely isn’t the main cause of the slowdown.

Secondly, I don’t think fixing the other issue I linked in that comment will magically fix performance - print_to_string is (out of necessity) type unstable here, so there will probably always be some dynamic dispatch that will slow things down compared to having only string type arguments.

I don’t think those are true. The most basic suggestions (e.g. String[]) are covered in the oft linked Performance Tips (though I have to say, that section should be cleaned up a little). Deeper investigation (regarding why string construction here is slow) was not more than basic @code_warntype and @time (wondering why allocations went up with larger inputs, seeing that it calculated them and noticing that the calculation wasn’t accurate in a large amount of cases). These are not hard steps to take - there is no C code to read, it’s all julia and nothing new to learn other than what should already be standard tooling for regular julia (package/project) development.

The reason your annotation didn’t help as much as you’d hoped (and other versions that still interpolate the integer get) is because Array{String} is an abstract type - it does not have the second type parameter (specifying the dimensionality). You were probably looking for Array{String, 1}, or equivalently (and what other people here used), Vector{String}.


Lastly, just because some people argue for why a change hasn’t been made doesn’t mean that change isn’t wanted at all - it’s just that there are not enough people “taking the leap” and trying to decipher these kinds of problems themselves (and subsequently trying to merge a partial fix) that there are only a limited number of people working on any part at a given time, most of which work on much bigger picture things than this specific edge case (that’s not to say these problems aren’t known - sometimes the good solutions are harder to implement though and require much more work from the true experts than it seems at first).

All that being said, I’d love for string and IO to be faster.

I hope this was helpful :slight_smile:

6 Likes

I agree that Vector{String} is what I should have used, but I don’t think it actually helps in this case?

julia> function stupidloop(num_iterations)
                  str_list::Array{String} = []
                  for i in 0:num_iterations-1
                      push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
                  end

                  total_length::Int = 0

                  for line in str_list
                      total_length += length(line)
                  end

              total_length
              end
stupidloop (generic function with 1 method)

julia> 

julia> using BenchmarkTools

julia> stupidloop(1234567)
49506155

julia> @benchmark stupidloop(1234567)
BenchmarkTools.Trial: 
  memory estimate:  374.92 MiB
  allocs estimate:  7406915
  --------------
  minimum time:     969.548 ms (37.52% GC)
  median time:      997.991 ms (40.08% GC)
  mean time:        1.016 s (41.16% GC)
  maximum time:     1.111 s (46.10% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> function stupidloop(num_iterations)
                  str_list::Vector{String} = []
                  for i in 0:num_iterations-1
                      push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
                  end

                  total_length::Int = 0

                  for line in str_list
                      total_length += length(line)
                  end

              total_length
              end
stupidloop (generic function with 1 method)

julia> stupidloop(1234567)
49506155

julia> @benchmark stupidloop(1234567)
BenchmarkTools.Trial: 
  memory estimate:  374.92 MiB
  allocs estimate:  7406915
  --------------
  minimum time:     986.241 ms (37.14% GC)
  median time:      1.095 s (44.64% GC)
  mean time:        1.072 s (42.75% GC)
  maximum time:     1.131 s (46.05% GC)
  --------------
  samples:          5
  evals/sample:     1

I’m not sure what I’m supposed to get from @code_warntype here?


julia> function stupidloop(num_iterations)
                  str_list::Vector{String} = []
                  for i in 0:num_iterations-1
                      push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
                  end

                  total_length::Int = 0

                  for line in str_list
                      total_length += length(line)
                  end

              total_length
              end
stupidloop (generic function with 1 method)

julia> @code_warntype stupidloop(1234567)
Variables
  #self#::Core.Compiler.Const(stupidloop, false)
  num_iterations::Int64
  str_list::Array{String,1}
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  total_length::Int64
  @_6::Union{Nothing, Tuple{String,Int64}}
  i::Int64
  line::String

Body::Int64
1 ─       Core.NewvarNode(:(total_length))
│         Core.NewvarNode(:(@_6))
│   %3  = Base.vect()::Array{Any,1}
│   %4  = Core.apply_type(Main.Vector, Main.String)::Core.Compiler.Const(Array{String,1}, false)
│   %5  = Base.convert(%4, %3)::Array{String,1}
│         (str_list = Core.typeassert(%5, %4))
│   %7  = (num_iterations - 1)::Int64
│   %8  = (0:%7)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(0, false), Int64])
│         (@_4 = Base.iterate(%8))
│   %10 = (@_4 === nothing)::Bool
│   %11 = Base.not_int(%10)::Bool
└──       goto #4 if not %11
2 ┄ %13 = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%13, 1))
│   %15 = Core.getfield(%13, 2)::Int64
│   %16 = str_list::Array{String,1}
│   %17 = (i + 3)::Int64
│   %18 = Base.string("str_list, test_", %17, "_somethingsomething")::String
│         Main.push!(%16, %18)
│         (@_4 = Base.iterate(%8, %15))
│   %21 = (@_4 === nothing)::Bool
│   %22 = Base.not_int(%21)::Bool
└──       goto #4 if not %22
3 ─       goto #2
4 ┄ %25 = Base.convert(Main.Int, 0)::Core.Compiler.Const(0, false)
│         (total_length = Core.typeassert(%25, Main.Int))
│   %27 = str_list::Array{String,1}
│         (@_6 = Base.iterate(%27))
│   %29 = (@_6 === nothing)::Bool
│   %30 = Base.not_int(%29)::Bool
└──       goto #7 if not %30
5 ┄ %32 = @_6::Tuple{String,Int64}::Tuple{String,Int64}
│         (line = Core.getfield(%32, 1))
│   %34 = Core.getfield(%32, 2)::Int64
│   %35 = total_length::Int64
│   %36 = Main.length(line)::Int64
│   %37 = (%35 + %36)::Int64
│   %38 = Base.convert(Main.Int, %37)::Int64
│         (total_length = Core.typeassert(%38, Main.Int))
│         (@_6 = Base.iterate(%27, %34))
│   %41 = (@_6 === nothing)::Bool
│   %42 = Base.not_int(%41)::Bool
└──       goto #7 if not %42
6 ─       goto #5
7 ┄       return total_length



julia> function stupidloop(num_iterations)
                  str_list::Array{String} = []
                  for i in 0:num_iterations-1
                      push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
                  end

                  total_length::Int = 0

                  for line in str_list
                      total_length += length(line)
                  end

              total_length
              end
stupidloop (generic function with 1 method)

julia> @code_warntype stupidloop(1234567)
Variables
  #self#::Core.Compiler.Const(stupidloop, false)
  num_iterations::Int64
  str_list::Array{String,1}
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  total_length::Int64
  @_6::Union{Nothing, Tuple{String,Int64}}
  i::Int64
  line::String

Body::Int64
1 ─       Core.NewvarNode(:(total_length))
│         Core.NewvarNode(:(@_6))
│   %3  = Base.vect()::Array{Any,1}
│   %4  = Core.apply_type(Main.Array, Main.String)::Core.Compiler.Const(Array{String,N} where N, false)
│   %5  = Base.convert(%4, %3)::Array{String,1}
│         (str_list = Core.typeassert(%5, %4))
│   %7  = (num_iterations - 1)::Int64
│   %8  = (0:%7)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(0, false), Int64])
│         (@_4 = Base.iterate(%8))
│   %10 = (@_4 === nothing)::Bool
│   %11 = Base.not_int(%10)::Bool
└──       goto #4 if not %11
2 ┄ %13 = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%13, 1))
│   %15 = Core.getfield(%13, 2)::Int64
│   %16 = str_list::Array{String,1}
│   %17 = (i + 3)::Int64
│   %18 = Base.string("str_list, test_", %17, "_somethingsomething")::String
│         Main.push!(%16, %18)
│         (@_4 = Base.iterate(%8, %15))
│   %21 = (@_4 === nothing)::Bool
│   %22 = Base.not_int(%21)::Bool
└──       goto #4 if not %22
3 ─       goto #2
4 ┄ %25 = Base.convert(Main.Int, 0)::Core.Compiler.Const(0, false)
│         (total_length = Core.typeassert(%25, Main.Int))
│   %27 = str_list::Array{String,1}
│         (@_6 = Base.iterate(%27))
│   %29 = (@_6 === nothing)::Bool
│   %30 = Base.not_int(%29)::Bool
└──       goto #7 if not %30
5 ┄ %32 = @_6::Tuple{String,Int64}::Tuple{String,Int64}
│         (line = Core.getfield(%32, 1))
│   %34 = Core.getfield(%32, 2)::Int64
│   %35 = total_length::Int64
│   %36 = Main.length(line)::Int64
│   %37 = (%35 + %36)::Int64
│   %38 = Base.convert(Main.Int, %37)::Int64
│         (total_length = Core.typeassert(%38, Main.Int))
│         (@_6 = Base.iterate(%27, %34))
│   %41 = (@_6 === nothing)::Bool
│   %42 = Base.not_int(%41)::Bool
└──       goto #7 if not %42
6 ─       goto #5
7 ┄       return total_length

I don’t think so. People see benchmarks that look irrelevant to them and argue against focusing on those benchmarks. This is fine and good, since performance optimization takes time away from other priorities and complicates the language implementation. So it’s useful to for users with different needs to communicate about which parts of the language they need to be fast.

1 Like