# Performance of splitting string and parsing numbers

Given a string of numbers grouped in blocks, I want to parse the numbers, compute the total of each block, and return the largest total. Blocks are separated by two newline chars, and each number is separated by a single newline char, e.g.

```julia
data = "1\n2\n\n3\n4"
```

My solution is extremely simple:

```julia
function solve(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end
```
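As a quick sanity check (my addition, reusing the `solve` function above as-is), on the toy input the block totals are 1 + 2 = 3 and 3 + 4 = 7:

```julia
data = "1\n2\n\n3\n4"

function solve(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end

solve(data)  # returns 7, the larger of the two block totals
```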

# The Problem

But it’s slow. Here’s the test result from my dataset (255 blocks, 2000 numbers overall):

```julia
julia> @btime solve($data)
  309.959 μs (3757 allocations: 234.77 KiB)
```

The exact same algorithm in Rust is 10x faster:

```rust
pub fn solve() -> i64 {
    return include_str!("../input.txt")
        .split("\n\n")
        .map(|e| e.lines().map(|c| c.parse::<i64>().unwrap()).sum::<i64>())
        .max()
        .unwrap();
}
```

The Rust Criterion benchmark result is as follows:

```text
solve                  time:   [30.432 µs 30.455 µs 30.481 µs]
```

# An iterative solution

Of course, we Julians can always make our code run fast! So I rewrote the code to solve the problem manually, without any iterators or `eachsplit` calls:

```julia
function solve2(data)
    buf = IOBuffer(data)
    largest = 0
    total = 0
    for line in eachline(buf)
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end
```

The result is better, but still not great: it’s off by 4x from the Rust benchmark:

```julia
julia> @btime solve2($data)
  120.917 μs (4520 allocations: 335.45 KiB)
```

If I just use the regular `split` function rather than creating an `IOBuffer` and using `eachline`, the number of allocations is much smaller but the runtime is comparable.

```julia
function solve3(str)
    largest = 0
    total = 0
    for line in split(str, "\n")
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end
```
```julia
julia> @btime solve3($data)
  125.000 μs (7 allocations: 160.94 KiB)
```
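One cheap tweak worth trying, though this is my own untested assumption rather than something benchmarked in this thread: splitting on a `Char` delimiter instead of the string `"\n"` lets `eachsplit` do a simple single-byte search instead of substring matching. A sketch (the `solve3_char` name is made up here):

```julia
# Hypothetical variant of solve3 using a Char delimiter.
# Splitting on '\n' yields the same lines as splitting on "\n";
# empty lines mark the boundaries between blocks.
function solve3_char(str)
    largest = 0
    total = 0
    for line in eachsplit(str, '\n')
        if isempty(line)  # in between two newline characters
            largest = max(largest, total)
            total = 0
        else
            total += parse(Int, line)
        end
    end
    return max(largest, total)  # handle last block
end

solve3_char("1\n2\n\n3\n4")  # returns 7
```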

# Too good to be true

Interestingly, if I create the `IOBuffer` separately and pass it in, the runtime improves by roughly 1,000x.

```julia
function solve1(buf)
    largest = 0
    total = 0
    for line in eachline(buf)
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end
```
```julia
julia> @btime solve1($buf) setup=(seekstart($buf))
  138.951 ns (5 allocations: 383 bytes)
```

That seems too good to be true… And, it does not add up. Just creating an `IOBuffer` wouldn’t take 120 μs. Maybe I’m not measuring it correctly? I don’t really trust this result.

# Optimization ideas?

The final question is how to make it run as fast as Rust. I would be fine if it’s just off by 50% but there’s a 4x difference at this point.

Was this supposed to be `data`?

Sorry, copy/paste error. `data` is just a string. The IOBuffer is created from `data` rather than `input()`.

@Elrod I’ve updated the post (again)

I believe so — you’re missing `evals=1` in the benchmark settings:

```julia
julia> io = IOBuffer(input());

julia> @btime solve1($io) setup=(seekstart($io))
  141.788 ns (5 allocations: 385 bytes)
75501

julia> @btime solve1($io) setup=(seekstart($io)) evals=1
  122.917 μs (4518 allocations: 335.32 KiB)
75501
```
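The reason is that BenchmarkTools runs `setup` once per *sample*, while each sample may contain many *evaluations*. After the first evaluation the `IOBuffer` sits at EOF, so the remaining evaluations iterate zero lines and return in nanoseconds. A minimal illustration of that statefulness:

```julia
# An IOBuffer is a stateful stream: reading it once exhausts it.
buf = IOBuffer("1\n2\n\n3\n4")

first_pass  = collect(eachline(buf))  # ["1", "2", "", "3", "4"]
second_pass = collect(eachline(buf))  # String[] -- the stream is at EOF

seekstart(buf)                        # rewinding restores the data
third_pass  = collect(eachline(buf))  # ["1", "2", "", "3", "4"] again
```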

If I’m reading the output of `@profview` correctly, all time in your first `solve` function is spent in type inference.

I’m not a Rust expert, but are you sure this is doing anything at runtime? If I put this code into a program

```rust
pub fn solve() -> i64 {
    return include_str!("/tmp/input.txt")
        .split("\n\n")
        .map(|e| e.lines().map(|c| c.parse::<i64>().unwrap()).sum::<i64>())
        .max()
        .unwrap();
}

fn main() {
    println!("{}", solve());
}
```

after I compile this program, whenever I run it I always get the same result, even if I edit the input file, adding some garbage to make parsing fail or some very large numbers to change the result. Or is it only `include_str!` that is inlined at compile time, while the rest runs at runtime?


I was curious and played around a little bit.

I noticed that a significant amount of time is spent on iterating over `eachsplit` or `split` in general.

It’s really just a guess since I am not really a string expert, but Julia natively supports Unicode, so maybe that additional overhead causes the difference between Rust’s `split` and Julia’s `split` — although Rust also supports UTF-8 in `str`.
Btw., a couple of years ago we had `ASCIIString` in Julia, which was removed somewhere before v1.0. I am wondering if that would make any difference.

…but again, that’s just a guess.

If you compare this (`data` is the content of the provided gist file above):

```julia
function solve(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end
```

I get

```julia
  491.542 μs (3757 allocations: 234.77 KiB)
```

and this one

```julia
@btime collect(eachsplit($data, "\n\n"))
```

```julia
  167.167 μs (2264 allocations: 140.75 KiB)
```

which alone is 5 times slower than Rust’s total runtime. And then there is the nested `split` inside `solve`, which adds another ~150 µs to the Julia implementation’s overall time.

I am not familiar with the Rust implementation, but to me it looks like Rust’s `split`, and probably also its `parse`, are doing a better job at string processing — still, 10x faster is really suspicious.

I did a quick check with a crude bytes-parsing implementation:

```julia
function splitbytes(bytes, seq)
    out = Vector{Vector{UInt8}}()
    seq_n = length(seq)
    seq_idx = 1
    chunk = Vector{UInt8}()
    for byte ∈ bytes
        # sequence found, pushing everything before it to `out`
        if byte == seq[seq_idx] && seq_n == seq_idx
            push!(out, chunk[1:end-(seq_n-1)])
            seq_idx = 1
            chunk = Vector{UInt8}()
            continue
        end
        # reset sequence matcher if consecutive byte is not matching
        if byte != seq[seq_idx] && seq_n > 1
            seq_idx = 1
        end
        # increase sequence index to match next time
        if byte == seq[seq_idx]
            seq_idx += 1
        end
        push!(chunk, byte)
    end
    push!(out, chunk)
    out
end
```

and then ran it over the very same data. I only get a tiny improvement (20% or so) in runtime, even though the `Int64` parsing and the `maximum` determination are still missing:

```julia
function solvebytes(bytes)
    splitbytes.(splitbytes(bytes, [0xA, 0xA]), [0xA])
end
```
```julia
@btime solvebytes($bytes)
  399.708 μs (7504 allocations: 460.39 KiB)
```

Anyways, are you sure that Rust is not doing some caching behind the scenes during compilation? To me that’s the only plausible explanation, since I don’t see how to do this iteration, comparisons included, almost 10x faster…

Ha, you are right! You can even remove `input.txt` and it will still run and give the same result. It seems it reads the file at compile time and treats it as a constant (or at least bakes it into the binary — it still takes 40 µs to run, so something does happen at runtime).

`include_str!` is a macro invocation.

The file is located relative to the current file (similarly to how modules are found). The provided path is interpreted in a platform-specific way at compile time. So, for instance, an invocation with a Windows path containing backslashes \ would not compile correctly on Unix.

This macro will yield an expression of type `&'static str` which is the contents of the file.

`std::fs::read_to_string`, by contrast, can often be a drop-in replacement for `include_str!`, switching it to be a run-time rather than compile-time operation.


The Rust macro inlines the string at compile time. For a fair comparison, I manually inline the same content as a Julia string so it does not involve any I/O.

Type inference should happen only once, though? I want to benchmark the runtime, not compile time…

This is faster (but not very generic):

```julia
julia> @btime max_num($s)
  24.435 μs (0 allocations: 0 bytes)
75501
```

I’ve found that iterating over code units, when you can, is faster. Also, `Base.parse` could be faster — it’s not hard to beat it for parsing integers out of byte strings. Here is an example, although you can get much of the performance gain with much less complexity.

The code is here:

```julia
to_digit(char_code::Unsigned) = Int(char_code - UInt('0'))
to_digit(char::Char) = to_digit(UInt(char))

_isdigit(char_code::Unsigned) = char_code <= UInt('9') && char_code >= UInt('0')

function _parse_int(v, istart, istop)
    n = 0
    t = 1
    @inbounds for i in istop:-1:istart
        n += to_digit(v[i]) * t
        t *= 10
    end
    return n
end

function max_num(_codeunits)
    istart = 0
    istop = 0
    inword = false
    max_num = 0
    block_num = 0
    @inbounds for i in 1:length(_codeunits)
        if _isdigit(_codeunits[i])
            istop = i
            if !inword
                istart = i
                inword = true
            end
        else
            if inword
                block_num += _parse_int(_codeunits, istart, istop)
            else
                if block_num > max_num
                    max_num = block_num
                end
                block_num = 0
            end
            inword = false
        end
    end
    if block_num > max_num
        max_num = block_num
    end
    return max_num
end

max_num(str::AbstractString) = max_num(codeunits(str))
```

Give this a try:

```julia
function solvelox(data)
    v = transcode(UInt8, data)
    largest = 0
    total = 0
    totpar = 0
    pre = true
    for e in v
        if (e == 0x0a) && pre
            largest = max(largest, total)
            total = 0
            totpar = 0
        elseif (e == 0x0a) && !pre
            total = total + totpar
            pre = true
            totpar = 0
        else
            totpar = totpar * 10 + e - 0x30
            pre = false
        end
    end
    return max(largest, total + totpar)
end
```
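The key trick in `solvelox` is that it never materializes substrings: each number is built up digit by digit as the bytes stream past. Isolated as a sketch (the `parse_ascii_int` helper name is my own, not part of the post):

```julia
# Accumulate a base-10 value directly from ASCII digit bytes:
# each new digit shifts the running value left by one decimal place.
function parse_ascii_int(bytes)
    n = 0
    for b in bytes
        n = n * 10 + (b - UInt8('0'))  # '0' is 0x30 in ASCII
    end
    return n
end

parse_ascii_int(codeunits("12345"))  # returns 12345
```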

`Parsers.parse` is much better than the last time I tried it.

```julia
julia> data = "11\n21\n\n3\n34\n5\n\n16\n17\n10"
"11\n21\n\n3\n34\n5\n\n16\n17\n10"

julia> bigdata = (data * "\n\n")^255
"11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n" ⋯ 5877 bytes ⋯ "11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n"

julia> @btime solvelox(bigdata)
  6.960 μs (0 allocations: 0 bytes)
43

julia> @btime max_num(bigdata)
  17.300 μs (0 allocations: 0 bytes)
43
```

This is almost twice as fast as the solution I posted. The code I posted can be made faster by annotating functions with `@inline`, but not twice as fast.

One possible difference is that in my code, code units are read from the array twice. (Still, they are never copied so there are no allocations.)

More interesting would be to find a way to match the performance of Rust with higher-level tools. Or to create such tools.

Exactly. If Rust can do it efficiently with high level iterators and functional programming in just 5 lines of code, it’s hard to believe that Julia cannot match that.


Nice algorithm.

Maybe this feels a bit more “high-level”:

```julia
hilox(data) =
    foldl((((largest, total, running), primed), c) ->
              ifelse(c == UInt8('\n'),
                     ifelse(primed,
                            ((max(largest, total), 0, 0), false),
                            ((largest, total + running, 0), true)),
                     ((largest, total, 10 * running + (c - UInt8('0'))), false)),
          Iterators.flatten((transcode(UInt8, data), UInt8('\n'), UInt8('\n')));
          init = ((0, 0, 0), false)) |> first |> first
```

This benchmarks the same as the fast loopy algorithm, and it’s actually a one-liner. Eat dust, Rust! (Just kidding, Rust is also good.)
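As a sanity check (my addition), the fold agrees with the straightforward splitting solutions on the toy input from the top of the thread:

```julia
# The fold carries ((largest, total, running), primed) through every byte;
# the two appended '\n' bytes flush the final number and final block.
hilox(data) =
    foldl((((largest, total, running), primed), c) ->
              ifelse(c == UInt8('\n'),
                     ifelse(primed,
                            ((max(largest, total), 0, 0), false),
                            ((largest, total + running, 0), true)),
                     ((largest, total, 10 * running + (c - UInt8('0'))), false)),
          Iterators.flatten((transcode(UInt8, data), UInt8('\n'), UInt8('\n')));
          init = ((0, 0, 0), false)) |> first |> first

hilox("1\n2\n\n3\n4")  # returns 7, the larger block total
```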

Haha, nice one @Dan, a little bit of Lisp feeling in those lines.

All the nice specialised implementations above which outperform Rust are evidence that the compiler is pretty solid and does some quite nice optimizations. I agree with @tk3369 that Julia should be able to do this kind of high-level crunching too. To be honest, I would be less surprised if Rust didn’t and Julia did — that’s why I stuck to the straightforward implementation in my example above, because I thought Rust would do something similar.

Anyways, Rust surprises me more often than I thought. I should spend more time on it, I guess.