Performance of splitting strings and parsing numbers

Given a string of numbers that are grouped in blocks, I want to parse the numbers, find the total of each block, and then return the largest of the totals. Each block is separated by two newline chars, and each number is separated by a single newline char, e.g.

data = "1\n2\n\n3\n4"

My solution is extremely simple:

function solve(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end

The Problem

But it’s slow. Here’s the result from my dataset (255 blocks, 2,000 numbers overall):

julia> @btime solve($data)
  309.959 μs (3757 allocations: 234.77 KiB)

The exact same algorithm in Rust is 10x faster:

pub fn solve() -> i64 {
    return include_str!("../input.txt")
        .split("\n\n")
        .map(|e| e.lines().map(|c| c.parse::<i64>().unwrap()).sum::<i64>())
        .max()
        .unwrap();
}

The Rust Criterion benchmark result is as follows:

solve                  time:   [30.432 µs 30.455 µs 30.481 µs]

An iterative solution

Of course, we Julians can always make our code run fast! So I rewrote the code to solve the problem manually, without any iterators or eachsplit:

function solve2(data)
    buf = IOBuffer(data)
    largest = 0
    total = 0
    for line in eachline(buf)
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end

The result is better, but not great yet: it’s still off by 4x from the Rust benchmark:

julia> @btime solve2($data)
  120.917 μs (4520 allocations: 335.45 KiB)

If I just use the regular split function rather than creating an IOBuffer and using eachline, the number of allocations is much smaller but the runtime is comparable.

function solve3(str)
    largest = 0
    total = 0
    for line in split(str, "\n")
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end
julia> @btime solve3($data)
  125.000 μs (7 allocations: 160.94 KiB)
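
For what it’s worth, a lazy variant along the same lines (just a sketch; solve3_lazy is an illustrative name and I haven’t benchmarked it here) would avoid materializing the vector of substrings that split allocates:

function solve3_lazy(str)
    largest = 0
    total = 0
    # eachsplit is lazy; keepempty=true keeps the blank lines between blocks
    for line in eachsplit(str, '\n'; keepempty=true)
        if isempty(line)  # in between two newline characters
            largest = max(largest, total)
            total = 0
        else
            total += parse(Int, line)
        end
    end
    return max(largest, total)  # handle last block
end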

Too good to be true

Interestingly, if I create the IOBuffer separately and pass it in, the runtime improves by roughly 1,000x.

function solve1(buf)
    largest = 0
    total = 0
    for line in eachline(buf)
        if line == ""  # in between two newline characters
            # Update largest number and reset for next block
            largest = max(largest, total)
            total = 0
            continue
        end
        val = parse(Int, line)
        total += val
    end
    # handle last one
    largest = max(largest, total)
    return largest
end
julia> @btime solve1($buf) setup=(seekstart($buf))
  138.951 ns (5 allocations: 383 bytes)

That seems too good to be true… and it does not add up: just creating an IOBuffer wouldn’t take 120 μs. Maybe I’m not measuring it correctly? I don’t really trust this result.

Optimization ideas?

The final question is how to make it run as fast as Rust. I would be fine if it’s just off by 50% but there’s a 4x difference at this point.

You can find the test data here: https://gist.githubusercontent.com/tk3369/566642a9fb13a8f705c7822ce14a8a16/raw/ccdfed4f63b58a75431677a4cd2b73a351e084c2/gistfile1.txt

Was this supposed to be data?

Sorry, copy/paste error. data is just a string. The IOBuffer is created from data rather than input().

@Elrod I’ve updated the post (again) :slight_smile:

I believe so; you’re missing evals=1 in the benchmark settings:

julia> io = IOBuffer(input());

julia> @btime solve1($io) setup=(seekstart($io))
  141.788 ns (5 allocations: 385 bytes)
75501

julia> @btime solve1($io) setup=(seekstart($io)) evals=1
  122.917 μs (4518 allocations: 335.32 KiB)
75501
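
The reason: setup runs once per sample, but the timed expression runs evals times per sample, so with the default evals every evaluation after the first reads an already-exhausted buffer and returns almost immediately. A minimal sketch of the statefulness (toy data):

io = IOBuffer("1\n2\n\n3\n4")
collect(eachline(io))  # ["1", "2", "", "3", "4"]; this read exhausts the buffer
collect(eachline(io))  # String[]; nothing left until we rewind
seekstart(io)          # rewind; this is what setup must do before every evaluation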

If I’m reading the output of @profview correctly, all time in your first solve function is spent in type inference.
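
Roughly how one can check this (a sketch, assuming ProfileView.jl or the VS Code Julia extension for @profview):

using ProfileView  # assumed; provides @profview

solve(data)                  # compile once first, so the profile is mostly runtime
@profview for _ in 1:1000    # repeat to collect enough samples
    solve(data)              # look for frames under typeinf in the flame graph
end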

I’m not a Rust expert, but are you sure this is doing anything at runtime? If I put this code into a program

pub fn solve() -> i64 {
    return include_str!("/tmp/input.txt")
        .split("\n\n")
        .map(|e| e.lines().map(|c| c.parse::<i64>().unwrap()).sum::<i64>())
        .max()
        .unwrap();
}

fn main() {
   println!("{}", solve());
}

after I compile this program, whenever I run it I always get the same result, even if I edit the input file, adding some garbage to make parsing fail or some very large numbers to change the result. Or is it only include_str! that is inlined at compile time, while the rest runs at runtime?


I was curious and played around a little bit.

I noticed that a significant amount of time is spent on iterating over eachsplit or split in general.

It’s really just a guess since I am not really a string expert, but Julia natively supports Unicode, so maybe that additional overhead causes the difference between Rust’s split and Julia’s split, although Rust also supports UTF-8 in str.
Btw. a couple of years ago we had ASCIIString in Julia, which was deprecated sometime before v1.0. I am wondering if that would make any difference :wink:

…but again, that’s just a guess.

If you compare this (data is the content of the provided gist file above):

function solve(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end

I get

  491.542 μs (3757 allocations: 234.77 KiB)

and this one

@btime collect(eachsplit($data, "\n\n"))

already yields

167.167 μs (2264 allocations: 140.75 KiB)

which is 5 times slower than Rust’s total run time. Then we even have another nested split inside solve, which adds another 150 μs or so to the overall time in the Julia implementation.

I am not familiar with the Rust implementation, but to me it looks like its split and probably also parse are doing a better job in string processing; still, 10x faster is really suspicious.

I did a quick check with a crude bytes-parsing implementation:

function splitbytes(bytes, seq)
    out = Vector{Vector{UInt8}}()
    seq_n = length(seq)
    seq_idx = 1
    chunk = Vector{UInt8}()
    for byte ∈ bytes
        # sequence found, pushing everything before it to `out`
        if byte == seq[seq_idx] && seq_n == seq_idx
            push!(out, chunk[1:end-(seq_n-1)])
            seq_idx = 1
            chunk = Vector{UInt8}()
            continue
        end
        # reset sequence matcher if consecutive byte is not matching
        if byte != seq[seq_idx] && seq_n > 1
            seq_idx = 1
        end
        # increase sequence index to match next time
        if byte == seq[seq_idx]
            seq_idx += 1
        end
        push!(chunk, byte)
    end
    push!(out, chunk)
    out
end

and then ran that over the very same data; I only get a tiny improvement (20% or so) in runtime, and the Int64 parsing and maximum determination are still missing:

function solvebytes(bytes)
    splitbytes.(splitbytes(bytes, [0xA, 0xA]), [0xA])
end
@btime solvebytes($bytes)
  399.708 μs (7504 allocations: 460.39 KiB)

Anyways, are you sure that Rust is not somehow doing some caching behind the scenes during compilation? To me it’s the only plausible explanation, since I don’t know how to do this iteration, comparisons included, almost 10x faster…

Ha, you are right :wink: You can even remove input.txt and it will still run and give the same result. It seems it reads in the file at compile time and treats it as a constant (or at least bakes it into the binary; it still takes 40 μs to run, so something does happen at runtime).

include_str! is a macro invocation.

The file is located relative to the current file (similarly to how modules are found). The provided path is interpreted in a platform-specific way at compile time. So, for instance, an invocation with a Windows path containing backslashes \ would not compile correctly on Unix.

This macro will yield an expression of type &'static str which is the contents of the file.

It would be informative to use load_str! instead:

This macro can often be a drop-in replacement for include_str!, switching it to be a run-time rather than compile-time operation.


The Rust macro inlines the string at compile time. For a fair comparison, I manually inline the same content as a Julia string so it does not involve any I/O.
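
Concretely, the setup is along these lines (a sketch; in reality the literal is the full gist content):

# The data lives in the program as a plain String, so the benchmark
# measures only splitting/parsing, with no I/O involved, matching Rust
# after include_str! has baked the file into the binary.
data = "1\n2\n\n3\n4"

using BenchmarkTools
@btime solve($data)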

Type inference should happen only once, though? I want to benchmark the runtime, not compile time…

This is faster (but not very generic):

julia> @btime max_num($s)
  24.435 μs (0 allocations: 0 bytes)
75501

I’ve found that iterating over codeunits, if you can, is faster. Base.parse could also be faster; it’s not hard to write a faster parser for these simple byte strings. Here is an example, although you can get much of the performance gain with much less complexity.

The code is here:

to_digit(char_code::Unsigned) = Int(char_code - UInt('0'))
to_digit(char::Char) = to_digit(UInt(char))

_isdigit(char_code::Unsigned) = char_code <= UInt('9') && char_code >= UInt('0')

function _parse_int(v, istart, istop)
    n = 0
    t = 1
    @inbounds for i in istop:-1:istart
        n += to_digit(v[i]) * t
        t *= 10
    end
    return n
end

function max_num(_codeunits)
    istart = 0
    istop = 0
    inword = false
    max_num = 0
    block_num = 0
    @inbounds for i in 1:length(_codeunits)
        if _isdigit(_codeunits[i])
            istop = i
            if !inword
                istart = i
                inword = true
            end
        else
            if inword
                block_num += _parse_int(_codeunits, istart, istop)
            else
                if block_num > max_num
                    max_num = block_num
                end
                block_num = 0
            end
            inword = false
        end
    end
    # flush a trailing number in case the data does not end with a newline
    if inword
        block_num += _parse_int(_codeunits, istart, istop)
    end
    if block_num > max_num
        max_num = block_num
    end
    return max_num
end

max_num(str::AbstractString) = max_num(codeunits(str))
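
A quick sanity check on the toy string from the top of the thread:

max_num("1\n2\n\n3\n4")  # 7: blocks [1, 2] and [3, 4] total 3 and 7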

Give this a try:

function solvelox(data)
    v = transcode(UInt8, data)
    largest = 0
    total = 0
    totpar = 0   # value of the number currently being parsed
    pre = true   # true if the previous byte was a newline (or at the start)
    for e in v
        if (e == 0x0a) && pre
            # blank line: close the current block
            largest = max(largest, total)
            total = 0
            totpar = 0
        elseif (e == 0x0a) && !pre
            # end of a number: add it to the block total
            total = total + totpar
            pre = true
            totpar = 0
        else
            # accumulate decimal digits (0x30 is '0')
            totpar = totpar * 10 + e - 0x30
            pre = false
        end
    end
    return max(largest, total + totpar)
end


Use Parsers.parse
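
For example (a sketch, assuming the Parsers.jl package; same algorithm as the original solve):

using Parsers

# Identical to solve, with Base.parse swapped for Parsers.parse, which is
# specialized for fast parsing of numbers out of strings and byte buffers.
function solve_parsers(data)
    maximum(eachsplit(data, "\n\n")) do block
        sum(Parsers.parse(Int64, num) for num in eachsplit(block, "\n"))
    end
end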


Parsers.parse is much better than the last time I tried it.

julia> data = "11\n21\n\n3\n34\n5\n\n16\n17\n10"
"11\n21\n\n3\n34\n5\n\n16\n17\n10"

julia> bigdata=(data*"\n\n")^255
"11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n" ⋯ 5877 bytes ⋯ "1\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n11\n21\n\n3\n34\n5\n\n16\n17\n10\n\n"
julia> @btime solvelox(bigdata)
  6.960 μs (0 allocations: 0 bytes)
43


julia> @btime max_num(bigdata)
  17.300 μs (0 allocations: 0 bytes)
43


This is almost twice as fast as the solution I posted. The code I posted can be made faster by annotating functions with @inline, but not twice as fast.
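
For reference, that annotation is just (a sketch on the helpers above):

# Hint the compiler to inline the hot helpers called inside max_num's loop:
@inline to_digit(char_code::Unsigned) = Int(char_code - UInt('0'))
@inline _isdigit(char_code::Unsigned) = char_code <= UInt('9') && char_code >= UInt('0')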

One possible difference is that in my code, code units are read from the array twice. (Still, they are never copied so there are no allocations.)

More interesting would be to find a way to match the performance of Rust with higher-level tools, or to create such tools.

Exactly. If Rust can do it efficiently with high-level iterators and functional programming in just 5 lines of code, it’s hard to believe that Julia cannot match that.


Nice algorithm.

Maybe this feels a bit more “high-level”:

hilox(data) = 
  foldl((((largest, total, running), primed), c)->
    ifelse(c == UInt8('\n'),
      ifelse(primed,
        ((max(largest, total), 0, 0), false),
        ((largest, total + running, 0), true)),
      ((largest, total, 10*running + (c-UInt8('0'))), false)),
    Iterators.flatten((transcode(UInt8, data), UInt8('\n'), UInt8('\n'))); 
    init = ((0, 0, 0), false)) |> first |> first
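
The fold state is ((largest, total, running), primed): running accumulates the digits of the number currently being read, total is the running block sum, and primed records that the previous byte already closed a number; the two extra '\n' bytes appended via Iterators.flatten flush the final number and final block.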

This benchmarks the same as the fast loopy algorithm, and it’s actually a one-liner. Eat dust, Rust! (just kidding, Rust is also good)

Haha, nice one @Dan, a little bit of Lisp feeling between the lines :wink:

All the nice specialised implementations above that outperform Rust are evidence that the compiler is pretty solid and does some quite nice optimisations. I agree with @tk3369 that Julia should be able to do this kind of high-level crunching too; to be honest, I would be less surprised if Rust couldn’t and Julia could :laughing: That’s why I stuck to the straightforward implementation in my example above, because I thought Rust would be doing something similar.

Anyways, Rust surprises me more often than I’d have thought; I should spend more time on it, I guess.