Fastest way to parse a string of numbers

I have a string of space-separated numbers like this:

"1.0    2.34345              7.9"

And I want to parse it. Currently I use the following code for this:

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

using BenchmarkTools
s = join(randn(1000), "  ")

@benchmark parse_numbers(s)
BenchmarkTools.Trial: 
  memory estimate:  116.59 KiB
  allocs estimate:  4912
  --------------
  minimum time:     335.254 μs (0.00% GC)
  median time:      342.025 μs (0.00% GC)
  mean time:        359.650 μs (2.53% GC)
  maximum time:     2.430 ms (82.97% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there a faster way?

using BenchmarkTools

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

function parse_numbers2(s)
    matches = eachmatch(r"-?\d+\.?\d*", s)
    gen = (parse(Float64, m.match) for m in matches)
    collect(gen)
end

s = "1.0    2.34345              7.9"

@btime parse_numbers($s);
# 2.553 μs
@btime parse_numbers2($s);
# 2.066 μs
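One caveat with the pattern above: it has no exponent part, so a value printed in scientific notation (Julia prints very small or very large floats this way, e.g. `1.0e-5`) is split into two matches. A minimal sketch of a wider pattern follows. The name `parse_numbers_regex` is hypothetical, and the pattern is an assumption about which input forms matter; it is not benchmarked here.

```julia
# Hypothetical variant of parse_numbers2 whose pattern also accepts an
# optional sign and an optional exponent such as "e-5" or "E+10".
function parse_numbers_regex(s::AbstractString)
    re = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
    return [parse(Float64, m.match) for m in eachmatch(re, s)]
end

# The pattern from parse_numbers2, for comparison.
old_matches(s) = [parse(Float64, m.match) for m in eachmatch(r"-?\d+\.?\d*", s)]

old_matches("1.0e-5")          # [1.0, -5.0]  (split into two numbers)
parse_numbers_regex("1.0e-5")  # [1.0e-5]
```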

It seems, however, that this does not scale as well:

julia> s = join(randn(1000), ' ');

julia> @btime parse_numbers($s);
  279.502 μs (2959 allocations: 86.08 KiB)

julia> @btime parse_numbers2($s);
  402.775 μs (4013 allocations: 250.95 KiB)
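Another variation that may be worth benchmarking stays in Base but iterates over the pieces lazily instead of first building the array that `split` returns. This is only a sketch: it assumes Julia 1.8 or later (where `eachsplit` is available), the name `parse_numbers_lazy` is hypothetical, and no timing is claimed for it.

```julia
# Sketch: parse space-separated floats without materializing the vector of
# substrings that split() would allocate. Requires Julia >= 1.8 for eachsplit.
function parse_numbers_lazy(s::AbstractString)
    out = Float64[]
    for piece in eachsplit(s, ' '; keepempty=false)
        push!(out, parse(Float64, piece))
    end
    return out
end

parse_numbers_lazy("1.0    2.34345              7.9")  # [1.0, 2.34345, 7.9]
```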

The Parsers.jl package is set up to handle these kinds of parsing scenarios quite well. In this case, you can do something like:

parser = Parsers.Delimited(" "; ignorerepeated=true)
io = IOBuffer("1.0    2.34345              7.9")
Parsers.parse(parser, io, Float64) # returns 1.0
Parsers.parse(parser, io, Float64) # returns 2.34345
Parsers.parse(parser, io, Float64) # returns 7.9

It’s a bit tricky to benchmark because it relies on parsing from an IOBuffer, but it’s also very fast because it makes exactly one pass over the string while parsing all three numbers.

Let me know if you have any questions or concerns.

From

Parsers.parse(parser, io, Float64)

I get

Parsers.Result{Float64}(1.0, 9, 0).

How do I access 1.0?

julia> fieldnames(Parsers.Result)
(:result, :code, :pos)
julia> a = Parsers.Result{Float64}(1.0, 9, 0);
julia> a.result
1.0

Can this be used to parse a string of delimited mixed types like Integers, Floats and Strings?

Sure thing. Instead of Parsers.parse(parser, io, Float64), just substitute Parsers.parse(parser, io, Int64) or Parsers.parse(parser, io, String).

I was wondering more about a mixed string like in this question which hasn’t been answered yet

For example, a string of mixed types like “1,apples,oranges,4,5,6.78,bananas”. Is there a way to map it into

Int,String,String,Int,Int,Float,String
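When the column types are not known in advance, one Base-only possibility is to try each type in turn for every field. The sketch below is an assumption about what "map it into" means here; `parse_field` and `parse_mixed` are hypothetical names, and Parsers.jl is not used.

```julia
# Sketch: try Int first, then Float64, and fall back to String.
function parse_field(piece::AbstractString)
    i = tryparse(Int, piece)
    i !== nothing && return i
    f = tryparse(Float64, piece)
    f !== nothing && return f
    return String(piece)
end

# Split on the delimiter and convert every field independently.
parse_mixed(s::AbstractString; delim=',') = [parse_field(p) for p in split(s, delim)]

row = parse_mixed("1,apples,oranges,4,5,6.78,bananas")
map(typeof, row)  # Int, String, String, Int, Int, Float64, String
```

Note that this guesses per field, so text such as "NaN", "Inf" or "1e3" comes back as a Float64 rather than a String.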

It seems that the Parsers solution is in fact slower than the baseline:

julia> using Parsers

julia> function baseline(s)
           pieces = split(s, ' ', keepempty=false)
           map(pieces) do piece
               parse(Float64, piece)
           end
       end
baseline (generic function with 1 method)

julia> function parsers(io::IO, p)
           ret = Float64[]
           while !eof(io)
               x = Parsers.parse(p, io, Float64)
               push!(ret, x.result)
           end
           ret
       end
parsers (generic function with 1 method)

julia> p = Parsers.Delimited(" ", ignorerepeated=true);

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  1.507161 seconds (12.82 M allocations: 317.485 MiB, 12.21% gc time)
  0.406526 seconds (3.29 M allocations: 91.532 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.428294 seconds (9.79 M allocations: 169.320 MiB)
  0.304757 seconds (3.00 M allocations: 77.665 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.683015 seconds (9.83 M allocations: 165.939 MiB, 30.17% gc time)
  0.371718 seconds (3.00 M allocations: 77.665 MiB, 21.19% gc time)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.668701 seconds (9.81 M allocations: 165.713 MiB, 30.10% gc time)
  0.366405 seconds (3.00 M allocations: 77.665 MiB, 20.59% gc time)

Yeah, unfortunately you’re hitting a currently open issue involving float parsing performance when dealing with full-precision floats (i.e. > 15 digits, which is what you typically get with rand()). You can see the difference by rounding to 15 digits of precision:

julia> N=10^6; s = join(round.(randn(N), digits=15)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.047656 seconds (24 allocations: 9.001 MiB)
  0.355156 seconds (3.00 M allocations: 77.665 MiB, 33.97% gc time)
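To make the "more than 15 digits" point concrete, here is a small Base-only illustration of how rounding shortens the printed form of a float. It only shows the string lengths involved and says nothing about Parsers.jl internals.

```julia
x = 0.1 + 0.2

full  = string(x)                    # "0.30000000000000004" (17 significant digits)
short = string(round(x, digits=15))  # "0.3"

# The full-precision string round-trips exactly; the rounded one does not.
parse(Float64, full) == x    # true
parse(Float64, short) == x   # false
```

So rounding is useful for isolating the slow path in a benchmark, but it changes the data being parsed.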

Yes, the Parsers.jl package helps here by allowing IO-based parsing. So in a case where you have a line in a file like “1,apples,oranges,6.78”, you could have something like:

function parserow(parser, io)
    a = Parsers.parse(parser, io, Int64).result
    b = Parsers.parse(parser, io, String).result
    c = Parsers.parse(parser, io, String).result
    d = Parsers.parse(parser, io, Float64).result
    return (col1=a, col2=b, col3=c, col4=d)
end
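For comparison, the same known-schema idea can be sketched with Base only, using `split` instead of IO-based parsing. This swaps the technique (it allocates substrings and is not the Parsers.jl approach), and `parse_as` and `parserow_split` are hypothetical names.

```julia
# Hypothetical helper: convert one text field to the requested type.
parse_as(::Type{T}, piece::AbstractString) where {T<:Number} = parse(T, piece)
parse_as(::Type{String}, piece::AbstractString) = String(piece)

# Split a line and convert field i to types[i]; errors if the counts differ.
function parserow_split(line::AbstractString, types::Tuple; delim=',')
    pieces = split(line, delim)
    length(pieces) == length(types) ||
        throw(ArgumentError("expected $(length(types)) fields, got $(length(pieces))"))
    return ntuple(i -> parse_as(types[i], pieces[i]), length(types))
end

row = parserow_split("1,apples,oranges,6.78", (Int, String, String, Float64))
# row == (1, "apples", "oranges", 6.78)
```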

Thanks! I’ll give it a shot.

Hi, quinnj
I’m trying to use Parsers as in the example that you provided:

julia> using Parsers

julia> parser = Parsers.Delimited(" "; ignorerepeated=true)

I get error:

ERROR: UndefVarError: Delimited not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base .\Base.jl:26
 [2] top-level scope
   @ REPL[21]:1

Could you please point out what I’m doing wrong here?

Hi @sergeant, the API for the Parsers.jl package went through a big upgrade. It now has a single Parsers.Options struct to hold any configuration options, and you call Parsers.parse or Parsers.tryparse with a custom options config to get the parsing result back. So the original example I posted would be more like:

opts = Parsers.Options(delim=' ', ignorerepeated=true)
io = IOBuffer("1.0    2.34345              7.9")
Parsers.parse(Float64, io, opts) # returns 1.0
Parsers.parse(Float64, io, opts) # returns 2.34345
Parsers.parse(Float64, io, opts) # returns 7.9

Thank you @quinnj !

This fails for Parsers v2.1.1

julia> opts = Parsers.Options(delim=' ', ignorerepeated=true)
ERROR: ArgumentError: whitespace characters (`wh1=' '` and `wh2='\t'` default keyword arguments) must be different than delim argument

Yeah, you’ll just need to do opts = Parsers.Options(delim=' ', ignorerepeated=true, wh1=0x00)

In the following, Parsers is faster than Base.parse for one line. But, when reading a file, it is much slower. (EDIT: Removed useless lines)

"""
    write_test_file(fname, nlines=10_000)

Write `nlines` lines of data to a simplified dimacs file.
"""
function write_test_file(fname, nlines=10_000)
    nmax = 10_000
    open(fname, "w") do io
        for _ in 1:nlines
            (a, b, c) = rand(1:nmax, 3)
            println(io, "a ", a, " ", b, " ", c)
        end
    end
    return nothing
end

test_parse_line1(line::AbstractString, args...) = test_parse_line1(IOBuffer(line), args...)

"""
    test_parse_line1(io::IO, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n", consuming all characters.
The line is taken from the front of the buffer `io`. Parsing is done by `Parsers`.
"""
function test_parse_line1(io::IO, lineno=0)
    char = read(io, Char)
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    from_node = Parsers.parse(Int, io)
    to_node = Parsers.parse(Int, io)
    weight = Parsers.parse(Int, io)
    read(io, Char)
    return (from_node, to_node, weight)
end

"""
    test_parse_line2(line::AbstractString, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n".
The line is an entire string.
"""
function test_parse_line2(line::AbstractString, lineno=0)
    char = line[1]
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    (_, x, y, z) = split(line)
    from_node = parse(Int, x)
    to_node = parse(Int, y)
    weight = parse(Int, z)
    return (from_node, to_node, weight)
end

"""
    read_test_file2(path=path)

Read a file a line at a time into a string. Parse
each string into three integers.
"""
function read_test_file2(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while ! eof(io)
            line = readline(io)
            res = test_parse_line2(line, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end


"""
    read_test_file1(path=path)

Read lines from a file, parsing each line into three integers.
Pass over each character only once.
"""
function read_test_file1(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while ! eof(io)
            res = test_parse_line1(io, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end

julia> write_test_file("tfile.gr", 1_000_000)

julia> xx = "a 1234 223423 3234325\n";

julia> @btime test_parse_line1($xx)
  48.260 ns (2 allocations: 128 bytes)
(1234, 223423, 3234325)

julia> @btime test_parse_line2($xx)
  392.443 ns (2 allocations: 272 bytes)
(1234, 223423, 3234325)

julia> @time read_test_file1("tfile.gr")
1000000
1000000
(8274, 8890, 965)
  6.893523 seconds (3.01 M allocations: 61.368 MiB, 0.06% gc time, 0.32% compilation time)

julia> @time read_test_file2("tfile.gr")
1000000
1000000
(8274, 8890, 965)
  0.812866 seconds (6.00 M allocations: 362.354 MiB, 2.91% gc time, 2.30% compilation time)

I tried using test_parse_line1 in read_test_file2. (This requires removing the last read(io, Char) from test_parse_line1, which reads a newline, because readline has already removed it.) With that change the file is read and parsed much faster: the time to process the file drops from 0.812 s to 0.28 s. But it is still wasteful, because I pass over the bytes twice: first readline reads a line into a string, then the string is wrapped in an IOBuffer and sent to test_parse_line1, which uses Parsers.
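One possible way to touch each byte only once is to read the whole file into a byte vector and scan it by hand. The sketch below deliberately replaces Parsers with a small hand-written integer scanner, so it is a different technique from the functions above. It assumes the file fits in memory, that all numbers are non-negative and fit in an Int (no overflow check), and that each line has exactly the form "a n1 n2 n3". The names `scan_int` and `parse_gr_bytes` are hypothetical, and no timing is claimed.

```julia
# Scan one non-negative decimal integer starting at byte index i, skipping
# leading spaces. Returns (value, index of the first byte after the digits).
function scan_int(bytes::AbstractVector{UInt8}, i::Int)
    n = length(bytes)
    while i <= n && bytes[i] == UInt8(' ')
        i += 1
    end
    start = i
    val = 0
    while i <= n && UInt8('0') <= bytes[i] <= UInt8('9')
        val = 10 * val + Int(bytes[i] - UInt8('0'))
        i += 1
    end
    i > start || throw(ArgumentError("expected a digit at byte $start"))
    return val, i
end

# Parse every "a n1 n2 n3" line in a single pass over the bytes.
function parse_gr_bytes(bytes::AbstractVector{UInt8})
    rows = NTuple{3,Int}[]
    n = length(bytes)
    i = 1
    while i <= n
        bytes[i] == UInt8('a') || throw(ArgumentError("expected 'a' at byte $i"))
        a, i = scan_int(bytes, i + 1)
        b, i = scan_int(bytes, i)
        c, i = scan_int(bytes, i)
        push!(rows, (a, b, c))
        # Consume the line ending ("\n" or "\r\n").
        while i <= n && (bytes[i] == UInt8('\n') || bytes[i] == UInt8('\r'))
            i += 1
        end
    end
    return rows
end

parse_gr_bytes(codeunits("a 1234 223423 3234325\n"))  # [(1234, 223423, 3234325)]
```

For a file, `parse_gr_bytes(read("tfile.gr"))` would apply the same scan to the bytes returned by `read`.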