Fastest way to parse a string of numbers

jw3126 · October 24, 2018, 8:23am

I have a string of space separated numbers like this:

"1.0    2.34345              7.9"

And I want to parse it. Currently I use the following code for this:

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

using BenchmarkTools
s = join(randn(1000), "  ")

@benchmark parse_numbers(s)

BenchmarkTools.Trial: 
  memory estimate:  116.59 KiB
  allocs estimate:  4912
  --------------
  minimum time:     335.254 μs (0.00% GC)
  median time:      342.025 μs (0.00% GC)
  mean time:        359.650 μs (2.53% GC)
  maximum time:     2.430 ms (82.97% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there a faster way?

mohamed82008 · October 24, 2018, 10:24am

using BenchmarkTools

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

function parse_numbers2(s)
    matches = eachmatch(r"-?\d+\.?\d*", s)
    gen = (parse(Float64, m.match) for m in matches)
    collect(gen)
end

s = "1.0    2.34345              7.9"

@btime parse_numbers($s);
# 2.553 μs
@btime parse_numbers2($s);
# 2.066 μs
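
A caveat on the regex above: Julia prints very small or very large Float64 values in exponent notation (for example, string(1.5e-5) is "1.5e-5"), and the pattern -?\d+\.?\d* has no exponent part, so such a token is split into two matches. The sketch below shows the effect on a hand-written input and one possible extended pattern. The names naive and with_exp are illustrative only, and the extended pattern still does not cover NaN or Inf.

```julia
# Pattern from the post above: sign, digits, optional fraction, no exponent.
naive = r"-?\d+\.?\d*"
# Assumed extension: the same pattern plus an optional exponent part.
with_exp = r"-?\d+\.?\d*(?:[eE][-+]?\d+)?"

s = "1.5e-5 2.0"

# The naive pattern splits "1.5e-5" into "1.5" and "-5".
naive_matches = [m.match for m in eachmatch(naive, s)]

# The extended pattern keeps the exponent attached to its mantissa.
exp_matches = [m.match for m in eachmatch(with_exp, s)]
exp_values = [parse(Float64, m) for m in exp_matches]
```

The split-based parse_numbers does not have this problem, since it hands each whole token to parse(Float64, ...).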

jw3126 · October 24, 2018, 11:39am

It seems, however, that this does not scale as well:

julia> s = join(randn(1000), ' ');

julia> @btime parse_numbers($s);
  279.502 μs (2959 allocations: 86.08 KiB)

julia> @btime parse_numbers2($s);
  402.775 μs (4013 allocations: 250.95 KiB)
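
Both versions above build an intermediate collection of pieces or matches before parsing. A minimal Base-only sketch that walks the string once and parses each token from a SubString view is given below. The name parse_numbers_scan is hypothetical, the only separator assumed is the space character, and no timing is claimed for it, so it would need to be benchmarked against parse_numbers on real data.

```julia
# Sketch: scan the string once, parsing each space-delimited token in place.
# Assumes tokens are separated by one or more ' ' characters only.
function parse_numbers_scan(s::AbstractString)
    out = Float64[]
    i = firstindex(s)
    n = lastindex(s)
    while i <= n
        # Skip any run of spaces before the next token.
        while i <= n && s[i] == ' '
            i = nextind(s, i)
        end
        i > n && break
        # Find the end of the token (first space or end of string).
        j = i
        while j <= n && s[j] != ' '
            j = nextind(s, j)
        end
        # SubString is a view, so the token is not copied.
        push!(out, parse(Float64, SubString(s, i, prevind(s, j))))
        i = j
    end
    return out
end

scan_result = parse_numbers_scan("1.0    2.34345              7.9")
```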

quinnj · October 27, 2018, 3:46am

The Parsers.jl package is set up to handle these kinds of parsing scenarios quite well. In this case, you can do something like:

parser = Parsers.Delimited(" "; ignorerepeated=true)
io = IOBuffer("1.0    2.34345              7.9")
Parsers.parse(parser, io, Float64) # returns 1.0
Parsers.parse(parser, io, Float64) # returns 2.34345
Parsers.parse(parser, io, Float64) # returns 7.9

It’s a bit tricky to benchmark because it relies on parsing from an IOBuffer, but it’s also very fast because it’s making exactly one pass over the strings in parsing all three numbers.

Let me know if you have any questions or concerns.

oliver · October 27, 2018, 4:19am

From

Parsers.parse(parser, io, Float64)

I get

Parsers.Result{Float64}(1.0, 9, 0).

How do I access 1.0?

tlienart · October 27, 2018, 6:08am

julia> fieldnames(Parsers.Result)
(:result, :code, :pos)
julia> a = Parsers.Result{Float64}(1.0, 9, 0);
julia> a.result
1.0

jonjilla · October 27, 2018, 7:31am

Can this be used to parse a string of delimited mixed types like Integers, Floats and Strings?

quinnj · October 27, 2018, 7:39am

Sure thing. Instead of Parsers.parse(parser, io, Float64), just substitute Parsers.parse(parser, io, Int64) or Parsers.parse(parser, io, String).

jonjilla · October 27, 2018, 7:43am

I was wondering more about a mixed string like in this question which hasn’t been answered yet

For example, a string of mixed types like “1,apples,oranges,4,5,6.78,bananas”. Is there a way to map it into

Int,String,String,Int,Int,Float,String
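
When the column types are not known in advance, one Base-only possibility is to try each type in turn with tryparse. The following is a minimal sketch under that assumption. parse_field and parse_mixed are hypothetical helper names, the delimiter is assumed to be a comma with no quoting or escaping, and the result is not type-stable because each element may be an Int, a Float64, or a String.

```julia
# Try Int first, then Float64, and fall back to a plain String.
function parse_field(piece::AbstractString)
    x = tryparse(Int, piece)
    x !== nothing && return x
    y = tryparse(Float64, piece)
    y !== nothing && return y
    return String(piece)
end

# Split on the delimiter and infer a type for every field independently.
parse_mixed(s::AbstractString; delim=',') = map(parse_field, split(s, delim))

row = parse_mixed("1,apples,oranges,4,5,6.78,bananas")
```

Note that tryparse(Float64, ...) also accepts strings such as "NaN" and "Inf", so a text field with one of those values would come back as a float.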

jw3126 · October 27, 2018, 1:10pm

It seems that the Parsers solution is in fact slower than the baseline:

julia> using Parsers

julia> function baseline(s)
           pieces = split(s, ' ', keepempty=false)
           map(pieces) do piece
               parse(Float64, piece)
           end
       end
baseline (generic function with 1 method)

julia> function parsers(io::IO, p)
           ret = Float64[]
           while !eof(io)
               x = Parsers.parse(p, io, Float64)
               push!(ret, x.result)
           end
           ret
       end
parsers (generic function with 1 method)

julia> p = Parsers.Delimited(" ", ignorerepeated=true);

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  1.507161 seconds (12.82 M allocations: 317.485 MiB, 12.21% gc time)
  0.406526 seconds (3.29 M allocations: 91.532 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.428294 seconds (9.79 M allocations: 169.320 MiB)
  0.304757 seconds (3.00 M allocations: 77.665 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.683015 seconds (9.83 M allocations: 165.939 MiB, 30.17% gc time)
  0.371718 seconds (3.00 M allocations: 77.665 MiB, 21.19% gc time)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.668701 seconds (9.81 M allocations: 165.713 MiB, 30.10% gc time)
  0.366405 seconds (3.00 M allocations: 77.665 MiB, 20.59% gc time)

quinnj · October 27, 2018, 2:28pm

Yeah, unfortunately you’re hitting a currently open issue involving float parsing performance when dealing with full-precision floats (i.e. > 15 digits, which is what you typically get with rand()). You can see the difference by rounding to 15 digits of precision:

julia> N=10^6; s = join(round.(randn(N), digits=15)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.047656 seconds (24 allocations: 9.001 MiB)
  0.355156 seconds (3.00 M allocations: 77.665 MiB, 33.97% gc time)
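
To make the "full-precision" distinction concrete, here is a small Base-only illustration using a fixed value instead of random data. The variable names are illustrative. The value 0.1 + 0.2 prints with 17 significant digits, while the same value rounded to 15 decimal places prints as a short string.

```julia
# A value whose shortest round-trip representation needs 17 significant digits.
full = 0.1 + 0.2
full_str = string(full)

# Rounding to 15 decimal places, as in the benchmark above, shortens the text.
rounded = round(full, digits=15)
rounded_str = string(rounded)

# Both strings parse back to the values they were printed from.
roundtrip_full = parse(Float64, full_str)
roundtrip_rounded = parse(Float64, rounded_str)
```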

quinnj · October 27, 2018, 3:04pm

Yes, the Parsers.jl package helps here by allowing IO-based parsing. So in a case where you have a line in a file like “1,apples,oranges,6.78”, you could have something like:

function parserow(parser, io)
    a = Parsers.parse(parser, io, Int64).result
    b = Parsers.parse(parser, io, String).result
    c = Parsers.parse(parser, io, String).result
    d = Parsers.parse(parser, io, Float64).result
    return (col1=a, col2=b, col3=c, col4=d)
end

jonjilla · October 28, 2018, 3:25am

Thanks! I’ll give it a shot.

sergeant · September 2, 2021, 4:13pm

Hi quinnj,
I'm trying to use Parsers as in the example that you provided:

julia> using Parsers

julia> parser = Parsers.Delimited(" "; ignorerepeated=true)

I get an error:

ERROR: UndefVarError: Delimited not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base .\Base.jl:26
 [2] top-level scope
   @ REPL[21]:1

Could you please point out what I'm doing wrong here?

quinnj · September 7, 2021, 11:23pm

Hi @sergeant, the API for the Parsers.jl package went through a big upgrade. It now has a single Parsers.Options struct to hold any configuration options, and you call Parsers.parse or Parsers.tryparse with a custom options config to get the parsing result back. So the original example I posted would be more like:

opts = Parsers.Options(delim=' ', ignorerepeated=true)
io = IOBuffer("1.0    2.34345              7.9")
Parsers.parse(Float64, io, opts) # returns 1.0
Parsers.parse(Float64, io, opts) # returns 2.34345
Parsers.parse(Float64, io, opts) # returns 7.9

sergeant · September 8, 2021, 10:25am

Thank you @quinnj !

jlapeyre · November 5, 2021, 2:46am

This fails for Parsers v2.1.1

julia> opts = Parsers.Options(delim=' ', ignorerepeated=true)
ERROR: ArgumentError: whitespace characters (`wh1=' '` and `wh2='\t'` default keyword arguments) must be different than delim argument

quinnj · November 5, 2021, 2:54am

Yeah, you’ll just need to do opts = Parsers.Options(delim=' ', ignorerepeated=true, wh1=0x00)

jlapeyre · November 5, 2021, 8:56pm

In the following, Parsers is faster than Base.parse for a single line, but it is much slower when reading a file. (EDIT: Removed useless lines)

"""
    write_test_file(fname, nlines=10_000)

Write `nlines` lines of data to a simplified dimacs file.
"""
function write_test_file(fname, nlines=10_000)
    nmax = 10_000
    open(fname, "w") do io
        for _ in 1:nlines
            (a, b, c) = rand(1:nmax, 3)
            println(io, "a ", a, " ", b, " ", c)
        end
    end
    return nothing
end

test_parse_line1(line::AbstractString, args...) = test_parse_line1(IOBuffer(line), args...)

"""
    test_parse_line1(io::IO, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n", consuming all characters.
The line is taken from the front of the buffer `io`. Parsing is done by `Parsers`.
"""
function test_parse_line1(io::IO, lineno=0)
    char = read(io, Char)
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    from_node = Parsers.parse(Int, io)
    to_node = Parsers.parse(Int, io)
    weight = Parsers.parse(Int, io)
    read(io, Char)
    return (from_node, to_node, weight)
end

"""
    test_parse_line2(line::AbstractString, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n".
The line is an entire string.
"""
function test_parse_line2(line::AbstractString, lineno=0)
    char = line[1]
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    (_, x, y, z) = split(line)
    from_node = parse(Int, x)
    to_node = parse(Int, y)
    weight = parse(Int, z)
    return (from_node, to_node, weight)
end

"""
    read_test_file2(path=path)

Read a file a line at a time into a string. Parse
each string into three integers.
"""
function read_test_file2(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while ! eof(io)
            line = readline(io)
            res = test_parse_line2(line, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end


"""
    read_test_file1(path=path)

Read lines from a file, parsing each line into three integers.
Pass over each character only once.
"""
function read_test_file1(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while ! eof(io)
            res = test_parse_line1(io, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end

julia> write_test_file("tfile.gr", 1_000_000)

julia> xx = "a 1234 223423 3234325\n";

julia> @btime test_parse_line1($xx)
  48.260 ns (2 allocations: 128 bytes)
(1234, 223423, 3234325)

julia> @btime test_parse_line2($xx)
  392.443 ns (2 allocations: 272 bytes)
(1234, 223423, 3234325)

julia> @time read_test_file1("tfile.gr")
1000000
1000000
(8274, 8890, 965)
  6.893523 seconds (3.01 M allocations: 61.368 MiB, 0.06% gc time, 0.32% compilation time)

julia> @time read_test_file2("tfile.gr")
1000000
1000000
(8274, 8890, 965)
  0.812866 seconds (6.00 M allocations: 362.354 MiB, 2.91% gc time, 2.30% compilation time)

jlapeyre · November 5, 2021, 10:35pm

I tried using test_parse_line1 in read_test_file2. (This requires removing the last read(io, Char) from test_parse_line1, which reads a newline, because the newline is already removed by readline.) Then the file is read and parsed much faster: the time to process the file drops from 0.812 s to 0.28 s. But it is still wasteful, because I am passing over the bytes twice: first reading a line into a string with readline, then wrapping the string in an IOBuffer and sending it to test_parse_line1, which uses Parsers.
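
For comparison, the single-pass idea can also be sketched in Base alone by reading bytes directly from the IO. This is a hand-rolled illustration, not the Parsers.jl implementation. read_uint and parse_line_onepass are hypothetical names, and the sketch assumes non-negative integers that fit in an Int, a single "\n" line ending, and spaces as the only separator. No timing is claimed for it.

```julia
# Read one non-negative decimal integer from `io`, skipping leading spaces.
# Stops at the first non-digit byte without consuming it. No overflow check.
function read_uint(io::IO)
    while !eof(io) && peek(io) == UInt8(' ')
        read(io, UInt8)
    end
    n = 0
    count = 0
    while !eof(io)
        b = peek(io)
        UInt8('0') <= b <= UInt8('9') || break
        read(io, UInt8)
        n = 10n + Int(b - UInt8('0'))
        count += 1
    end
    count == 0 && error("expected a digit")
    return n
end

# Parse one line of the form "a n1 n2 n3\n", consuming it from `io`.
function parse_line_onepass(io::IO)
    read(io, Char) == 'a' || error("expected line to start with 'a'")
    x = read_uint(io)
    y = read_uint(io)
    z = read_uint(io)
    # Consume the trailing newline if there is one.
    if !eof(io) && peek(io) == UInt8('\n')
        read(io, UInt8)
    end
    return (x, y, z)
end

io = IOBuffer("a 1234 223423 3234325\na 1 2 3\n")
first_line = parse_line_onepass(io)
second_line = parse_line_onepass(io)
```

The same function accepts the stream returned by open, so it could replace test_parse_line1 inside read_test_file1 for a like-for-like timing.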
