Fastest way to parse a string of numbers

question

#1

I have a string of space-separated numbers like this:

"1.0    2.34345              7.9"

And I want to parse it. Currently I use the following code for this:

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

using BenchmarkTools
s = join(randn(1000), "  ")

@benchmark parse_numbers(s)
BenchmarkTools.Trial: 
  memory estimate:  116.59 KiB
  allocs estimate:  4912
  --------------
  minimum time:     335.254 μs (0.00% GC)
  median time:      342.025 μs (0.00% GC)
  mean time:        359.650 μs (2.53% GC)
  maximum time:     2.430 ms (82.97% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there a faster way?


#2
using BenchmarkTools

function parse_numbers(s)
    pieces = split(s, ' ', keepempty=false)
    map(pieces) do piece
        parse(Float64, piece)
    end
end

function parse_numbers2(s)
    matches = eachmatch(r"-?\d+\.?\d*", s)
    gen = (parse(Float64, m.match) for m in matches)
    collect(gen)
end

s = "1.0    2.34345              7.9"

@btime parse_numbers($s);
# 2.553 μs
@btime parse_numbers2($s);
# 2.066 μs

#3

It seems, however, that this does not scale as well:

julia> s = join(randn(1000), ' ');

julia> @btime parse_numbers($s);
  279.502 μs (2959 allocations: 86.08 KiB)

julia> @btime parse_numbers2($s);
  402.775 μs (4013 allocations: 250.95 KiB)

#4

The Parsers.jl package is set up to handle these kinds of parsing scenarios quite well. In this case, you can do something like:

parser = Parsers.Delimited(" "; ignorerepeated=true)
io = IOBuffer("1.0    2.34345              7.9")
Parsers.parse(parser, io, Float64) # returns 1.0
Parsers.parse(parser, io, Float64) # returns 2.34345
Parsers.parse(parser, io, Float64) # returns 7.9

It’s a bit tricky to benchmark because it relies on parsing from an IOBuffer, but it’s also very fast because it makes exactly one pass over the string while parsing all three numbers.
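To illustrate what a single pass means here, the following Base-only sketch walks the string once and parses each token as it is found, without first building an array of pieces. It is not how Parsers.jl is implemented; `parse_numbers_onepass` is a hypothetical name, and the sketch assumes the only delimiter is the space character.

```julia
# Hypothetical helper (not part of Parsers.jl): scan the string once and
# parse each space-delimited token in place.
function parse_numbers_onepass(s::AbstractString)
    out = Float64[]
    i = firstindex(s)
    n = lastindex(s)
    while i <= n
        # Skip any run of spaces before the next token.
        while i <= n && s[i] == ' '
            i = nextind(s, i)
        end
        i > n && break
        # Advance j to the first space after the token (or past the end).
        j = i
        while j <= n && s[j] != ' '
            j = nextind(s, j)
        end
        # SubString is a view, so the token itself is not copied.
        push!(out, parse(Float64, SubString(s, i, prevind(s, j))))
        i = j
    end
    return out
end

parse_numbers_onepass("1.0    2.34345              7.9")  # returns [1.0, 2.34345, 7.9]
```

Whether this is actually faster than the `split`-based version would need to be measured on your data.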

Let me know if you have any questions or concerns.


#5

From

Parsers.parse(parser, io, Float64)

I get

Parsers.Result{Float64}(1.0, 9, 0).

How do I access 1.0?


#6
julia> fieldnames(Parsers.Result)
(:result, :code, :pos)
julia> a = Parsers.Result{Float64}(1.0, 9, 0);
julia> a.result
1.0

#7

Can this be used to parse a string of delimited mixed types like Integers, Floats and Strings?


#8

Sure thing. Instead of Parsers.parse(parser, io, Float64), just substitute Parsers.parse(parser, io, Int64) or Parsers.parse(parser, io, String).


#9

I was wondering more about a mixed string, like the one in this question, which hasn’t been answered yet.

For example, a string of mixed types like “1,apples,oranges,4,5,6.78,bananas”. Is there a way to map it into

Int,String,String,Int,Int,Float,String
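One Base-only way to get that kind of mapping, as a minimal sketch: try each field as an Int, then as a Float64, and otherwise keep it as a String. `parse_field` is a hypothetical helper name, and the sketch assumes plain comma-separated fields with no quoting. Note that text such as "NaN" or "Inf" would be read as a Float64 under this rule.

```julia
# Hypothetical helper: infer the type of a single field by trial parsing.
function parse_field(field::AbstractString)
    x = tryparse(Int, field)          # returns nothing if field is not an integer
    x !== nothing && return x
    y = tryparse(Float64, field)      # returns nothing if field is not a float
    y !== nothing && return y
    return String(field)              # fall back to keeping the text
end

row = "1,apples,oranges,4,5,6.78,bananas"
parsed = map(parse_field, split(row, ','))
field_types = map(typeof, parsed)     # element types, in field order
```

Because the element types differ, `parsed` ends up as a `Vector{Any}`; if the column types are known in advance, parsing against a fixed schema avoids that.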


#10

It seems that the Parsers solution is in fact slower than the baseline:

julia> using Parsers

julia> function baseline(s)
           pieces = split(s, ' ', keepempty=false)
           map(pieces) do piece
               parse(Float64, piece)
           end
       end
baseline (generic function with 1 method)

julia> function parsers(io::IO, p)
           ret = Float64[]
           while !eof(io)
               x = Parsers.parse(p, io, Float64)
               push!(ret, x.result)
           end
           ret
       end
parsers (generic function with 1 method)

julia> p = Parsers.Delimited(" ", ignorerepeated=true);

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  1.507161 seconds (12.82 M allocations: 317.485 MiB, 12.21% gc time)
  0.406526 seconds (3.29 M allocations: 91.532 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.428294 seconds (9.79 M allocations: 169.320 MiB)
  0.304757 seconds (3.00 M allocations: 77.665 MiB)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.683015 seconds (9.83 M allocations: 165.939 MiB, 30.17% gc time)
  0.371718 seconds (3.00 M allocations: 77.665 MiB, 21.19% gc time)

julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.668701 seconds (9.81 M allocations: 165.713 MiB, 30.10% gc time)
  0.366405 seconds (3.00 M allocations: 77.665 MiB, 20.59% gc time)
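
One detail to keep in mind when timing IO-based parsers like this: an IOBuffer is consumed as it is read, so every timed run needs a fresh or rewound buffer, and the first `@time` call also includes compilation. The sketch below shows the effect using only Base; `read_floats` is a hypothetical stand-in for an IO-consuming parser, not Parsers.jl API.

```julia
# Hypothetical stand-in for any parser that consumes an IO stream.
function read_floats(io::IO)
    out = Float64[]
    for token in split(read(io, String))   # split on whitespace, dropping empties
        push!(out, parse(Float64, token))
    end
    return out
end

s = join(randn(1000), " ")

io = IOBuffer(s)
first_run = read_floats(io)    # consumes the whole buffer
second_run = read_floats(io)   # buffer is at eof, so nothing is parsed
seekstart(io)
third_run = read_floats(io)    # rewound, so all values are parsed again

# Time repeated runs with a fresh buffer each time. The calls above already
# triggered compilation, so it is not included in these measurements.
times = [@elapsed(read_floats(IOBuffer(s))) for _ in 1:100]
fastest = minimum(times)
```

Taking the minimum over many runs is one common way to reduce noise from garbage collection and other background activity.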

#11

Yeah, unfortunately you’re hitting a currently open issue involving float parsing performance when dealing with full-precision floats (i.e. more than 15 digits, which is what you typically get with randn()). You can see the difference by rounding to 15 digits of precision:

julia> N=10^6; s = join(round.(randn(N), digits=15)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
  0.047656 seconds (24 allocations: 9.001 MiB)
  0.355156 seconds (3.00 M allocations: 77.665 MiB, 33.97% gc time)

#12

Yes, the way the Parsers.jl package helps here is by allowing IO-based parsing. So in a case where you have a line in a file like “1,apples,oranges,6.78”, you could have something like:

function parserow(parser, io)
    a = Parsers.parse(parser, io, Int64).result
    b = Parsers.parse(parser, io, String).result
    c = Parsers.parse(parser, io, String).result
    d = Parsers.parse(parser, io, Float64).result
    return (col1=a, col2=b, col3=c, col4=d)
end
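
For comparison, the same row shape can be handled in plain Base Julia by splitting on the delimiter and converting each field according to a schema. This is only a sketch: `convert_field` and `parserow_base` are hypothetical names, it does not handle quoted fields or embedded commas, and unlike the IO-based version it materializes all the pieces first.

```julia
# Hypothetical helpers: convert one field according to its target type.
convert_field(::Type{String}, field) = String(field)
convert_field(::Type{T}, field) where {T<:Number} = parse(T, field)

# Split a comma-separated line and convert each field using `schema`,
# a tuple of target types in column order.
function parserow_base(line::AbstractString, schema::Tuple)
    fields = split(line, ',')
    length(fields) == length(schema) ||
        throw(ArgumentError("expected $(length(schema)) fields, got $(length(fields))"))
    return map(convert_field, schema, Tuple(fields))
end

parserow_base("1,apples,oranges,6.78", (Int, String, String, Float64))
# returns (1, "apples", "oranges", 6.78)
```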

#13

Thanks! I’ll give it a shot.