It seems, however, that this does not scale as well:
julia> s = join(randn(1000), ' ');
julia> @btime parse_numbers($s);
279.502 μs (2959 allocations: 86.08 KiB)
julia> @btime parse_numbers2($s);
402.775 μs (4013 allocations: 250.95 KiB)
The Parsers.jl package is set up to handle these kinds of parsing scenarios quite well. In this case, you can do something like:
parser = Parsers.Delimited(" "; ignorerepeated=true)
io = IOBuffer("1.0 2.34345 7.9")
Parsers.parse(parser, io, Float64) # returns 1.0
Parsers.parse(parser, io, Float64) # returns 2.34345
Parsers.parse(parser, io, Float64) # returns 7.9
It’s a bit tricky to benchmark because it relies on parsing from an IOBuffer, but it’s also very fast because it makes exactly one pass over the string while parsing all three numbers.
Let me know if you have any questions or concerns.
From Parsers.parse(parser, io, Float64) I get Parsers.Result{Float64}(1.0, 9, 0). How do I access 1.0?
julia> fieldnames(Parsers.Result)
(:result, :code, :pos)
julia> a = Parsers.Result{Float64}(1.0, 9, 0);
julia> a.result
1.0
Can this be used to parse a string of delimited mixed types like Integers, Floats and Strings?
Sure thing. Instead of Parsers.parse(parser, io, Float64), just substitute Parsers.parse(parser, io, Int64) or Parsers.parse(parser, io, String).
I was wondering more about a mixed string, like in this question, which hasn’t been answered yet.
For example, take a string of mixed types like “1,apples,oranges,4,5,6.78,bananas”. Is there a way to map it into Int,String,String,Int,Int,Float,String?
It seems that the Parsers solution is in fact slower than the baseline:
julia> using Parsers
julia> function baseline(s)
           pieces = split(s, ' ', keepempty=false)
           map(pieces) do piece
               parse(Float64, piece)
           end
       end
baseline (generic function with 1 method)
julia> function parsers(io::IO, p)
           ret = Float64[]
           while !eof(io)
               x = Parsers.parse(p, io, Float64)
               push!(ret, x.result)
           end
           ret
       end
parsers (generic function with 1 method)
julia> p = Parsers.Delimited(" ", ignorerepeated=true);
julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
1.507161 seconds (12.82 M allocations: 317.485 MiB, 12.21% gc time)
0.406526 seconds (3.29 M allocations: 91.532 MiB)
julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
0.428294 seconds (9.79 M allocations: 169.320 MiB)
0.304757 seconds (3.00 M allocations: 77.665 MiB)
julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
0.683015 seconds (9.83 M allocations: 165.939 MiB, 30.17% gc time)
0.371718 seconds (3.00 M allocations: 77.665 MiB, 21.19% gc time)
julia> N=10^6; s = join(randn(N)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
0.668701 seconds (9.81 M allocations: 165.713 MiB, 30.10% gc time)
0.366405 seconds (3.00 M allocations: 77.665 MiB, 20.59% gc time)
Yeah, unfortunately you’re hitting a currently open issue involving float parsing performance when dealing with full-precision floats (i.e. > 15 digits, which is what you typically get with rand()). You can see the difference by rounding to 15 digits of precision:
julia> N=10^6; s = join(round.(randn(N), digits=15)," "); io = IOBuffer(s); seekstart(io); @time parsers(io, p); @time baseline(s);
0.047656 seconds (24 allocations: 9.001 MiB)
0.355156 seconds (3.00 M allocations: 77.665 MiB, 33.97% gc time)
Yes, the Parsers.jl package helps here by allowing IO-based parsing. So in a case where you have a line in a file like “1,apples,oranges,6.78”, you could have something like:
function parserow(parser, io)
    a = Parsers.parse(parser, io, Int64).result
    b = Parsers.parse(parser, io, String).result
    c = Parsers.parse(parser, io, String).result
    d = Parsers.parse(parser, io, Float64).result
    return (col1=a, col2=b, col3=c, col4=d)
end
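For comparison, the same kind of mixed row can also be handled with Base alone, at the cost of splitting the line first and then parsing each piece. The following is only a minimal sketch: the names parsefield and parse_mixed are hypothetical helpers made up for illustration, not part of Parsers.jl or Base, and the sketch assumes the fields contain no quoted or escaped delimiters.

```julia
# Hypothetical helpers (not part of Parsers.jl or Base).
# Strings are kept as-is; every other type goes through Base.parse.
parsefield(::Type{String}, s::AbstractString) = String(s)
parsefield(::Type{T}, s::AbstractString) where {T} = parse(T, s)

# Split `line` on `delim` and parse field i as types[i].
# Assumes the fields contain no quoted or escaped delimiters.
function parse_mixed(line::AbstractString, types::Tuple; delim=',')
    fields = split(line, delim)
    length(fields) == length(types) ||
        throw(ArgumentError("expected $(length(types)) fields, got $(length(fields))"))
    return ntuple(i -> parsefield(types[i], fields[i]), length(types))
end

row = parse_mixed("1,apples,oranges,4,5,6.78,bananas",
                  (Int, String, String, Int, Int, Float64, String))
```

This makes two passes over the data (one for split, one for parsing), which is the overhead the IO-based approach above is meant to avoid.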
Thanks! I’ll give it a shot.
Hi quinnj,
I’m trying to use Parsers as in the example that you provided:
julia> using Parsers
julia> parser = Parsers.Delimited(" "; ignorerepeated=true)
I get error:
ERROR: UndefVarError: Delimited not defined
Stacktrace:
[1] getproperty(x::Module, f::Symbol)
@ Base .\Base.jl:26
[2] top-level scope
@ REPL[21]:1
Could you please point out what I’m doing wrong here?
Hi @sergeant, the API for the Parsers.jl package went through a big upgrade; it now has a single Parsers.Options struct to hold any configuration options, and you use Parsers.parse or Parsers.tryparse to pass a custom options config and get the parsing result back. So the original example I posted would be more like:
opts = Parsers.Options(delim=' ', ignorerepeated=true)
io = IOBuffer("1.0 2.34345 7.9")
Parsers.parse(Float64, io, opts) # returns 1.0
Parsers.parse(Float64, io, opts) # returns 2.34345
Parsers.parse(Float64, io, opts) # returns 7.9
This fails for Parsers v2.1.1:
julia> opts = Parsers.Options(delim=' ', ignorerepeated=true)
ERROR: ArgumentError: whitespace characters (`wh1=' '` and `wh2='\t'` default keyword arguments) must be different than delim argument
Yeah, you’ll just need to do opts = Parsers.Options(delim=' ', ignorerepeated=true, wh1=0x00)
In the following, Parsers is faster than Base.parse for one line. But when reading a file, it is much slower. (EDIT: Removed useless lines)
"""
    write_test_file(fname, nlines=10_000)

Write `nlines` lines of data to a simplified dimacs file.
"""
function write_test_file(fname, nlines=10_000)
    nmax = 10_000
    open(fname, "w") do io
        for _ in 1:nlines
            (a, b, c) = rand(1:nmax, 3)
            println(io, "a ", a, " ", b, " ", c)
        end
    end
    return nothing
end
test_parse_line1(line::AbstractString, args...) = test_parse_line1(IOBuffer(line), args...)
"""
    test_parse_line1(io::IO, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n", consuming all characters.
The line is taken from the front of the buffer `io`. Parsing is done by `Parsers`.
"""
function test_parse_line1(io::IO, lineno=0)
    char = read(io, Char)
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    from_node = Parsers.parse(Int, io)
    to_node = Parsers.parse(Int, io)
    weight = Parsers.parse(Int, io)
    read(io, Char)  # consume the trailing newline
    return (from_node, to_node, weight)
end
"""
    test_parse_line2(line::AbstractString, lineno=0)

Read three integers from a line of the form "a n1 n2 n3\n".
The line is an entire string.
"""
function test_parse_line2(line::AbstractString, lineno=0)
    char = line[1]
    if char != 'a'
        throw(ErrorException("Expecting line beginning with 'a' in line $(lineno)"))
    end
    (_, x, y, z) = split(line)
    from_node = parse(Int, x)
    to_node = parse(Int, y)
    weight = parse(Int, z)
    return (from_node, to_node, weight)
end
"""
    read_test_file2(path=path)

Read a file a line at a time into a string. Parse
each string into three integers.
"""
function read_test_file2(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while !eof(io)
            line = readline(io)
            res = test_parse_line2(line, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end
"""
    read_test_file1(path=path)

Read lines from a file, parsing each line into three integers.
Pass over each character only once.
"""
function read_test_file1(path=path)
    lcount = 0
    local res
    open(path, "r") do io
        while !eof(io)
            res = test_parse_line1(io, lcount)
            lcount += 1
        end
    end
    println(lcount)
    println(res)
end
julia> write_test_file("tfile.gr", 1_000_000)
julia> xx = "a 1234 223423 3234325\n";
julia> @btime test_parse_line1($xx)
48.260 ns (2 allocations: 128 bytes)
(1234, 223423, 3234325)
julia> @btime test_parse_line2($xx)
392.443 ns (2 allocations: 272 bytes)
(1234, 223423, 3234325)
julia> @time read_test_file1("tfile.gr")
1000000
1000000
(8274, 8890, 965)
6.893523 seconds (3.01 M allocations: 61.368 MiB, 0.06% gc time, 0.32% compilation time)
julia> @time read_test_file2("tfile.gr")
1000000
1000000
(8274, 8890, 965)
0.812866 seconds (6.00 M allocations: 362.354 MiB, 2.91% gc time, 2.30% compilation time)
I tried using test_parse_line1 in read_test_file2. (This requires removing the last read(io, Char) from test_parse_line1, which reads a newline, because the newline is already removed by readline.) Then the file is read and parsed much faster: the time to process the file drops from 0.812 s to 0.28 s. But it is still wasteful, because I am passing over the bytes twice: first I read a line into a string with readline, then I wrap the string in an IOBuffer and send it to test_parse_line1, which uses Parsers.
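To make the "pass over each byte only once" goal concrete, here is a minimal Base-only sketch that extracts integers from a byte vector in a single loop. It is not how Parsers.jl is implemented; parse_ints_single_pass is a hypothetical name, and the sketch assumes non-negative decimal integers, treats every non-digit byte as a separator, and does no overflow checking.

```julia
# Hypothetical single-pass integer extraction (illustration only, not Parsers.jl code).
# Assumes non-negative decimal integers; any non-digit byte acts as a separator.
# No overflow checking is done.
function parse_ints_single_pass(buf::AbstractVector{UInt8})
    out = Int[]
    val = 0
    indigits = false              # true while we are inside a run of digits
    for b in buf
        if UInt8('0') <= b <= UInt8('9')
            val = 10 * val + Int(b - UInt8('0'))
            indigits = true
        elseif indigits
            push!(out, val)       # a digit run just ended
            val = 0
            indigits = false
        end
    end
    indigits && push!(out, val)   # flush a trailing number without separator
    return out
end

nums = parse_ints_single_pass(codeunits("a 1234 223423 3234325\n"))
```

Each byte is inspected exactly once and no intermediate strings are created. The same function can be applied to the result of read(path), which returns a Vector{UInt8} for the whole file.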
You should look into using Parsers.xparse directly, since it will avoid some of the overhead of using Parsers.parse. It also respects a Parsers.Options struct where you can pass the ' ' space delimiter, which will be consumed automatically, as well as handling newlines. You could check out a recent example of how to do this in the upcoming PowerFlowData.jl package. Basically, Parsers.xparse is what Parsers.parse calls under the hood. It’s most efficient if you pass it a raw vector of bytes, obtained either by calling read(filepath) or Mmap.mmap(filepath). You get back a Parsers.Result{T} object from calling Parsers.xparse, which gives you a code, which signifies whether parsing succeeded, whether a newline was encountered, etc.; a val, which is the actual parsed value; and tlen, which is the total number of bytes consumed while parsing (including any delimiters). So the general usage is like:
function parsestuff(file)
    buf = read(file)
    len = length(buf)
    pos = 1
    opts = Parsers.Options(delim=' ', wh1=0x00)
    while pos <= len
        res = Parsers.xparse(Int, buf, pos, len, opts)
        if Parsers.ok(res.code)
            # parsing succeeded, do stuff with res.val
        end
        pos += res.tlen
    end
end
opts = Parsers.Options(delim=' ', ignorerepeated=true, wh1=0x00)
io = IOBuffer("1.0 2.34345 7.9")
Parsers.parse(Float64, io, opts) # returns 1.0
Parsers.parse(Float64, io, opts) # returns 2.34345
Parsers.parse(Float64, io, opts) # returns 7.9
This is not working with the v2.3.1 update! I will try to figure out why not. Some help is appreciated.