Suppose I have a large table whose rows look like
3 1 2.5 4.0 10.1
and I know for sure that the first two columns are Ints and the next three are Float64s.
Is there a way to use that information to write a fast parsing function?
So far, I’m coming to something like this:
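function parseline(types, line)
    buf = IOBuffer(line)
    # Read one space-delimited token per expected column.
    tokens = ntuple(length(types)) do _
        readuntil(buf, ' ')
    end
    # Parse each token with its corresponding type.
    map(parse, types, tokens)
end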
julia> parseline((Int, Int, Float64, Float64, Float64), "3 1 2.5 4.0 10.1")
(3, 1, 2.5, 4.0, 10.1)
which works faster than
map((t, token)->parse(t, token), types, split(line))
Is there a fairly easy way to make the operation faster and type-stable?
In general, I don’t know how many columns the table has and which types they have until I read the header, so hardcoding the types is not an option.
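One possible direction, sketched here only as an assumption about what could work, is a function barrier: build a tuple type such as Tuple{Int,Int,Float64} at run time from the header and pass it to a function that takes it as a type parameter, so the inner loop is compiled for those concrete column types. The names parseline_typed and parselines below are hypothetical, and whether the compiler fully infers the result may depend on the Julia version.

```julia
# Hypothetical helper: parse one whitespace-separated line into a tuple
# whose element types are taken from the tuple type T.
function parseline_typed(::Type{T}, line::AbstractString) where {T<:Tuple}
    tokens = split(line)
    n = fieldcount(T)
    length(tokens) == n ||
        throw(ArgumentError("expected $n columns, got $(length(tokens))"))
    # Val(...) lets ntuple unroll over a length known from the type T.
    return ntuple(i -> parse(fieldtype(T, i), tokens[i]), Val(fieldcount(T)))
end

# Function barrier: this method is compiled separately for each concrete T,
# so the loop body works with known column types.
function parselines(::Type{T}, lines) where {T<:Tuple}
    rows = T[]
    for line in lines
        push!(rows, parseline_typed(T, line))
    end
    return rows
end

# The column types are only known at run time, e.g. after reading a header.
coltypes = (Int, Int, Float64, Float64, Float64)
T = Tuple{coltypes...}
rows = parselines(T, ["3 1 2.5 4.0 10.1", "4 2 0.5 1.0 2.0"])
```

The call parselines(T, ...) is dispatched dynamically once per table, which should be negligible next to the per-line work inside it.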
I think this has already been solved by https://github.com/JuliaData/CSV.jl
Thanks, that’s almost exactly what I need!
If I have multiple tables per file (a sequence of frames from an MD simulation), is it possible to make CSV.jl stop reading after a certain number of lines? I’ve tried CSV.File with the limit keyword, but it only seems to affect how many lines will be parsed, not how many will be read.
Since it implements the Tables.jl interface, you can get an iterator with Tables.rows and just collect as many lines from it as you need.
Thanks, I tested that approach. It seems even slower than readline-split-parse (I guess because CSV.Row is not concretely typed).
Another problem is that eager parsing of the whole file is not desirable: a file may contain a long trajectory, and loading it whole may lead to OOM.
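To illustrate the memory concern, here is a minimal sketch using only Base (not CSV.jl) that reads one frame at a time from an open stream, so only the current frame is held in memory. The helper name readframe and the fixed number of rows per frame are assumptions made for the example; a real trajectory file would likely need its own frame header handling.

```julia
# Hypothetical helper: read and parse at most `nrows` lines from `io`,
# leaving the stream positioned at the start of the next frame.
function readframe(io::IO, ::Type{T}, nrows::Integer) where {T<:Tuple}
    rows = T[]
    sizehint!(rows, nrows)
    for _ in 1:nrows
        eof(io) && break              # stop early if the file ends mid-frame
        tokens = split(readline(io))  # only one line is read at a time
        push!(rows, ntuple(i -> parse(fieldtype(T, i), tokens[i]),
                           Val(fieldcount(T))))
    end
    return rows
end

# An in-memory stream stands in for a file opened with open(path).
io = IOBuffer("1 0.5\n2 1.5\n3 2.5\n4 3.5\n")
T = Tuple{Int,Float64}
frame1 = readframe(io, T, 2)  # consumes the first two lines only
frame2 = readframe(io, T, 2)  # continues where the previous call stopped
```

Because each call consumes only the lines it needs, the caller can process and discard a frame before asking for the next one.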