What is the fastest way to parse a string of numbers into a tuple / struct with different field types?

Vasily_Pisarev · November 28, 2020, 6:23pm

Suppose I have a large table whose rows look like

3 1 2.5 4.0 10.1

and I know for sure that the first two columns are Ints and the next three are Float64s.
Is there a way to use that information to write a fast parsing function?

So far, I’ve come up with something like this:

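function parseline(types, line)
    buf = IOBuffer(line)
    # Read one space-delimited token per expected column.
    tokens = ntuple(length(types)) do _
        readuntil(buf, ' ')
    end
    # Parse each token with the type given for its column.
    map(parse, types, tokens)
end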

julia> parseline((Int, Int, Float64, Float64, Float64), "3 1 2.5 4.0 10.1")
(3, 1, 2.5, 4.0, 10.1)

which works faster than

map((t, token)->parse(t, token), types, split(line))

Is there a fairly easy way to make the operation faster and type-stable?
In general, I don’t know how many columns the table has and which types they have until I read the header, so hardcoding the types is not an option.
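
One possible direction is a function barrier: build the tuple type once after reading the header, then hand it to an inner function that Julia can specialize on. The sketch below only illustrates that idea under assumptions; `parseline_typed`, `parserows`, and `parsetable` are hypothetical names, the input is assumed to be whitespace-separated, and whether it actually beats the version above would need benchmarking.

```julia
# Hypothetical helper: parse one whitespace-separated line into a tuple whose
# element types are taken from the tuple type T, e.g. Tuple{Int,Int,Float64}.
function parseline_typed(::Type{T}, line::AbstractString) where {T<:Tuple}
    tokens = split(line)
    # The column count is derived from T, so the tuple length is known from the type.
    return ntuple(i -> parse(fieldtype(T, i), tokens[i]), Val(fieldcount(T)))
end

# Inner function: specialized on T, so the loop works with a concrete row type.
function parserows(::Type{T}, lines) where {T<:Tuple}
    rows = Vector{T}()
    for line in lines
        push!(rows, parseline_typed(T, line))
    end
    return rows
end

# Outer function: `types` is only known at run time (e.g. from the header).
# The dynamic dispatch happens once here, not once per line.
function parsetable(lines, types::Tuple)
    T = Tuple{types...}
    return parserows(T, lines)
end

rows = parsetable(["3 1 2.5 4.0 10.1", "4 2 0.5 1.0 2.0"],
                  (Int, Int, Float64, Float64, Float64))
```

The per-line work is unchanged; the intent is that the run-time type lookup is paid once per table rather than once per row.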

dpsanders · November 28, 2020, 6:36pm

I think this has already been solved by CSV.jl (JuliaData/CSV.jl), a utility library for working with CSV and other delimited files in Julia.

Vasily_Pisarev · November 28, 2020, 7:57pm

Thanks, that’s almost exactly what I need!

Vasily_Pisarev · November 29, 2020, 7:42am

If I have multiple tables per file (a sequence of frames from an MD simulation), is it possible to make CSV.jl stop reading after a certain number of lines? I’ve tried CSV.File with the limit keyword but it only seems to affect how many lines will be parsed, not how many will be read.

Tamas_Papp · November 29, 2020, 8:23am

Since it implements the Tables.jl interface, you can get an iterator with Tables.rows and collect only as many rows as you need.

Vasily_Pisarev · November 29, 2020, 1:51pm

Thanks, tested that approach. It seems even slower than readline-split-parse (I guess because CSV.Row is not concretely-typed).
Another problem is that eager parsing of the whole file is not desirable, since a file may contain a long trajectory, and loading it whole may lead to OOM.
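
To illustrate the frame-by-frame requirement, here is a minimal sketch that uses only Base and reads a fixed number of lines from an open stream, leaving the stream positioned at the start of the next frame. It is not how CSV.jl works; `readframe` is a hypothetical helper, and it assumes whitespace-separated rows and a row count known in advance (for example, from the frame header).

```julia
# Hypothetical helper, not part of CSV.jl: read at most `nrows` lines from `io`
# and parse each one into a tuple of type T.
function readframe(io::IO, ::Type{T}, nrows::Integer) where {T<:Tuple}
    rows = Vector{T}()
    sizehint!(rows, nrows)
    for _ in 1:nrows
        eof(io) && break              # stop early if the stream ends mid-frame
        tokens = split(readline(io))  # whitespace-separated fields
        row = ntuple(i -> parse(fieldtype(T, i), tokens[i]), Val(fieldcount(T)))
        push!(rows, row)
    end
    return rows
end

# Usage: two frames of two rows each, read one frame at a time.
io = IOBuffer("3 1 2.5 4.0 10.1\n4 2 0.5 1.0 2.0\n5 3 1.5 2.0 3.0\n6 4 3.5 4.0 5.0\n")
T = Tuple{Int,Int,Float64,Float64,Float64}
frame1 = readframe(io, T, 2)
frame2 = readframe(io, T, 2)
```

Only one frame is held in memory at a time, so a long trajectory does not need to be loaded whole.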
