What is the fastest way to parse a string of numbers into a tuple / struct with different field types?

Suppose I have a large table whose rows look like

3 1 2.5 4.0 10.1

and I know for sure that the first two columns are Ints and the next three are Float64s.
Is there a way to use that information to write a fast parsing function?

So far, I've come up with something like this:

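function parseline(types, line)
    buf = IOBuffer(line)
    # Read one space-delimited token per expected column
    # (assumes exactly one space between columns).
    tokens = ntuple(length(types)) do _
        readuntil(buf, ' ')
    end
    # Parse each token with the type given for its column.
    map(parse, types, tokens)
end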

julia> parseline((Int, Int, Float64, Float64, Float64), "3 1 2.5 4.0 10.1")
(3, 1, 2.5, 4.0, 10.1)

which works faster than

map((t, token)->parse(t, token), types, split(line))

Is there a fairly easy way to make the operation faster and type-stable?
In general, I don’t know how many columns the table has and which types they have until I read the header, so hardcoding the types is not an option.
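One direction that might address both points, as a minimal sketch rather than a tuned implementation: build a tuple type such as Tuple{Int,Int,Float64} at run time from the header, then pass it through a function barrier so the inner parsing code is compiled for that concrete type. The names parseline and parselines below are hypothetical helpers, and the sketch assumes whitespace-separated columns. Whether the compiler fully infers the result should be checked with @code_warntype.

```julia
# Hypothetical helper: T is a tuple type, e.g. Tuple{Int,Int,Float64}.
# Passing the row layout as a type lets the method specialize on it.
function parseline(::Type{T}, line::AbstractString) where {T<:Tuple}
    tokens = split(line)  # splits on any run of whitespace
    n = fieldcount(T)
    length(tokens) == n ||
        throw(ArgumentError("expected $n columns, got $(length(tokens))"))
    # Parse token i with the i-th field type of T.
    return ntuple(i -> parse(fieldtype(T, i), tokens[i]), Val(fieldcount(T)))
end

# Function barrier: the row type is only known at run time,
# but everything inside this function is specialized on T.
function parselines(::Type{T}, lines) where {T<:Tuple}
    rows = T[]
    for line in lines
        push!(rows, parseline(T, line))
    end
    return rows
end

# The column types would normally come from the header; here they are made up.
coltypes = [Int, Int, Float64, Float64, Float64]
rowtype = Tuple{coltypes...}
rows = parselines(rowtype, ["3 1 2.5 4.0 10.1", "4 2 0.5 1.0 7.25"])
```

The dynamic dispatch happens once, at the call to parselines, instead of once per field of every row.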

I think this has already been solved by https://github.com/JuliaData/CSV.jl

Thanks, that’s almost exactly what I need!

If I have multiple tables per file (a sequence of frames from an MD simulation), is it possible to make CSV.jl stop reading after a certain number of lines? I’ve tried CSV.File with the limit keyword but it only seems to affect how many lines will be parsed, not how many will be read.

Since it implements the Tables.jl interface, you can get an iterator with Tables.rows and just collect as many rows as you need.

Thanks, I tested that approach. It seems even slower than readline-split-parse (I guess because CSV.Row is not concretely typed).
Another problem is that eagerly parsing the whole file is not desirable: a file may contain a long trajectory, and loading all of it at once may lead to OOM.
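To illustrate the kind of lazy, frame-by-frame reading I have in mind, here is a minimal sketch using only Base (not CSV.jl). The function readframe is a hypothetical helper, and it assumes every frame has a known number of lines. Since eachline is lazy and Iterators.take stops after nrows items, the rest of the stream stays unread until the next call.

```julia
# Hypothetical helper: read and parse one frame of `nrows` lines from `io`.
# `parserow` is any function that turns a line into a parsed row.
function readframe(parserow, io::IO, nrows::Integer)
    # Only `nrows` lines are consumed; the stream position is left
    # at the start of the next frame.
    return [parserow(line) for line in Iterators.take(eachline(io), nrows)]
end

# Stand-in for a trajectory file with two frames of two rows each.
io = IOBuffer("1 2\n3 4\n5 6\n7 8\n")

frame1 = readframe(io, 2) do line
    parse.(Int, split(line))
end
frame2 = readframe(io, 2) do line
    parse.(Int, split(line))
end
```

Only one frame is held in memory at a time, so a long trajectory can be processed without loading the whole file.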
