Best way to parse a custom-formatted text file

Hello

I’m trying to parse a custom text file. CSV.File works well, but the file contains a lot of extraneous text, so I first have to read it into memory and work out which rows hold the data I need, and only then pass the file to CSV.File. The import itself works well, but the file gets read twice: once to find the custom tags that mark where the data is, and again by CSV.File.

I wonder whether it’s possible to pass the strings from the first read into CSV.File, so that it processes the data the rest of the way without reading the file again.

By the way, this table has many hundreds of millions of rows. I tried to write a custom parser, but couldn’t get anywhere close to the performance of CSV.File, which is why I’m going this route. If there is another way to process this, suggestions would be appreciated.

The data looks something like this. I only want the data from TABLEX in this case, not TABLEY.

STARTHEADER:
x
t
g
d
s
TABLEX
TABLESTART:
a\tb\tc
integer\tinteger\tinteger\tstring
1\t2\t3\ta
4\t5\t6\tx
TABLEEND:
x
c
v
f
g
s
TABLEY
TABLESTART:
z\ty\tx
string\tstring\tinteger
a\tb\t1
x\ty\t2
TABLEEND_X:

CSV.File apparently accepts an IO stream, which means you could create a structure like:

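struct Parser <: IO
    file::IOStream
    # Open read-only; "r+" would also request write access, which parsing doesn't need.
    Parser(path::AbstractString) = new(open(path, "r"))
end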

Then implement the various IO methods (read, close, etc.) for it. Most would just call the corresponding Base function on the wrapped file. Your read method, however, would read a line of text and determine whether it’s valid: if it is, return that line; otherwise read the next line and check that one, and so on.
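
As a rough sketch of that line-by-line filtering, here is a simpler variant that skips the IO subtype and copies the wanted rows into an in-memory buffer instead. The function name extract_table and the marker handling are illustrative assumptions based on the sample in the question, not part of CSV.jl:

```julia
# Hypothetical helper: copy only the rows of table `name` from `io` into a buffer.
# Assumes the layout from the sample: a line holding the table name, then a
# "TABLESTART:" line, then the rows, then a line starting with "TABLEEND".
function extract_table(io::IO, name::AbstractString)
    out = IOBuffer()
    state = :seek_name                      # :seek_name -> :seek_start -> :copy
    for line in eachline(io; keep=true)     # keep=true preserves the line endings
        text = chomp(line)
        if state === :seek_name
            text == name && (state = :seek_start)
        elseif state === :seek_start
            text == "TABLESTART:" && (state = :copy)
        else
            startswith(text, "TABLEEND") && break   # stop at the end marker
            write(out, line)
        end
    end
    return seekstart(out)                   # rewind so the caller can read from the top
end
```

The returned buffer could then go to CSV with something like `CSV.File(extract_table(io, "TABLEX"); delim='\t', header=1, skipto=3)`, where skipto=3 is meant to skip the row of type names. Note that this copies the whole table into memory, which may be a concern at hundreds of millions of rows.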


I just dealt with a similar issue; you can see what I did here: https://github.com/tbeason/FamaFrenchData.jl

Basically, I read the entire file, determine where the tables live, and then pass those sections to CSV.jl. See the parsefile function in particular. It might not be the most optimized, but the files in my case are not excessively large anyway.

Your problem might even be a bit easier if that is what your file looks like. You have obvious keywords to search for (TABLEX, TABLESTART:, TABLEEND:), so you could scrap a lot of the code that I had to write in my case.
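
One way to search for those keywords without a second pass is to work on the raw bytes. The following is a minimal sketch, assuming Unix line endings, the marker layout from the sample, and Julia 1.6 or later for byte-pattern search; table_byte_range is a made-up helper name:

```julia
# Hypothetical helper: find the byte range of the rows that sit between
# "TABLESTART:" and the next "TABLEEND..." line after the line `name`.
# Returns `nothing` if any of the markers is missing.
function table_byte_range(bytes::Vector{UInt8}, name::AbstractString)
    # Match the table name as a whole line (assumes it is not the first line).
    tag = findfirst(Vector{UInt8}("\n$name\n"), bytes)
    tag === nothing && return nothing

    start = findnext(Vector{UInt8}("TABLESTART:\n"), bytes, last(tag))
    start === nothing && return nothing

    # The newline in front of TABLEEND is the one that ends the last data row.
    stop = findnext(Vector{UInt8}("\nTABLEEND"), bytes, last(start))
    stop === nothing && return nothing

    return (last(start) + 1):first(stop)
end
```

With `bytes = read(path)` (or `Mmap.mmap(path)` from the Mmap standard library to avoid loading the file eagerly), the range can be turned into a view and, assuming your CSV.jl version accepts byte-buffer views as input, passed along as `CSV.File(view(bytes, r); delim='\t', header=1, skipto=3)`.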


Thanks for the suggestions. I haven’t had time to get back to this project, but I’ll report back as soon as I do. I always appreciate the quick responses from this forum!