Parsing a bespoke file format

I have a text file with the following structure:

begin"head"
    <head1 content>
end"head"
begin"body"
    <body1 content>
end"body"
begin"foot"
    <foot1 content>
end"foot"
begin"head"
    <head2 content>
end"head"
begin"body"
    <body2 content>
end"body"
begin"foot"
    <foot2 content>
end"foot"
...

Such files are collections of multiple datasets, where each dataset is composed of one head, one body, and one foot.
Naturally, such files should be processed concurrently, with each dataset handled independently.

Is there a way to tell CSV.jl to split the file into chunks at every begin"head"?

Not sure why CSV.jl should read this type of file, but why not consider writing your own file parser?

Agreed, this file looks very non-CSV in format. It's unclear whether the data is even tabular.

@rafael.guerra @StefanKarpinski, yeah, sorry, I hid the only part of the file that actually looks like a CSV.
Basically, anything inside a begin ... end block is composed of items separated by " (yes, somebody actually chose that character as a separator…).

My idea was to use CSV.jl with the transpose option and strip out all the begin...end lines. But since each line carries a different amount of information (the column counts are not consistent), it may indeed not be well suited to such files.
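
Concretely, the idea was something along these lines (an untested sketch; `data.txt` is just a placeholder name):

```julia
using CSV

# Drop the begin"..."/end"..." marker lines and hand the remaining lines to
# CSV.jl with `"` as the delimiter. `quoted=false` tells CSV.jl not to treat
# that same character as a quoting character.
lines = filter(readlines("data.txt")) do line
    !startswith(line, "begin\"") && !startswith(line, "end\"")
end
f = CSV.File(IOBuffer(join(lines, '\n')); delim='"', quoted=false, transpose=true)
```

But as said above, the varying number of fields per line makes this fragile at best.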

I just figured CSV.jl was well developed by now and might be of help for this.

I think this is different enough from CSV that trying to get a CSV reader to process it will be much more pain than help. I would write a loop that processes the file format line by line using regular expressions and split to extract the data.
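
For example, something along these lines (an untested sketch: it assumes every block is opened by begin"name" and closed by end"name", and that fields inside a block are separated by the " character; parse_datasets is just an illustrative name):

```julia
# Collect each head/body/foot group into a Dict, starting a new dataset
# whenever a new begin"head" is encountered.
function parse_datasets(path::AbstractString)
    datasets = Vector{Dict{String, Vector{Vector{String}}}}()
    dataset = Dict{String, Vector{Vector{String}}}()
    current = nothing                        # name of the block we are inside

    for raw in eachline(path)
        line = strip(raw)
        isempty(line) && continue

        m = match(r"^begin\"(\w+)\"$", line)
        if m !== nothing
            current = String(m.captures[1])
            # a new head starts a new dataset
            if current == "head" && !isempty(dataset)
                push!(datasets, dataset)
                dataset = Dict{String, Vector{Vector{String}}}()
            end
            dataset[current] = Vector{String}[]
            continue
        end

        if occursin(r"^end\"\w+\"$", line)
            current = nothing
            continue
        end

        current === nothing && continue      # ignore lines outside any block
        # fields within a block are separated by the `"` character
        push!(dataset[current], String.(split(line, '"')))
    end

    isempty(dataset) || push!(datasets, dataset)
    return datasets
end
```

Once the datasets are collected, each one can be processed independently, e.g. with `Threads.@threads for d in parse_datasets("data.txt") ... end`.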

All right, thanks for the advice!