I am trying to parse some CSV files that have some extra stuff and actually include multiple different tables. About the only thing that is standardized is that the tables within the file always have the same number of columns. Here is a test file:
```
This is an annoying way to make csv files.
Hello, here is a silly place to put a line with a comma.
Surpise: there is another table here
Made by me.
```
I’m looking for some advice on the best way to parse this programmatically. I do know that if I do something like `CSV.read(file; skipto=5, limit=3, header=5)` then I can correctly get the first table in this case. However, how could I get those line numbers ahead of time?
I have no idea whether there is a nice prepackaged solution for a problem like this; others may know better. Here is the first idea off the top of my head for dealing with this (assuming that all the tables in the file have the same, known number of columns):
Write your own code that reads the file line by line and splits each line by commas, identifying the start of a table by the number of items returned after splitting. Add that line, and subsequent lines identified as part of the same table, to an in-memory string. Then use the `CSV` package to parse the table stored as that single long string in memory. Keep reading the file line by line until you find another table, and repeat.
Just a thought off the top of my head.
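A minimal sketch of that line-by-line idea (the function name and the column-count heuristic are my own, and the check is naive: a quoted field containing a comma would fool it):

```julia
# Group consecutive lines that split into exactly `ncols` comma-separated
# fields; each such run is treated as one table. Naive heuristic: a quoted
# field containing a comma would break this check.
function find_table_blocks(lines::Vector{String}, ncols::Int)
    blocks = Vector{String}[]   # each element: the raw lines of one table
    current = String[]
    for line in lines
        if length(split(line, ',')) == ncols
            push!(current, line)
        else
            isempty(current) || push!(blocks, current)
            current = String[]
        end
    end
    isempty(current) || push!(blocks, current)
    return blocks
end
```

Each resulting block could then be joined and handed to the CSV reader, e.g. `CSV.read(IOBuffer(join(block, '\n')), DataFrame)`.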
The files are not extremely large, so I can do something like

```julia
stringarray = readlines(file, keep=true)
```

without issue. In this case, `stringarray[5:7]` contains the first table, but I cannot pass this to `CSV.read` directly, and `string(stringarray[5:7]...)` also does not work. Wrapping it in an `IOBuffer` does seem to work:

```julia
ios = IOBuffer(string(stringarray[5:7]...))
```
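To see why the `IOBuffer` wrapping behaves like a file, here is a self-contained toy (the three-element `stringarray` is made-up data standing in for `readlines(file, keep=true)`, where `keep=true` preserves the trailing `\n` on each line):

```julia
# Stand-in for readlines(file, keep=true): keep=true preserves each '\n'
stringarray = ["x,y\n", "1,2\n", "3,4\n"]

# Concatenate the slice into one String and wrap it as an in-memory IO;
# CSV.read should accept `ios` just like a file handle.
ios = IOBuffer(string(stringarray[1:3]...))
readlines(ios)   # → ["x,y", "1,2", "3,4"]
```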
I am not sure what the spec is, but it seems to be `\n\n` separating blocks that may or may not be valid CSV tables. You could read the whole thing into a `String`, find the block boundaries, then pass the parts on to CSV reader libraries, either wrapped in an `IOBuffer` or similar, depending on how large the files are and whether you want to fiddle with making this fast.
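A rough sketch of that block-splitting idea (the sample `content` string and the two-column validity check are my own; real files may need a sturdier test than counting commas):

```julia
# Toy file contents: free-text comment blocks and one 2-column table,
# with blocks separated by blank lines ("\n\n").
content = "some comment line\n\na,b\n1,2\n3,4\n\nanother trailing comment"

blocks = split(content, "\n\n")

# Keep only the blocks where every line has exactly two comma-separated fields.
tables = [b for b in blocks if all(length(split(l, ',')) == 2 for l in split(b, '\n'))]
# tables == ["a,b\n1,2\n3,4"]
```

Each surviving block can then be wrapped in an `IOBuffer` and handed to a CSV reader.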
If there is some rule for skipping the comments (e.g. they are always just one line?), you can filter them out.
FWIW, it may be best to just ask the person generating these files to kindly stop doing it. If nesting multiple tables with metadata in a single file is necessary, then CSV is not the best format for this (consider HDF5 or similar).
I concur. It is a pretty established data source, though, so it is doubtful that it will change. I’m just trying to provide easier access to the data.
To provide some closure for this thread, I did figure out how to do it. Your suggestion more or less captures the bulk of the method. I’ll be posting the package announcement shortly.