Parsing strange CSV files

tbeason · March 12, 2020, 10:56pm

I am trying to parse some CSV files that have some extra stuff and actually include multiple different tables. About the only thing that is standardized is that the tables within the file always have the same number of columns. Here is a test file:

This is an annoying way to make csv files.

Hello, here is a silly place to put a line with a comma.

,col2,col3
20010501,1.0,-0.5
20010502,0.1,0.5

Surpise: there is another table here
,col2,col3
20200101,1.0,-0.5
20200102,0.1,0.5

Made by me.

I’m looking for some advice on the best way to parse this programmatically. I do know that if I do something like CSV.read(file;skipto=5,limit=3,header=5) then I can correctly get the first table in this case. However, how could I get those line numbers ahead of time?

Pbellive · March 12, 2020, 11:28pm

I have no idea whether or not there is a nice prepackaged solution for a problem like this. Others may know better. The first idea off the top of my head for dealing with this (assuming that all the tables in the file have the same, known, number of columns:

Write your own code that reads the file line by line, splits each line by commas and identifies the start of a table by the number of items returned after splitting by commas. Then you could add that line and subsequent lines identified as part of the same table to an in memory string. Then use the CSV package to parse your table stored as a single long string in memory. Then keep reading the file line by line until you find another table. Then repeat.

Just a thought off the top of my head.

tbeason · March 12, 2020, 11:39pm

The files are not extremely large so I can do something like

stringarray = readlines(file,keep=true)

without issue. In this case, stringarray[5:7] contains the first table, but I cannot pass this to CSV and string(stringarray[5:7]...) also does not work.

tbeason · March 13, 2020, 12:16am

Wrapping it in an IOBuffer does seem to work…

ios = IOBuffer(string(stringarray[5:7]...))
CSV.read(ios)

Tamas_Papp · March 13, 2020, 9:19am

I am not sure what the spec is, but it seems to be \n\n separating blocks that may or may not be valid CSV tables.

You could read the whole thing into a String, find the block boundaries, then pass on the parts to CSV reader libraries, either wrapped in an IOBuffer or similar, depending on how large the files are and whether you want to fiddle with making this fast.

If there is some rule to skip the comments (eg they are always just one line?) you can filter for them.

FWIW, it may be best to just ask the person generating these files to kindly stop doing it, if nesting multiple tables in a single file with metadata is necessary then CSV is not the best format for this (consider HDF5 or similar).

tbeason · March 13, 2020, 10:58pm

I concur. It is a pretty established data source though, so doubtful that it will change. I’m just trying to provide easier access to the data.

To provide some closure for this thread, I did figure out how to do it. Your suggestion more or less captures the bulk of the method. I’ll be posting the package announcement shortly.

Topic		Replies	Views
Best way to parse custom formatted text file General Usage question	3	1009	April 10, 2020
Parsing a bespoke file format General Usage question , csv	5	644	September 21, 2021
Reading headers of delimited files General Usage csv	1	583	January 3, 2023
What is the fastest way to parse a string of numbers into a tuple / struct with different field types? General Usage	5	755	November 29, 2020
CSV won't read tab separated file General Usage csv	23	651	March 4, 2024

Parsing strange CSV files

Related topics