I frequently have to read CSVs that mix headers and data. Currently i have a hacky workflow, which could use some help. The header information needs to be parsed, and cannot be discarded.
The files take two general formats:
A: a few lines of header information, with variable number of columns, followed by a section of data with a fixed number of columns.
B: A repeating pattern of header and data sections. something like 10 rows of header, 200 rows of data, 10 rows of header, 200 rows of data…
Current methodology
I’ve dabbled with CSV and DelimitedFiles, but haven’t figures out how to read header lines without discarding them. My current strategy is to run a file IO loop to get the header, then close the file, reopen it with CSV and parse it. This works for the type A files, but i haven’t figured out how to parse the type B files yet.
f = open(filename)
header = [readline(f) for _ in 1:7]
close(f)
## some lines of header parsing, sometimes reading the start row and dimensions of the data section and feeding those parameters to the CSV call
data = CSV.read(filename, DataFrame, skipto=8, header=false)
Questions:
- Is there a way to pass a vector of strings to CSV or DataFrames? Then i’d use readlines to parse the whole file, then pass some portion of the realines vector to CSV or delimitedfiles for parsing.
- is there a way to chunk a file into sections in CSV or delimitedfiles? My experience is that unless i completely isolate the data section ( which i do by dropping lines), the auto-parser is confused by the change of format and will have too few columns and error out.
This is all hackable, and I’ll get through it eventually, though probably in a more manual method than i’d prefer. I figured out the readline method from a blog post from 7 years ago, and general fileIO makes sense for where that is but i was just focused on the delimited packages.
Thanks