Reading headers of delimited files

I frequently have to read CSVs that mix headers and data. Currently i have a hacky workflow, which could use some help. The header information needs to be parsed, and cannot be discarded.

The files take two general formats:
A: a few lines of header information, with variable number of columns, followed by a section of data with a fixed number of columns.
B: A repeating pattern of header and data sections. something like 10 rows of header, 200 rows of data, 10 rows of header, 200 rows of data…

Current methodology

I’ve dabbled with CSV and DelimitedFiles, but haven’t figures out how to read header lines without discarding them. My current strategy is to run a file IO loop to get the header, then close the file, reopen it with CSV and parse it. This works for the type A files, but i haven’t figured out how to parse the type B files yet.

f = open(filename)
header = [readline(f) for _ in 1:7]
close(f)
## some lines of header parsing, sometimes reading the start row and dimensions of the data section and feeding those parameters to the CSV call
data = CSV.read(filename, DataFrame, skipto=8, header=false)

Questions:

  1. Is there a way to pass a vector of strings to CSV or DataFrames? Then i’d use readlines to parse the whole file, then pass some portion of the realines vector to CSV or delimitedfiles for parsing.
  2. is there a way to chunk a file into sections in CSV or delimitedfiles? My experience is that unless i completely isolate the data section ( which i do by dropping lines), the auto-parser is confused by the change of format and will have too few columns and error out.

This is all hackable, and I’ll get through it eventually, though probably in a more manual method than i’d prefer. I figured out the readline method from a blog post from 7 years ago, and general fileIO makes sense for where that is but i was just focused on the delimited packages.

Thanks

Better to just pass the f object directly to CSV.read, since that will begin reading from the current position in the stream:

f = open(filename)
header = first(eachline(f), 7) # read 7 header lines
data = CSV.read(f, DataFrame, header=false)

More generally, you can take any string s (or vector of bytes), representing any chunk of the file you want to read from, and call CSV.read on IOBuffer(s).

Sure, just isolate the section you want and wrap an IOBuffer around it.

I would try to avoid reading the file line-by-line if you can. For example, you could mmap the whole file, then wrap an IOBuffer around a @view of the section of the file you want to read from. You can use StringViews.jl to perform regular-expression searches of a memory-mapped file, or just search for bytes like UInt8('\n') (line endings).

1 Like