I’ve been using DataFrames.readtable() to read files with fixed width columns, but that function is now deprecated in favor of CSV.read(). However it seems that CSV.read() lacks the flag to “treat consecutive whitespace delimiters as one” that would be required to make it handle fixed width data.
Here’s a sample file from climate science (standardized scenarios of greenhouse gas concentrations called “Representative Concentration Pathways”). More similar files available here. DataFrames.readtable() has no problems with this file:
Is there no standard way to read files with fixed width columns in Julia 0.6.1 (or 0.7 or 1.0)? If so that seems to be a pretty serious shortcoming. I know it’s not too hard to write a fixed width file reader from scratch, but I still think this functionality should be included by default. Or am I missing something?
Using delimited data tools may be a workaround, but they are not right for the job.
It is very easy to roll your own tools for fixed data, see this topic:
That said, fixed width datasets are usually the ugliest to deal with. In the best cases, there is some metadata which tells you the first line of data and the column layout, but there is not widely used standard format (social security data sometimes uses something called DDF, but that is not a well-defined format anyway).
I would simply try to get the data in another format, or use a command line tool like sed to replace contiguous whitespace with , or similar, then read it as CSV.
Here is quick workaround (probably not a general solution), but it works for me (Julia 0.6.1) using the sample file above:
function readfwf(x::String; opts...)
d = readdlm(x; opts...)
z = !isa(d,Tuple) ? DataFrame(d) : names!(DataFrame(d[1]), Symbol.(vec(d[2])))
return z
end
x = "RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"
d = readfwf(x, header=true, skipstart=38)
Thanks for the suggestions guys, but I had already solved my problem when I posted (not as elegantly as @Mattriks’ little function though). I was just surprised that some of the functionality of readtable() had fallen through the cracks during the transition to the new DataFrames version and wanted to shine some light on it.
Parsing fixed width columns depending on a variable number of spaces could create (sometimes silent) bugs when there is no space between fields. In the best scenario, this will end up in an error (e.g., an error for having a shorter row).