Is there no standard way to read files with fixed width columns in the new DataFrames ecosystem?


#1

I’ve been using DataFrames.readtable() to read files with fixed width columns, but that function is now deprecated in favor of CSV.read(). However it seems that CSV.read() lacks the flag to “treat consecutive whitespace delimiters as one” that would be required to make it handle fixed width data.

Here’s a sample file from climate science (standardized scenarios of greenhouse gas concentrations called “Representative Concentration Pathways”). More similar files available here. DataFrames.readtable() has no problems with this file:

readtable("RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"; separator=' ', skipstart=38)

But CSV.read() fails since it interprets every whitespace as delimiting a new column.

CSV.read("RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"; delim=' ', header=39)

Is there no standard way to read files with fixed width columns in Julia 0.6.1 (or 0.7 or 1.0)? If so that seems to be a pretty serious shortcoming. I know it’s not too hard to write a fixed width file reader from scratch, but I still think this functionality should be included by default. Or am I missing something?


Ignore consecutive whitespaces with CSV.read(...)
#2

CSV.jl hasn’t been updated yet. JuliaData I/O packages like CSV.jl and Feather.jl are currently being updated.


#3

Using delimited data tools may be a workaround, but they are not right for the job.

It is very easy to roll your own tools for fixed data, see this topic:

That said, fixed width datasets are usually the ugliest to deal with. In the best cases, there is some metadata which tells you the first line of data and the column layout, but there is not widely used standard format (social security data sometimes uses something called DDF, but that is not a well-defined format anyway).

I would simply try to get the data in another format, or use a command line tool like sed to replace contiguous whitespace with , or similar, then read it as CSV.


Julia Support for File Loading
#4

Indeed, reading fixed-width files as CSV may not be the best approach. We need a dedicated function for that.


#5

Here is quick workaround (probably not a general solution), but it works for me (Julia 0.6.1) using the sample file above:

function readfwf(x::String; opts...) 
    d = readdlm(x; opts...)
    z = !isa(d,Tuple) ? DataFrame(d) : names!(DataFrame(d[1]), Symbol.(vec(d[2])))
    return z
end    

x = "RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"
d = readfwf(x, header=true, skipstart=38)

#6

Thanks for the suggestions guys, but I had already solved my problem when I posted (not as elegantly as @Mattriks’ little function though). I was just surprised that some of the functionality of readtable() had fallen through the cracks during the transition to the new DataFrames version and wanted to shine some light on it.


#7

Parsing fixed width columns depending on a variable number of spaces could create (sometimes silent) bugs when there is no space between fields. In the best scenario, this will end up in an error (e.g., an error for having a shorter row).