Is there no standard way to read files with fixed width columns in the new DataFrames ecosystem?

NickNack · November 26, 2017, 5:39pm

I’ve been using DataFrames.readtable() to read files with fixed width columns, but that function is now deprecated in favor of CSV.read(). However it seems that CSV.read() lacks the flag to “treat consecutive whitespace delimiters as one” that would be required to make it handle fixed width data.

Here’s a sample file from climate science (standardized scenarios of greenhouse gas concentrations called “Representative Concentration Pathways”). More similar files available here. DataFrames.readtable() has no problems with this file:

readtable("RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"; separator=' ', skipstart=38)

But CSV.read() fails since it interprets every whitespace as delimiting a new column.

CSV.read("RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"; delim=' ', header=39)

Is there no standard way to read files with fixed width columns in Julia 0.6.1 (or 0.7 or 1.0)? If so that seems to be a pretty serious shortcoming. I know it’s not too hard to write a fixed width file reader from scratch, but I still think this functionality should be included by default. Or am I missing something?

Nosferican · November 26, 2017, 10:42pm

CSV.jl hasn’t been updated yet. JuliaData I/O packages like CSV.jl and Feather.jl are currently being updated.

Tamas_Papp · November 27, 2017, 6:43am

Using delimited data tools may be a workaround, but they are not right for the job.

It is very easy to roll your own tools for fixed data, see this topic:

That said, fixed width datasets are usually the ugliest to deal with. In the best cases, there is some metadata which tells you the first line of data and the column layout, but there is not widely used standard format (social security data sometimes uses something called DDF, but that is not a well-defined format anyway).

I would simply try to get the data in another format, or use a command line tool like sed to replace contiguous whitespace with , or similar, then read it as CSV.

nalimilan · November 27, 2017, 9:57am

Indeed, reading fixed-width files as CSV may not be the best approach. We need a dedicated function for that.

Mattriks · November 27, 2017, 12:02pm

Here is quick workaround (probably not a general solution), but it works for me (Julia 0.6.1) using the sample file above:

function readfwf(x::String; opts...) 
    d = readdlm(x; opts...)
    z = !isa(d,Tuple) ? DataFrame(d) : names!(DataFrame(d[1]), Symbol.(vec(d[2])))
    return z
end    

x = "RCP3PD_MIDYEAR_CONCENTRATIONS.DAT"
d = readfwf(x, header=true, skipstart=38)

NickNack · November 28, 2017, 2:15pm

Thanks for the suggestions guys, but I had already solved my problem when I posted (not as elegantly as @Mattriks’ little function though). I was just surprised that some of the functionality of readtable() had fallen through the cracks during the transition to the new DataFrames version and wanted to shine some light on it.

Diego_Javier_Zea · November 29, 2017, 11:32am

Parsing fixed width columns depending on a variable number of spaces could create (sometimes silent) bugs when there is no space between fields. In the best scenario, this will end up in an error (e.g., an error for having a shorter row).

Topic		Replies	Views
Reading fixed-width files: a preliminary solution General Usage data	5	1897	June 19, 2021
Alternatives to DataFrames.readtable? General Usage	2	711	September 24, 2018
Reading fixed-width files? General Usage	22	3675	March 22, 2020
Current solution for reading Fixed Width files (2024) General Usage data	3	95	August 26, 2024
Ignore consecutive whitespaces with CSV.read(...) Data question	18	20845	October 8, 2019

Is there no standard way to read files with fixed width columns in the new DataFrames ecosystem?

Related topics