Reading fixed-width files: a preliminary solution

After browsing through the discussions about reading fixed-width files in Julia (e.g. here and here) I still hadn’t found a solution that was general enough for my case. I wrote a quick one I’d like to share here.

Suppose you have some fixed-width data like this:

A      B                            C
 12345 SOME VERY LONG STRING        T
 23456 ANOTHER VERY LONG STRING     T

First, initialize a DataFrame to receive the data.

using DataFrames
df = DataFrame(A = String[], B = String[], C = String[])

Then define the ranges of each of the columns.

ranges = ((1,7), (8,36), (37,38))

You can then pass them to a function like this, which reads the individual lines, extracts the data for each column into a vector, and appends that vector to the DataFrame.

import Base.Iterators: peel
function readfwf!(source, df, ranges)
    lines = readlines(source)
    (header, lines) = peel(lines) # skip the header line
    for row in lines
        data = String[]
        for r in ranges
            # clamp the upper bound so rows without trailing padding don't throw
            push!(data, strip(SubString(row, r[1]:min(r[2], lastindex(row)))))
        end
        push!(df, data)
    end
end
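To make that concrete, here is a self-contained sketch of a run on the sample data above (the IOBuffer, the exact ranges, and the clamp on the last range are my additions; as printed, the sample rows end at column 37, so an unclamped 37:38 slice would throw a BoundsError):

```julia
using DataFrames
import Base.Iterators: peel

function readfwf!(source, df, ranges)
    lines = readlines(source)
    (header, lines) = peel(lines)  # skip the header line
    for row in lines
        data = String[]
        for r in ranges
            # clamp the upper bound so rows without trailing padding don't throw
            push!(data, strip(SubString(row, r[1]:min(r[2], lastindex(row)))))
        end
        push!(df, data)
    end
    return df
end

# the sample data from above, as an in-memory source
io = IOBuffer("A      B                            C\n" *
              " 12345 SOME VERY LONG STRING        T\n" *
              " 23456 ANOTHER VERY LONG STRING     T\n")

df = DataFrame(A = String[], B = String[], C = String[])
readfwf!(io, df, ((1, 7), (8, 36), (37, 38)))
# df now holds two rows: A = ["12345", "23456"], C = ["T", "T"]
```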

There’s obviously no parsing of the input strings to convert them to other data types, but that could be added easily enough. Using this code I was able to construct a DataFrame with 4 columns and over 5 million rows in ~1.7 seconds.
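For instance, one minimal way to bolt on type conversion (a sketch; parsecell and the types tuple are hypothetical names, not part of the code above) is to dispatch on a target type per column:

```julia
# hypothetical helper: convert a stripped field to a target column type
parsecell(::Type{String}, s::AbstractString) = String(s)
parsecell(::Type{T}, s::AbstractString) where {T} = parse(T, s)

row = " 12345 SOME VERY LONG STRING        T"
ranges = (1:7, 8:36, 37:37)
types = (Int, String, String)  # one target type per column
vals = [parsecell(t, strip(row[r])) for (t, r) in zip(types, ranges)]
# vals == Any[12345, "SOME VERY LONG STRING", "T"]
```

With something like this, the DataFrame columns can be declared as Int[] etc. and the converted vector pushed as before.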

Until this functionality gets added to DelimitedFiles or CSV.jl I hope some members of the community find this useful.


I suggest you write a quick and dirty module for this and post the source code for the module.

Then it would be easier for others to test out your module, and if the feedback is positive, it might even turn into a package.


Note that you can use:

ranges = (1:7, 8:36, 37:38)

and then also simplify the code a bit using:

for row in lines
    push!(df, [ strip(row[r]) for r in ranges ])
end

Also, I think readlines reads the whole file into memory in advance. You could use eachline instead, which produces a lazy iterator (I may be wrong here).
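Right: readlines materializes the whole file as a Vector{String}, while eachline returns a lazy iterator. A streaming variant of the function above could look like this (a sketch; readfwf_streaming! is my name, and Iterators.drop replaces peel for skipping the header):

```julia
using DataFrames

function readfwf_streaming!(source, df, ranges)
    # eachline yields one line at a time instead of materializing the file
    for row in Iterators.drop(eachline(source), 1)  # drop the header line
        push!(df, [strip(row[r]) for r in ranges])
    end
    return df
end

io = IOBuffer("A   B \n 12 XY\n 34 ZW\n")
df = DataFrame(A = String[], B = String[])
readfwf_streaming!(io, df, (1:4, 5:6))
# df.A == ["12", "34"], df.B == ["XY", "ZW"]
```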


There is a lot of machinery that CSV parsers have invested into making reading and parsing tabular data super-fast, using lazy mappings and other techniques.

Since fixed width formats already know where each cell starts and ends, a lot of code could be reused. There is some preliminary discussion at


Excellent, appreciate the feedback.

This would be ideal. I looked through the source code for CSV.jl but wasn’t able to pinpoint where exactly the delimiter was used to create cell widths. (I got lost when the delimiter gets passed to Parsers functions, e.g. here.)

Here’s a minimal repo with a short example. It’s very quick and very dirty, but it’s something people can reuse at least.
