Reading fixed-width files: a preliminary solution

After browsing through the discussions about reading fixed-width files in Julia (e.g. here and here) I still hadn’t found a solution that was general enough for my case. I wrote a quick one I’d like to share here.

Suppose you have some fixed-width data like this:

A      B                            C
 12345 SOME VERY LONG STRING        T
 23456 ANOTHER VERY LONG STRING     T

First, initialize a DataFrame to receive the data.

using DataFrames
df = DataFrame(A = String[], B = String[], C = String[])

Then define the ranges of each of the columns.

ranges = ((1,7), (8,36), (37,38))

You can then pass them to a function like this that reads the individual lines, extracts the data from the column and structures it into a vector, then appends that vector to the DataFrame.

import Base.Iterators: peel
function readfwf!(source, df, ranges)
    lines = readlines(source)
    (names, lines) = peel(lines) # skip first line
    for row in lines
        data = String[]
        for r in ranges
            push!(data, strip(SubString(row, r[1]:r[2])))
        end
        push!(df, data)
    end
end

There’s obviously no parsing the input strings to convert them to other data types, but that could be added in easily enough. Using this code I was able to construct a DataFrame with 4 columns and over 5 million rows in ~1.7 seconds.

Until this functionality gets added to DelimitedFiles or CSV.jl I hope some members of the community find this useful.

4 Likes

I suggest you write a quick and dirty module for this and post the source code for the module.

Then it would be easier for others to test out your module and if the feedback is positive, it might even be turn into a package.

1 Like

note that you can use:

ranges = (1:7, 8:36, 37:38)

and then also simplify a little bit the code using

for row in lines
    push!(df, [ strip(row[r]) for r in ranges ])
end

Also I think readlines is reading the whole file into memory in advance. You could use eachline instead, which produces an iterator (I may be wrong here).

4 Likes

There is a lot of machinery that CSV parsers have invested into making reading and parsing tabular data super-fast, using lazy mappings and other techniques.

Since fixed width formats already know where each cell starts and ends, a lot of code could be reused. There is some preliminary discussion at

2 Likes

Excellent, appreciate the feedback.

This would be ideal. I looked through the source code for CSV.jl but wasn’t able to pinpoint where exactly the delimiter was used to create cell widths. (I got lost when the delimiter gets passed to Parsers functions, e.g. here.)

Here’s a minimal repo with a short example. It’s very quick and very dirty, but it’s something people can reuse at least.

3 Likes

For my own future reference, and the interest of anyone else.
Not the same problem as the OP, but a similar one, so someone might encounter it when searching.

I have a “fixed” width format where the final column is allowed to contain the delimitter , but none of the others are;
and where I don’t know the width until i get the table, as it pads to make things fit.
But i do know how many columns there are, and their types.

I think this format is common when parsing the output of shell commands like ls -lah or in my case wmctrl
format looks like:


shell> wmctrl -l -G
0x02800003 -1 0    56   3760 2372 Aji @!0,28;BDH
0x02000001  0 324  471  2677 1942 Aji Slack | data | Julia
0x04200003  0 694  77   1880 2372 Aji Reading fixed-width files: a preliminary solution - Usage - JuliaLang — Mozilla Firefox
0x04000010  0 42   30   2832 1604 Aji julia-master /home/oxinabox
0x05400038  0 1532 192  2248 2298 Aji new 1 - Notepadqq
parse_one(T::Type{<:AbstractString}, x::AbstractString) = convert(T, x)
parse_one(::Type{Any}, x) = x
parse_one(T, x) = parse(T, x)
function parse_line(N::Type{<:NamedTuple{<:Any,T}}, line) where {T}
    xs = split(line, ' '; keepempty=false, limit=fieldcount(N))
    return N(map(parse_one, T.parameters, xs))
end


function parse_fixed_trailing(io; columns...)
    names = Tuple(keys(columns))
    types = Tuple{values(columns)...}
    (parse_line(NamedTuple{names,types}, line) for line in eachline(io))
end

Example:

julia> table = parse_fixed_trailing(`wmctrl -l -G`; id=String, desktop=Int, x=Int, y=Int, width=Int, height=Int, pc=String, title=String)
Base.Generator{Base.EachLine{Base.PipeEndpoint}, var"#9#10"{DataType, NTuple{8, Symbol}}}(var"#9#10"{DataType, NTuple{8, Symbol}}(Tuple{String, Int64, Int64, Int64, Int64, Int64, String, String}, (:id, :desktop, :x, :y, :width, :height, :pc, :title)), Base.EachLine{Base.PipeEndpoint}(Base.PipeEndpoint(RawFD(22) open, 0 bytes waiting), Base.var"#687#688"{Base.Process}(Process(`wmctrl -l -G`, ProcessExited(0))), false))

julia> using DataFrames

julia> DataFrame(table)
5×8 DataFrame
 Row │ id          desktop  x      y      width  height  pc      title                             
     │ String      Int64    Int64  Int64  Int64  Int64   String  String                            
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0x02800003       -1      0     56   3760    2372  Aji     @!0,28;BDH
   2 │ 0x02000001        0    324    471   2677    1942  Aji     Slack | data | Julia
   3 │ 0x04200003        0    694     77   1880    2372  Aji     Reading fixed-width files: a pre…
   4 │ 0x04000010        0     42     30   2832    1604  Aji     julia-master /home/oxinabox
   5 │ 0x05400038        0   1532    192   2248    2298  Aji     new 1 - Notepadqq
3 Likes