Reading fixed-width files?

I have data files (containing parameters for a 3rd-party, proprietary model) that are supplied in fixed-width format. Unfortunately, the format uses Fortran’s fixed field widths, with the effect that some values have no spaces between the numbers, e.g.

  11 99 ...
  11100 ...
  11111 ...

In Fortran (which I haven’t used much since the 90’s…) I would use

    read(line, '(I4)') n

or something along those lines (my memory is vague, but I still know where to look this up. I think… :thinking:)

Is there a package in Julia that allows for easy reading of these files when the field widths are specified? I know CSV.jl will handle fixed-width files, but the documentation seems to indicate this is more along the lines of ignoring multiple spaces between values than handling this type of file.

I could write a custom parser, along the lines of this discussion, but if there is existing code, I would much prefer that to having little pieces of once-off code all over the place.

If there is no existing package for this, I’ll cobble something together and post on Github, should anyone care to use it.

Any pointers/recommendation would be highly appreciated.

CSV.jl

I did look at CSV.jl, specifically here.

Unless I am misreading the text, this allows for multiple spaces, not zero spaces.

I would mmap the whole file, then just parse “views” of the large byte array, e.g. with

https://github.com/JuliaData/WeakRefStrings.jl

I do this for fixed-width data — it is so embarrassingly simple that so far I have not been motivated to package it up.
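As a concrete illustration of the mmap-and-views idea, here is a minimal sketch (the sample data, field widths, and variable names are all assumptions for illustration; WeakRefStrings would let you avoid even the small `String` copies made here):

```julia
using Mmap

# Write the sample data from the top of the thread to a temporary file.
path, io = mktemp()
write(io, "11 99\n11100\n11111\n")
close(io)

bytes = open(Mmap.mmap, path)      # the whole file as one byte vector
linewidth = 6                      # 5 data bytes plus the newline
nrows = length(bytes) ÷ linewidth

rows = map(0:nrows-1) do r
    line = view(bytes, r*linewidth+1 : r*linewidth+5)  # a view, no copy
    a = parse(Int, String(line[1:2]))  # field 1: bytes 1–2
    b = parse(Int, String(line[3:5]))  # field 2: bytes 3–5
    (a, b)
end
# rows == [(11, 99), (11, 100), (11, 111)]
```

Note that `parse(Int, ...)` already tolerates the blank padding inside a field, so no explicit stripping is needed for integers.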


I see. CSV.jl doesn’t really do fixed-width files; it’s just a clever trick that reads multiple delimiters as one, which simulates reading a fixed-width file where there is always at least one delimiter between fields.

Hi @Tamas_Papp, can you elaborate on what you mean by this? I am in a similar boat as OP with my use case, the difference being that I am fairly new to Julia and programming in general, so I don’t quite understand what you’re saying (or rather, I can’t visualize what the implementation would look like). The same goes for your comment in the referenced discussion:

You can just read line by line, extract the columns using indices, then parse. The tricky part is getting the column indices, if you are lucky there is some metadata that describes that (eg US Census Bureau usually supplies such metadata).

I recently made a post about this on Reddit, the relevant part here:

With pd.read_fwf you can set the colspecs kwarg to an array of tuples where each tuple signifies where each column starts and ends. That’s a really nifty feature. It doesn’t seem like any current Julia package has that functionality, but I’d love to be wrong.

In other words, I think having a colspecs-like kwarg in CSV.jl or TableReader.jl would be a really nice addition to either package. Although what you’re recommending might be really simple for more advanced developers like yourself, it can be a bit intimidating for people like me. Moreover, at my job, where we’re having a little tech revolution and trying to move away from Excel, most people know of Python and less so Julia. If someone were to see how easy it is to read fixed-width files with Pandas versus what comes up when you Google the problem for Julia, most people might just opt for Python for this type of use case.

For some more context, I work in insurance where (at my company at least) we rely on a decent amount of legacy technology, so fixed width formats are not uncommon. I think Julia is ideally suited for insurance/actuarial work, and it’s only a matter of time before it becomes an industry standard. However, it currently suffers from a bit of a “recognizability” problem – everyone and their mother have heard of R and Python (my mother has heard of Julia, but that’s because of me). Little niceties like colspecs (which is very Excel/Access-like in its handling of fixed width columns) can go a long way toward driving adoption.

Having said that I’m a novice, what I’m about to ask might seem a bit silly – how difficult would it be for someone like me to submit a PR to TableReader.jl or CSV.jl, or to revive FWF.jl for v1.0 compatibility? A contribution of this sort would surely help me get better at Julia, but it also seems quite beyond my current abilities. @braamvandyk also offered to package this up, so if that offer is still on the table, I can try to help in whatever way I can.

Reading the file line by line and extracting columns is the simplest. See

https://docs.julialang.org/en/v1/manual/networking-and-streams/

Writing an API for this is a nice exercise, and I imagine it would take less than 100 LOC. If you are otherwise programming in Julia, it is not a big deal.
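Sketched under the same assumptions (a made-up helper name and column ranges matching the sample at the top of the thread, ASCII input), the line-by-line version really is only a few lines:

```julia
# Hypothetical helper: `colspecs` holds the character range of each column.
# Assumes ASCII input, so character and byte positions coincide.
parsefwf(path, colspecs) =
    [Tuple(parse(Int, line[r]) for r in colspecs) for line in eachline(path)]

path, io = mktemp()
write(io, "11 99\n11100\n11111\n")
close(io)

rows = parsefwf(path, (1:2, 3:5))
# rows == [(11, 99), (11, 100), (11, 111)]
```

A real API would of course need per-column types and whitespace rules, but this is the core of the exercise.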

I’m also going to be working with (reading and writing) fixed width files a lot more due to some new systems at work. It seems like there’s no package that works well with DataFrames and does both reading and writing? I’d be willing to spend some time this weekend to try a package or pull request to make working with FWF files easier.

Pinging @quinnj/ @bkamins as well - if you had advice on where to start would it be to update FWF.jl? Fork CSV.jl? Something new entirely?

How much functionality would you require, in particular:

  • do you need it to be super fast or just reasonably fast? (if you want it fast probably CSV.jl should be preferred)
  • do you need to have autodetection of column types when reading? (again CSV.jl is very good here)
  • do you need to have autodetection of column widths when reading (this is tricky)? (if you have complex requirements here then probably FWF.jl is a better place to start)
  • do you want FWF to be fixed character widths or fixed codeunit widths? (again, if your requirements here are complex then probably FWF.jl is a better start)

@quinnj - adding the ability to write FWF to CSV.jl should be quick. Adding the ability to read FWF without autodetection of field widths (assuming that column widths are passed explicitly and blanks are stripped) should also be doable in CSV.jl without huge effort - right?


Yes, I agree with all the responses here. CSV.jl has the ignorerepeated=true option, which allows ignoring cases where repeated delimiters occur between fields. It wasn’t meant to be a full fwf solution by any means, but it ends up working in a surprising number of cases (and is very performant as well).

I’ve never used/checked out FWF.jl, but it seems a bit out-of-date just from the readme.

CSV.jl has a lot of pieces that can help here though: CSV.Rows allows iterating the rows of a file pretty efficiently, and CSV.detect allows figuring out the type of a single cell given as a string. It’d be nice to find a way to re-use a bunch of this machinery for a proper fwf solution. If I were approaching it, I’d probably copy a bunch of the CSV.Rows code, pare it down to the core fwf parsing desired, then try to support as many of the additional features of CSV.File as possible (no simple task! but it could be done incrementally over time).


@RandomString123, the owner of FWF.jl, who’s too busy to work on it at the moment, invited me to collaborate on it since I requested to take it on and revive it. It was supposed to be a learning project for me, but I’ve also been too busy with work and haven’t had a chance to work on it all (though I’d still like to, for the practice). If @quinnj can incorporate more robust FWF reading features into CSV.jl, that would probably be the most ideal solution of all.


@quinnj - you do not need to look too deeply into FWF.jl if you would consider adding this functionality to CSV.jl.

There are essentially two things to consider (in reading; writing FWF is relatively simple - the only thing to care about there is whether fields are measured in characters or in bytes):

  • allow CSV.File to get field widths for FWF format and then pass a rule for stripping characters from that field (usually this is whitespace)
  • add a kwarg to CSV.File where you can specify if field widths are specified in characters or in bytes (i.e. if the file should be treated as text or as binary)

(Maybe this should simply be a separate function, CSV.FileFWF?)
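To make those two points concrete, a hypothetical sketch of such an interface (the `readfwf` name and its keyword arguments are made up; this is not actual CSV.jl API):

```julia
# Hypothetical sketch, not actual CSV.jl API: field widths are passed in
# `widths`, and a kwarg chooses whether widths count characters or bytes.
function readfwf(path; widths, chars::Bool = true)
    map(eachline(path)) do line
        pos, fields = 1, String[]
        for w in widths
            if chars   # treat the file as text: map char positions to byte indices
                i = nextind(line, 0, pos)
                j = nextind(line, 0, min(length(line), pos + w - 1))
                push!(fields, strip(line[i:j]))
            else       # treat the file as binary: index code units directly
                cu = codeunits(line)
                push!(fields, strip(String(cu[pos:min(end, pos + w - 1)])))
            end
            pos += w
        end
        fields
    end
end
```

The `chars`/bytes branch is exactly the design decision from the second bullet; everything else is bookkeeping.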


I often need to read fixed-width files, so I implemented a very simple solution: https://gist.github.com/aplavin/224a31ea457b6e0ef0f4c1a20bd28850. I’ve been using it for quite some time now and haven’t had any major issues or missing features - except for the obvious lack of autodetection.

Thank you for sharing it. The tricky part is the following statement in your code:

line[from:min(length(line), to)]

which actually mixes character-based and byte-based indexing assumptions (for ASCII text they are equivalent, but if you go to full UTF-8 they start to differ and your code could simply throw a string indexing error).

That is why I commented above that exactly this part is the only one that really has to be considered carefully in FWF design.
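One way to fix that statement is to convert character positions to byte indices with nextind before slicing (the `charslice` helper name is made up):

```julia
# Character-based slicing that stays valid for multi-byte UTF-8 strings.
function charslice(line::AbstractString, from::Integer, to::Integer)
    i = nextind(line, 0, from)                   # byte index of character `from`
    j = nextind(line, 0, min(length(line), to))  # byte index of character `to`
    return line[i:j]
end

charslice("α-12", 2, 10)   # "-12"; byte-based "α-12"[2:4] throws StringIndexError
```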


Indeed, that’s true. I only encounter ASCII FW tables, and didn’t pay attention to the difference here. TBH, I still don’t quite understand the reason for strings being indexed by bytes instead of characters. Anyway, I would be glad if someone implements a better/more general/more optimal solution, but this is what I got when I needed to read such tables.

You are saying that FWF is hard because, unlike delimited files, you don’t just separate based on the delimiter; you have to pay attention to the definition of a column width? In other words, you have to pay much greater attention to the character encoding?

I think that assuming ASCII is perfectly fine with fixed width files.

Using a variable-length encoding with FW is insane. I almost said that I have never seen it in the wild, then realized that someone, somewhere (probably a poorly-lit basement with ominous music playing in the background) is probably doing it. But this does not mean software should cater to this.

I would consider a library well-designed if it assumed ASCII, treated the bytes as characters, and perhaps verified that all are <= 127 since this operation is extremely cheap.
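For what it’s worth, that verification is essentially free in Julia, which has isascii built in:

```julia
# Cheap ASCII check before trusting byte offsets as column positions.
line = "11100"
@assert isascii(line)                   # built-in: true iff all code units <= 127
@assert all(<=(0x7f), codeunits(line))  # the equivalent explicit check
```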


In my personal experience, limited as it may be, these files are usually generated from Fortran code. I have also never seen one that wasn’t ASCII.

Then again, people who generate files where columns of floating point numbers run into each other and can only be distinguished via the column widths are probably capable of then saving the file as UTF-8 in the name of “progress” :disappointed:

While I would love to see the functionality in CSV.jl, to the point of supporting the ASCII-only approach, I wonder whether the cost of diluting the “it just works” nature of Julia in special cases here and there is worth the benefit. Or am I making more of this than needs be? I probably am.

This is true in the majority of cases. However, I remember that @RandomString123 noted in Handle bigendian generated files · Issue #7 · RandomString123/FWF.jl · GitHub that problems with multi-byte characters can be encountered in practice. Column headers, for example, are the first natural place to expect them.

I definitely agree that if there is to be a reasonably general package for FW tables, it should support characters outside of ASCII. My code linked above doesn’t deal with those for one simple reason: when writing it, I didn’t know that Julia doesn’t transparently support arbitrary characters in string indexing. Actually, I have personally encountered several FW tables with non-ASCII characters: one case was in header names, and another in the data rows themselves (the “plus-minus” sign separating a measurement value from its uncertainty). As I didn’t know Julia well enough, I didn’t understand why my code errored in these cases, so I ended up just removing all non-ASCII characters from the file beforehand. But again, a common package should support any characters; I don’t think that is too difficult, is it?