I may misunderstand the issue you linked, but I don’t see the example where this happens in practice. It just raises the possibility.
Some formats, e.g. recent incarnations of Stata’s dta format, store UTF-8 in fixed-width fields. The way that works is that the UTF-8 text is just treated as a byte string, which is then padded on write and read back as is. E.g. "ηβπ" would take 6 bytes.
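To make the byte counting concrete, here is a minimal sketch in Python (used only as a neutral illustration language). The helper names `pad_field` and `read_field`, the width of 8, and the space padding are all assumptions for the example, not how dta or any particular library does it.

```python
def pad_field(text: str, width: int, pad: bytes = b" ") -> bytes:
    """Encode text as UTF-8 and pad it to exactly `width` bytes."""
    raw = text.encode("utf-8")
    if len(raw) > width:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field has {width}")
    return raw + pad * (width - len(raw))


def read_field(buf: bytes, pad: bytes = b" ") -> str:
    """Strip the padding bytes and decode what is left as UTF-8."""
    return buf.rstrip(pad).decode("utf-8")


greek = "ηβπ"
encoded_len = len(greek.encode("utf-8"))  # 3 characters, 2 bytes each
field = pad_field(greek, 8)               # hypothetical 8-byte field
roundtrip = read_field(field)
```

The field is always 8 bytes on disk regardless of how many characters it holds, which is the property the reader relies on.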
This is pretty much the only layout that makes sense. “Fixed width” measured in characters of a variable-length encoding like UTF-8 (which is pretty much all that should be practically relevant, even though it is easy to support just about anything else in Julia) makes no sense, as it throws out all the actual advantages of fixed width: the byte offset of a field or record can no longer be computed from its index.
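A small sketch of what gets lost, again in Python and with made-up data: with byte-padded records the n-th record can be read with a single seek, while character-padded records end up with different byte lengths. `WIDTH`, `byte_padded`, and `char_padded` are hypothetical names for this illustration only.

```python
import io

WIDTH = 6  # hypothetical record width


def byte_padded(text: str, width: int = WIDTH) -> bytes:
    """Pad so that the record occupies exactly `width` bytes."""
    raw = text.encode("utf-8")
    return raw + b" " * (width - len(raw))


def char_padded(text: str, width: int = WIDTH) -> bytes:
    """Pad so that the record has exactly `width` characters."""
    return text.ljust(width).encode("utf-8")


rows = ["ηβπ", "abc", "xyz"]

# Byte-padded: record n starts at byte n * WIDTH, so we can seek directly.
byte_file = io.BytesIO(b"".join(byte_padded(r) for r in rows))
byte_file.seek(2 * WIDTH)
third = byte_file.read(WIDTH).decode("utf-8").rstrip()

# Character-padded: byte lengths differ per record, so offsets cannot be
# computed without scanning everything before the record.
char_lengths = [len(char_padded(r)) for r in rows]
```

Here `char_lengths` is not constant across rows, so a reader has to decode from the start of the file (or line) to locate anything.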
Again, I am sure there is someone using that to store data. But it is not something a sane library would even consider supporting because it requires an entirely different approach.
While all the widths should (probably?) indeed be fixed in bytes, from a user POV it makes sense to specify these widths in terms of characters. I.e. if I open the table file in a text editor, I can only see and count characters - not bytes. And these character counts are what should be specified as column widths and positions.
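As a rough illustration of the gap between what one counts in an editor and what is on disk, the sketch below converts a column span given in characters into the corresponding byte span for one line. It is written in Python for neutrality; `char_span_to_byte_span` and the sample line are assumptions for the example and not part of FWF.jl or CSV.jl.

```python
def char_span_to_byte_span(line: str, start: int, stop: int):
    """Map a [start, stop) span in characters to a span in UTF-8 bytes."""
    byte_start = len(line[:start].encode("utf-8"))
    byte_stop = len(line[:stop].encode("utf-8"))
    return byte_start, byte_stop


line = "ηβπ  42"     # in an editor: name in columns 0-4, value in columns 5-6
raw = line.encode("utf-8")

value_by_chars = line[5:7]                        # what the user counted
span = char_span_to_byte_span(line, 5, 7)         # where it sits in the bytes
value_by_bytes = raw[span[0]:span[1]].decode("utf-8")
```

The character span (5, 7) corresponds to the byte span (8, 10) for this particular line. The mapping depends on the content of each line, so a character-based specification has to be resolved per line rather than once per file.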
The issue I linked is just a summary of a discussion in which, if I recall correctly, such files were occurring in practice (if I am not mistaken, they were generated using COBOL on mainframes).
Anyway, FWF.jl used byte widths by default, but this can be switched using a kwarg. I guess the simplest thing for someone interested in having a common FWF reader/writer is to make a PR to RandomString123/FWF.jl (fixed width file parsing in Julia) to make it work on modern Julia.
In the long run, having it in CSV.jl would probably be the best option, if @quinnj would consider this, as there is loads of parsing functionality already in CSV.jl that is vastly superior to FWF.jl.