Julia Support for File Loading

As a first learning project in Julia I have started to convert a project I did in R into Julia, and I quickly ran into a couple of roadblocks, all of which are related to reading in legacy data files like the ones used in the project I selected. While these features are most likely not needed for forward-looking applications, if Julia is looking to target industries that are heavy into COBOL, like government or banking, then supporting them will really help. I have reviewed the CSV package, which I feel is the most robust way to read file-based data in Julia, and I think adding these features to it could be fairly easy, but first I wanted to check that the functionality I need does not already exist elsewhere in Julia, as my knowledge is still very entry level.

The files at issue are fixed-width files, where fields are not delimited by any character but are instead always the same width on every line. This snippet is an example of such a file.

000167116NAME WITHHELD BY AGENCY19730930AF**#########30-34 **GS055-9   0322C007951*15F1
000205258NAME WITHHELD BY AGENCY19730930AF**#########30-34 08GS0515-19 0305C009236810F1
000223093NAME WITHHELD BY AGENCY19730930AF**#########20-24 **WG01< 1   7408B006781***F2
000226198NAME WITHHELD BY AGENCY19730930AF***********35-39 ******< 1   ****       8  *2
000246521NAME WITHHELD BY AGENCY19730930AF***********20-24 ******< 1   ****       8  *2

Julia is missing capabilities that would make parsing this sort of file a couple of lines of code instead of multiple pages. These features are:

  • Support for fixed field width parsing in addition to delimited parsing
  • Support for a list of NA/null values that are not just empty fields. In the above sample, ** and ######### are simple, but not the only, examples.
  • Support for specifying the date format on a per-field basis. A single file might contain different formats like YYYYMMDD, YYYYMM, or DDMMYYYY.

I will most likely look at writing some of this functionality myself for this project but wanted to check with the community to hear thoughts on how to support older datasets like this one.


You’ll probably be interested in Reading Fixed-Width Column Data.

Julia is perfectly capable of parsing this in a few lines of code without any special library (you read lines, extract fields, parse them). E.g. with something like

using Missings

# pre-1.0 Julia: tryparse returns a Nullable, so map failures to missing
null_to_missing(x) = isnull(x) ? missing : unsafe_get(x)
parsefield(::Type{Int}, string) = null_to_missing(tryparse(Int, string))
parsefield(::Type{String}, string) = copy(string)
parsefield(df::DateFormat, string) = null_to_missing(tryparse(Date, string, df))

then iterating on

colspecs = (1:12 => Int,
            13:32 => String,
            33:40 => DateFormat("yyyymmdd"))

You can filter for missing values (which literally could be anything in these kinds of datasets), use generated functions for iteration, etc.
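
For concreteness, here is a minimal sketch of that iteration written for current Julia (1.x, where tryparse returns nothing rather than a Nullable), assuming single-byte (ASCII) data so character positions equal byte positions; the column boundaries and the data.txt file name are just illustrative guesses based on the sample above, not an actual codebook:

using Dates

# current Julia: tryparse returns `nothing` on failure, so map that to missing
nothing_to_missing(x) = x === nothing ? missing : x
parsefield(::Type{Int}, s) = nothing_to_missing(tryparse(Int, s))
parsefield(::Type{String}, s) = String(strip(s))
parsefield(df::DateFormat, s) = nothing_to_missing(tryparse(Date, s, df))

# illustrative column spec: character ranges paired with target types/formats
colspecs = (1:9   => Int,
            10:32 => String,
            33:40 => DateFormat("yyyymmdd"))

# parse one line into a tuple of fields, then map over every line of the file
parseline(line) = map(((r, t),) -> parsefield(t, SubString(line, r)), colspecs)
rows = [parseline(line) for line in eachline("data.txt")]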

The difficult part is establishing the field boundaries. There is no reliable algorithmic way to do this, so these kinds of files usually come with specs. If you are unlucky, it is a scan of a bunch of typewritten pages from the 1970s. If you are a bit more lucky, it is something semi-standard like DDF. You may find this topic and the one linked there useful.

Reading in column spec formats would be a great addition to the ecosystem. The above-mentioned DDF is used in the social sciences; I am sure other fields have their own formats. R seems to support some of these.

I think your post is a good place to start. I agree the hard part is the field boundaries, but someone looking to parse fixed-width files hopefully has a spec, and I do in this case.

Of course I could take the easy way out and just convert the files to an rds and import that. But that removes the fun of learning :grinning:

One thing to be very careful about here: those column specs are character-based, not code-unit-based.
Most likely, the original file was written with some 8-bit character set, and if you do manage to get it converted to UTF-8, or if it already has been, you’ll run into problems because the indices of the fields in Julia will be variable, depending on the data.

That’s why using something that will always give you character-based addressing (such as the UniStr type in the Strs.jl package, as soon as that is finished, which I believe will be shortly) will work much better for this sort of processing.

I have never seen anything but 7-bit ASCII in fixed-width files, but if that is a concern, one could read a Vector{UInt8} and extract fields from there, converting to UTF-8 as necessary. So I don’t quite see why/how I would use your package in this context.
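
A minimal sketch of that byte-level approach (the file name and field boundaries are placeholders, and the final String conversion assumes the field bytes are already valid UTF-8):

# read each record as raw bytes (no string decoding), slice the fields by
# byte offset, and only convert each field to a String at the end
function read_fixed_width_bytes(path, colranges)
    rows = Vector{Vector{String}}()
    open(path) do io
        while !eof(io)
            bytes = readuntil(io, UInt8('\n'))   # one record as Vector{UInt8}
            !isempty(bytes) && bytes[end] == UInt8('\r') && pop!(bytes)  # tolerate CRLF
            isempty(bytes) && continue
            push!(rows, [String(bytes[r]) for r in colranges])
        end
    end
    return rows
end

rows = read_fixed_width_bytes("data.txt", [1:9, 10:32, 33:40])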

How do you think people stored names for people outside of English speaking countries?

How would that even work? If the file has already been converted from whatever 8-bit original character set it was in to UTF-8, each line would be a variable number of bytes, depending on how many characters ended up being encoded in 2-4 bytes. You’re just pushing the work of handling the complexities of UTF-8 encoding onto all programmers.

You would do something like readlines(file, UniStr), and then the fields could be accessed simply by getting substrings, e.g. lines[295][13:32] would return the string field from characters 13-32 in line 295.

It would also work with standard String; you just need to call nextind/prevind and pass the number of characters to move by. But the OP’s example doesn’t contain any non-ASCII characters, and no such requirement is mentioned either, so it’s not obvious it will be needed for this particular application.
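
A minimal sketch of what that looks like with nextind on a plain String (the helper name is made up for illustration):

# extract characters `from` through `to` (counted in characters, not bytes)
# from a String, which stays correct even with multi-byte UTF-8 data
function charfield(line::AbstractString, from::Integer, to::Integer)
    i = nextind(line, 0, from)   # byte index of the `from`-th character
    j = nextind(line, 0, to)     # byte index of the `to`-th character
    return line[i:j]
end

charfield("000167116NAME WITHHELD BY AGENCY19730930", 13, 32)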


Although he didn’t mention non-ASCII data specifically, he was asking about the lack of general capabilities in Julia to handle this sort of problem, important for banking and government, and he did not say he was interested in dealing with only ASCII data (even in the U.S., with the number of Spanish names with accents in them, or Canada, with French names, being able to handle at least ISO-8859-1 or CP-1252 would be critical).

If you have code that works in R, you could just use it combined with the RCall package to get the data into Julia. It’s really easy. An example would look like this.

using RCall, DataFrames
R"""
# R code which loads the data into a data frame (or similar object) called B
"""
@rget B
B # B is now a DataFrame in Julia

I’ve been using this strategy myself. R is very good at importing data, particularly with packages like haven. Combined with RCall, I can do the initial processing in R but use Julia for the complicated parts.


In this case the data is all ASCII. I am going to work on a quick framework to solve the loading problem at a high level. For now I would just use typical mechanisms to solve the problem, but I hope to build it in such a way that whatever converts the raw byte data to a character string is also capable of converting to the proper character set in the cases where that is needed.

Tamas is probably thinking about files with something like CP-1250 encoding.

It is a question of readability and performance.

help?> readline
search: readline readlines readlink

  readline(io::IO=STDIN; keep::Bool=false)
  readline(filename::AbstractString; keep::Bool=false)

  Read a single line of text from the given I/O stream or file (defaults to STDIN). When reading from a file, the text is assumed to be encoded in UTF-8. 

Those few lines of code will grow if you think more about the problem.

Not really. I must have read about a TB of fixed-width data with <100 LOC. The challenging part is not reading and parsing, but where the data ends up: for small datasets I use a DataFrame, otherwise custom solutions.

I believe you! :slight_smile:

But I think that you are probably thinking like a researcher who needs to read a few data sources (no matter how big they are) from time to time, where importing data is a non-repeating, single-pass task.

This topic is about doing a “COBOL like government or banking” job, where there are plenty of files with different structures, encodings, date formats, decimal signs, etc.

IT industry work involves a mix of systems where the transition period is long, and these kinds of files are and will be generated continuously for several years (maybe “forever”). I understand the need for a simply configurable tool to read, transform, and save/upload/insert_to_db…

It may not be very easy to write (and maintain) a tool with good performance that is easy to use for these kinds of tasks. But it is probably a good idea, and easy enough for a first learning project! :slight_smile:


I have a somewhat trivial, unpublished package along the lines above which I could polish up and make available, possibly when v0.7 comes out (because I find the new String framework nice).

I have not made it public because it does not address the following questions:

  1. how to get the column layout (the elephant in the room; most of my errors come from miscoding that),
  2. how much one can rely on characters mapping to bytes. It currently assumes that everything is ASCII.
  3. fancy things like column autodetection.

IMO fixed-width data does not make sense with encodings like UTF-8, and column autodetection can lead to nasty surprises (e.g. detecting on the first N elements, then finding that in row K ≫ N someone had the bright idea of using XXXXXX00 for missing values without mentioning it in the codebook). So I am currently using a full two-pass approach for ingesting data, which is not the style followed by CSV.jl and similar packages.
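
A minimal sketch of what such a two-pass approach could look like, assuming ASCII data as in item 2 above (the function names, file name, and sentinels are made up for illustration; this is not the actual unpublished package):

using Dates

# pass 1: collect every raw value seen in each column, so sentinel values
# like "**" or "#########" can be spotted before committing to a type
function scan_columns(path, colranges)
    seen = [Set{String}() for _ in colranges]
    for line in eachline(path)
        for (k, r) in enumerate(colranges)
            push!(seen[k], line[r])
        end
    end
    return seen
end

# pass 2: with the sentinels decided, parse for real, mapping them to missing
function parse_columns(path, colranges, parsers, sentinels)
    rows = []
    for line in eachline(path)
        push!(rows, [line[r] in sentinels ? missing : parsers[k](line[r])
                     for (k, r) in enumerate(colranges)])
    end
    return rows
end

colranges = [1:9, 10:32, 33:40]
seen = scan_columns("data.txt", colranges)    # inspect `seen` to pick sentinels
rows = parse_columns("data.txt", colranges,
                     [s -> parse(Int, s), strip, s -> Date(s, dateformat"yyyymmdd")],
                     Set(["**", "#########"]))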


It’s not about the size of your data, it’s about the diversity of your data (and it seems that yours is not “diverse” at all - all ASCII, for example).

Along this dimension, possibly not. Thank heavens for that. Your situation may be different, but for me, exotic encodings pretty much disappeared from the radar shortly after 2000. Even fixed width is a dinosaur that is disappearing; most sane data providers would use CSV or similar.

Also, given that the OP did not ask for non-ASCII, I don’t see why you keep raising this point repeatedly.

Did you not read his reply?

He very specifically stated that it is only in this particular case that his data is all ASCII, but that he wants to have it handle other character sets.

Am I missing something when I think that this could be as simple as

using StringEncodings
open(path, enc"LATIN2", "r") do io
    read_fixed_width_somehow(io, ...)
end

?