Julia Support for File Loading

As a first learning project in Julia I have started to convert a project I did in R into Julia, and I quickly ran into a couple of roadblocks, all of which are related to reading in legacy data files like the ones used in the project I selected. While these features are most likely not needed for forward-looking applications, if Julia is looking to target industries that are heavy into COBOL, like government or banking, then supporting them will really help. I have reviewed the CSV package, which I feel is the most robust way to read file-based data in Julia, and I think adding these features to it could be fairly easy, but first I wanted to check that the functionality I need does not already exist elsewhere in Julia, as my knowledge is still very entry level.

The files at issue are fixed-width files, where fields are not delimited by any character but are instead always the same width on every line. This snippet is an example of such a file.

000167116NAME WITHHELD BY AGENCY19730930AF**#########30-34 **GS055-9   0322C007951*15F1
000205258NAME WITHHELD BY AGENCY19730930AF**#########30-34 08GS0515-19 0305C009236810F1
000223093NAME WITHHELD BY AGENCY19730930AF**#########20-24 **WG01< 1   7408B006781***F2
000226198NAME WITHHELD BY AGENCY19730930AF***********35-39 ******< 1   ****       8  *2
000246521NAME WITHHELD BY AGENCY19730930AF***********20-24 ******< 1   ****       8  *2

Julia is missing capabilities that would make parsing this sort of file a couple of lines of code instead of multiple pages. These features are:

  • Support for fixed field width parsing in addition to delimited parsing
  • Support for a list of NA/null values that are not just empty fields. In the above sample, ** and ######### are simple, but not the only, examples.
  • Support for specifying the date format on a per-field basis. A single file might contain different formats like YYYYMMDD, YYYYMM, or DDMMYYYY.

I will most likely look at writing some of this functionality myself for this project but wanted to check with the community to hear thoughts on how to support older datasets like this one.


You’ll probably be interested in Reading Fixed-Width Column Data.

Julia is perfectly capable of parsing this in a few lines of code without any special library (you read lines, extract fields, parse them). E.g. with something like

using Missings

# pre-1.0 Julia: tryparse returns a Nullable, so map failures to missing
null_to_missing(x) = isnull(x) ? missing : unsafe_get(x)
parsefield(::Type{Int}, string) = null_to_missing(tryparse(Int, string))
parsefield(::Type{String}, string) = copy(string)
parsefield(df::DateFormat, string) = null_to_missing(tryparse(Date, string, df))

then iterating on

colspecs = (1:12 => Int,
            13:32 => String,
            33:40 => DateFormat("yyyymmdd"))

You can filter for missing values (which literally could be anything in these kinds of datasets), use generated functions for iteration, etc.
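
For concreteness, here is a minimal sketch of that iteration written for current Julia (1.x, where tryparse returns nothing rather than a Nullable), assuming single-byte (ASCII) data so character positions equal byte positions; the column boundaries and the data.txt file name are just illustrative guesses based on the sample above, not an actual codebook:

using Dates

# current Julia: tryparse returns `nothing` on failure, so map that to missing
nothing_to_missing(x) = x === nothing ? missing : x
parsefield(::Type{Int}, s) = nothing_to_missing(tryparse(Int, s))
parsefield(::Type{String}, s) = String(strip(s))
parsefield(df::DateFormat, s) = nothing_to_missing(tryparse(Date, s, df))

# illustrative column spec: character ranges paired with target types/formats
colspecs = (1:9   => Int,
            10:32 => String,
            33:40 => DateFormat("yyyymmdd"))

# parse one line into a tuple of fields, then map over every line of the file
parseline(line) = map(((r, t),) -> parsefield(t, SubString(line, r)), colspecs)
rows = [parseline(line) for line in eachline("data.txt")]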

The difficult part is establishing the field boundaries. There is no reliable algorithmic way to do this, so these kinds of files usually come with specs. If you are unlucky, it is a scan of a bunch of typewritten pages from the 1970s. If you are a bit more lucky, it is something semi-standard like DDF. You may find this topic and the one linked there useful.

Reading in column spec formats would be a great addition to the ecosystem. The above-mentioned DDF is used in the social sciences; I am sure other fields have their own formats. R seems to support some of these.

I think your post is a good place to start. I agree the hard part is the field boundaries, but someone looking to parse fixed-width files hopefully has a spec, and I do in this case.

Of course I could take the easy way out and just convert the files to an rds and import that. But that removes the fun of learning :grinning:

One thing to be very careful about here: those column specs are character-based, not code-unit-based.
Most likely, the original file was written with some 8-bit character set, and if you do manage to get it converted to UTF-8, or if it already has been, you’ll run into problems because the indices of the fields in Julia will be variable, depending on the data.

That’s why using something that will always give you character-based addressing (such as the UniStr type in the Strs.jl package, as soon as that is finished, which I believe will be shortly) will work much better for this sort of processing.

I have never seen anything but 7-bit ASCII in fixed-width files, but if that is a concern, one could read a Vector{UInt8} and extract fields from there, converting to UTF-8 as necessary. So I don’t quite see why/how I would use your package in this context.
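
A minimal sketch of that byte-level approach (the file name and field boundaries are placeholders, and the final String conversion assumes the field bytes are already valid UTF-8):

# read each record as raw bytes (no string decoding), slice the fields by
# byte offset, and only convert each field to a String at the end
function read_fixed_width_bytes(path, colranges)
    rows = Vector{Vector{String}}()
    open(path) do io
        while !eof(io)
            bytes = readuntil(io, UInt8('\n'))   # one record as Vector{UInt8}
            !isempty(bytes) && bytes[end] == UInt8('\r') && pop!(bytes)  # tolerate CRLF
            isempty(bytes) && continue
            push!(rows, [String(bytes[r]) for r in colranges])
        end
    end
    return rows
end

rows = read_fixed_width_bytes("data.txt", [1:9, 10:32, 33:40])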

How do you think people stored names for people outside of English speaking countries?

How would that even work? If the file has already been converted from whatever 8-bit original character set it was in to UTF-8, each line would be a variable number of bytes, depending on how many characters ended up being encoded in 2-4 bytes. You’re just pushing the work of handling the complexities of UTF-8 encoding onto all programmers.

You would do something like readlines(file, UniStr), and then the fields could be accessed simply by getting substrings, e.g. lines[295][13:32] would return the string field from characters 13-32 in line 295.

It would also work with standard String; you just need to call nextind/prevind and pass the number of characters to move by. But the OP’s example doesn’t contain any non-ASCII characters, and no such requirement is mentioned either, so it’s not obvious it will be needed for this particular application.
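
A minimal sketch of what that looks like with nextind on a plain String (the helper name is made up for illustration):

# extract characters `from` through `to` (counted in characters, not bytes)
# from a String, which stays correct even with multi-byte UTF-8 data
function charfield(line::AbstractString, from::Integer, to::Integer)
    i = nextind(line, 0, from)   # byte index of the `from`-th character
    j = nextind(line, 0, to)     # byte index of the `to`-th character
    return line[i:j]
end

charfield("000167116NAME WITHHELD BY AGENCY19730930", 13, 32)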


Although he didn’t mention non-ASCII data specifically, he was asking about the lack of general capabilities in Julia to handle this sort of problem, important for banking and government, and he did not say he was interested in dealing with only ASCII data (even in the U.S., with the number of Spanish names with accents in them, or Canada, with French names, being able to handle at least ISO-8859-1 or CP-1252 would be critical).

If you have code that works in R, you could just use it combined with the RCall package to get the data into Julia. It’s really easy. An example would look like this.

using RCall, DataFrames
R"""
# R code which loads the data into a data frame (or similar object) called B
"""
@rget B
B # B is now a DataFrame in Julia

I’ve been using this strategy myself. R is very good at importing data, particularly with packages like haven. Combined with RCall, I can do the initial processing in R but use Julia for the complicated parts.


In this case the data is all ASCII. I am going to work on a quick framework to solve the loading problem at a high level. For now I would just use typical mechanisms to solve the problem, but I hope to build it in such a way that whatever converts the raw byte data to a character string is also capable of converting to the proper character set in the cases where that is needed.

Tamas is probably thinking about files with something like CP-1250 encoding.

It is a question of readability and performance.

help?> readline
search: readline readlines readlink

  readline(io::IO=STDIN; keep::Bool=false)
  readline(filename::AbstractString; keep::Bool=false)

  Read a single line of text from the given I/O stream or file (defaults to STDIN). When reading from a file, the text is assumed to be encoded in UTF-8. 

Those few lines of code will grow if you think more about the problem.

Not really. I must have read about a TB of fixed-width data with <100 LOC. The challenging part is not reading and parsing, but where the data ends up: for small datasets I use a DataFrame, otherwise custom solutions.

I believe you! :slight_smile:

But I think that you are probably thinking like a researcher who needs to read a few data sources (no matter how big they are) from time to time, where importing data is a non-repeating, single-pass task.

This topic is about doing a “COBOL like government or banking” job, where there are plenty of files with different structures, encodings, date formats, decimal signs, etc.

IT industry work involves a mix of systems where the transition period is long, and these kinds of files are and will be generated continuously for several years (maybe “forever”). I understand the need for a simply configurable tool to read, transform, and save/upload/insert_to_db…

It may not be very easy to write (and maintain) a tool with good performance that is easy to use for these kinds of tasks. But it is probably a good idea, and easy enough for a first learning project! :slight_smile:


I have a somewhat trivial, unpublished package along the lines above which I could polish up and make available, possibly when v0.7 comes out (because I find the new String framework nice).

I have not made it public because it does not address the following questions:

  1. how to get the column layout (the elephant in the room; most of my errors come from miscoding that),
  2. how much one can rely on characters mapping to bytes. It currently assumes that everything is ASCII.
  3. fancy things like column autodetection.

IMO fixed-width data does not make sense with encodings like UTF-8, and column autodetection can lead to nasty surprises (e.g. detecting on the first N elements, then finding that in row K ≫ N someone had the bright idea of using XXXXXX00 for missing values without mentioning it in the codebook). So I am currently using a full two-pass approach for ingesting data, which is not the style followed by CSV.jl and similar packages.
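
A minimal sketch of what such a two-pass approach could look like, assuming ASCII data as in item 2 above (the function names, file name, and sentinels are made up for illustration; this is not the actual unpublished package):

using Dates

# pass 1: collect every raw value seen in each column, so sentinel values
# like "**" or "#########" can be spotted before committing to a type
function scan_columns(path, colranges)
    seen = [Set{String}() for _ in colranges]
    for line in eachline(path)
        for (k, r) in enumerate(colranges)
            push!(seen[k], line[r])
        end
    end
    return seen
end

# pass 2: with the sentinels decided, parse for real, mapping them to missing
function parse_columns(path, colranges, parsers, sentinels)
    rows = []
    for line in eachline(path)
        push!(rows, [line[r] in sentinels ? missing : parsers[k](line[r])
                     for (k, r) in enumerate(colranges)])
    end
    return rows
end

colranges = [1:9, 10:32, 33:40]
seen = scan_columns("data.txt", colranges)    # inspect `seen` to pick sentinels
rows = parse_columns("data.txt", colranges,
                     [s -> parse(Int, s), strip, s -> Date(s, dateformat"yyyymmdd")],
                     Set(["**", "#########"]))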


It’s not about the size of your data, it’s about the diversity of your data (and it seems that yours is not “diverse” at all - all ASCII, for example).

Along this dimension, possibly not. Thank heavens for that. Your situation may be different, but for me, exotic encodings pretty much disappeared from the radar shortly after 2000. Even fixed width is a dinosaur that is disappearing; most sane data providers would use CSV or similar.

Also, given that the OP did not ask for non-ASCII, I don’t see why you keep raising this point repeatedly.

Did you not read his reply?

He very specifically stated that it is only in this particular case that his data is all ASCII, but that he wants to have it handle other character sets.

Am I missing something when I think that this could be as simple as

using StringEncodings
open(path, enc"LATIN2", "r") do io
    read_fixed_width_somehow(io, ...)
end

?