My experiences reading CSVs from the Fannie Mae datasets

performance
csv

#1

I am pretty excited about the Fannie Mae data being released so I tried to load the data into Julia. Unfortunately, it’s too much work trying to produce a MWE as being able to load large datasets is part of the point. But if you are interested you should definitely download the data from the Fannie website.

I have extracted the first file in the Performance dataset so I tried to load it using CSV.jl, TextParse.jl, and uCSV.jl. All of them failed to load the data even after setting delim to ‘|’. This is important for wider adoption of Julia for data tasks; as I can see how someone who is not invested in Julia will move on to pandas and/or R at this point, as those tools just work. So some catching up to do on reading CSVs.

R’s data.table’s fread was quite good in this regard as I can just call fread and it reads it fine and I didn’t even have to specify the delim as it was auto-detected. Neat!

I got it to work in Julia by setting all the types, and because most columns contained missing, I needed Union{...,Missing} and that might have impacted performance. But it took 45 seconds vs data.table’s 20 seconds.

So it’s fair to ask; why use Julia at all? Because I think it has potential, especially with the JuliaDB.jl being worked on, and I think I can optimize many of the tasks to be faster than R in the future. But for now R seems to be better at manipulating “medium” data like Fannie Mae’s.

I will summarize some of my findings here

  1. Using RCall.jl to read the CSV and saving it as feather file is a good strategy; as it’s almost as quick, and you don’t need to specify the types and it enables faster reads in the future by just reading the feather files
  2. If you specify all the column types then your chance of being able to read it increases and it is slightly faster than 1.

Further details below for those that are interested

Code to read in data in Julia

using DataFrames, CSV, Missings

dirpath = "d:/data/fannie_mae/"
dirpaths = readdir(dirpath)
filepath = joinpath(dirpath, "Performance_2000Q1.txt")

const types = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing}, 
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing}, 
    Union{String, Missing}]

@time perf = CSV.read(filepath, delim='|', header = false, types = types) #45;

Read using RCall.jl and saving as feather
Interestingly, I can read the CSV using R’s data.table’s fread and save the data as Feather and read in the saved feather file in about 50 seconds. This seems to be a better strategy as saving a feather file in Julia takes longer than in R; @time Feather.write("julia_feather.feather", perf) # 27

using RCall
using Feather
function rfread(path)
    R"""
    feather::write_feather(data.table::fread($path), 'tmp.feather')
    gc()
    """
    Feather.read("tmp.feather")
end

@time perf = rfread(filepath) # 50 seconds

@time Feather.read("tmp.feather") # 9 seconds only

Benchmarking ways to write/load DataFrames IndexedTables to disk
#2

I don’t know if you did this already, but typically opening an issue is the most helpful and constructive thing to do if you find a bug.


#3

yeah. i have done one and will do more once i find a mwe. but it’s hard as the files are quite large, so it takes time to find the culprit.


#4

Sample files are without problem?

https://loanperformancedata.fanniemae.com/lppub-docs/performance-sample-file.txt https://loanperformancedata.fanniemae.com/lppub-docs/acquisition-sample-file.txt


#5

Thanks @xiaodai. There is no doubt JuliaDB needs to be able to work with this data seamlessly, and quickly. Finding and using real world data however is not always an easy task, so reports such as these are very helpful. They’ll certainly help improve JuliaDB as well as other Julia packages.

Regards

Avik


#6

recommend that you download the full data.


#7

I guess the failure to load when you don’t specify the types was due to the fact that missing values appear only beyond the first rows used for type detection? We really need to do something about this. BTW, it would have been useful to post the error message.


#8

I think that reading all the data before type detection would be a reasonable default. For small datasets, the performance drop is not an issue, and it would not fail on large datasets, just be slower. Users who read large data and care about speed more than the price of occasional failure could still restrict it to the first few rows.

The alternative is “widen when necessary”, similar to the mapping/collecting functions of Base, with the understanding that here you will guess wrong initially most of the time, but things should work out.


#9

Just to note that one can fix this problem by setting the rows_for_type_detect keyword argument of CSV.read to a sufficiently large number (at risk of saying something that may be obvious to some). Clearly this is not the permanent solution you’re calling for, but I found it to be the simplest path to getting my code working. For users who are dealing with this problem, it could be worth a try.


#10

I was able to load the file using JuliaDB:

loadtable("Performance_2000Q1.txt",
     delim='|',
     nastrings=vcat(TextParse.NA_STRINGS, "X"),
     header_exists=false);

The options: 1) set | as delimiter, 2) considers X as a missing value (along with predefined ones), 3) says that headers don’t exist.

It took 80 seconds though… But can probably be improved if you specify all the column types.

In the process however I encountered a very unhelpful error message (when I didn’t specify header_exists=false) so that’s something I will try and fix.


#11

It seems to me there are two issues at the moment:

  1. File starts with too many non-missing values before encountering a missing: it fails because it assumes the column does not allow missing data
  2. File starts with too many missing values before encountering a non-missing: it fails because it infers the column is of type Missing

I’d tend to agree with @tpapp that it’s probably better to be slow but correct by default.


#12

Agreed, we really need to fix that failure when missing values appear to late in the data. Until we implement something more efficient, I guess we could simply catch the exception and restart parsing the file with a wider type for the offending column.


#13

I think TextParse is pretty good at this and doesn’t seem to be affected by error 1) or 2). Maybe whatever strategy it’s using could be implemented by CSV as well?

Another not very subtle approach is to start with all columns allowing missing data and make them stricter at the end. Does one need to copy the data to go from Vector{Union{T, Missing}} to Vector{T} ?


#14

No idea. Anyway the strategies are relatively clear, we just need somebody to do the work. :wink:

In theory no copy would be needed in either direction, so it would be even better to start with a Vector{T}, which is faster, and widen it if needed. Currently a copy is made, though, but it shouldn’t be too bad in terms of performance compared with the time required to parse the file.


#15

As you know the dataframe ecosystem in Julia is going through some rework with Nullables as DataFrames 0.11 gets to a full release.

In this state I also had a number of travails in getting a padded CSV with a blank line in it. DataFrames.readtable() is soon to be deprecated but has the ignorepadding parameter I needed.


#16

Yeah. Adding more rows to type was something I tried but eventually it gave another error which was hard to decipher. It’s hard to give a MWE as the file is large; so downloading the data and setting up a test suite to test “big ticket” datasets like Fannie Mae and Lending Club is the way to go