My experiences reading CSVs from the Fannie Mae datasets

xiaodai · February 1, 2018, 10:49am

I am pretty excited about the Fannie Mae data being released so I tried to load the data into Julia. Unfortunately, it’s too much work trying to produce a MWE as being able to load large datasets is part of the point. But if you are interested you should definitely download the data from the Fannie website.

I have extracted the first file in the Performance dataset so I tried to load it using CSV.jl, TextParse.jl, and uCSV.jl. All of them failed to load the data even after setting delim to ‘|’. This is important for wider adoption of Julia for data tasks; as I can see how someone who is not invested in Julia will move on to pandas and/or R at this point, as those tools just work. So some catching up to do on reading CSVs.

R’s data.table’s fread was quite good in this regard as I can just call fread and it reads it fine and I didn’t even have to specify the delim as it was auto-detected. Neat!

I got it to work in Julia by setting all the types, and because most columns contained missing, I needed Union{...,Missing} and that might have impacted performance. But it took 45 seconds vs data.table’s 20 seconds.

So it’s fair to ask; why use Julia at all? Because I think it has potential, especially with the JuliaDB.jl being worked on, and I think I can optimize many of the tasks to be faster than R in the future. But for now R seems to be better at manipulating “medium” data like Fannie Mae’s.

I will summarize some of my findings here

Using RCall.jl to read the CSV and saving it as feather file is a good strategy; as it’s almost as quick, and you don’t need to specify the types and it enables faster reads in the future by just reading the feather files
If you specify all the column types then your chance of being able to read it increases and it is slightly faster than 1.

Further details below for those that are interested

Code to read in data in Julia

using DataFrames, CSV, Missings

dirpath = "d:/data/fannie_mae/"
dirpaths = readdir(dirpath)
filepath = joinpath(dirpath, "Performance_2000Q1.txt")

const types = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing}, 
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing}, 
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing}, 
    Union{String, Missing}]

@time perf = CSV.read(filepath, delim='|', header = false, types = types) #45;

Read using RCall.jl and saving as feather
Interestingly, I can read the CSV using R’s data.table’s fread and save the data as Feather and read in the saved feather file in about 50 seconds. This seems to be a better strategy as saving a feather file in Julia takes longer than in R; @time Feather.write("julia_feather.feather", perf) # 27

using RCall
using Feather
function rfread(path)
    R"""
    feather::write_feather(data.table::fread($path), 'tmp.feather')
    gc()
    """
    Feather.read("tmp.feather")
end

@time perf = rfread(filepath) # 50 seconds

@time Feather.read("tmp.feather") # 9 seconds only

Tamas_Papp · February 1, 2018, 11:42am

I don’t know if you did this already, but typically opening an issue is the most helpful and constructive thing to do if you find a bug.

xiaodai · February 1, 2018, 11:49am

yeah. i have done one and will do more once i find a mwe. but it’s hard as the files are quite large, so it takes time to find the culprit.

Liso · February 1, 2018, 12:02pm

Sample files are without problem?

https://loanperformancedata.fanniemae.com/lppub-docs/performance-sample-file.txt https://loanperformancedata.fanniemae.com/lppub-docs/acquisition-sample-file.txt

avik · February 1, 2018, 12:04pm

Thanks @xiaodai. There is no doubt JuliaDB needs to be able to work with this data seamlessly, and quickly. Finding and using real world data however is not always an easy task, so reports such as these are very helpful. They’ll certainly help improve JuliaDB as well as other Julia packages.

Regards

Avik

xiaodai · February 1, 2018, 12:08pm

recommend that you download the full data.

nalimilan · February 1, 2018, 5:14pm

I guess the failure to load when you don’t specify the types was due to the fact that missing values appear only beyond the first rows used for type detection? We really need to do something about this. BTW, it would have been useful to post the error message.

Tamas_Papp · February 1, 2018, 5:45pm

I think that reading all the data before type detection would be a reasonable default. For small datasets, the performance drop is not an issue, and it would not fail on large datasets, just be slower. Users who read large data and care about speed more than the price of occasional failure could still restrict it to the first few rows.

The alternative is “widen when necessary”, similar to the mapping/collecting functions of Base, with the understanding that here you will guess wrong initially most of the time, but things should work out.

sswatson · February 1, 2018, 5:48pm

Just to note that one can fix this problem by setting the rows_for_type_detect keyword argument of CSV.read to a sufficiently large number (at risk of saying something that may be obvious to some). Clearly this is not the permanent solution you’re calling for, but I found it to be the simplest path to getting my code working. For users who are dealing with this problem, it could be worth a try.

shashi · February 1, 2018, 5:56pm

I was able to load the file using JuliaDB:

loadtable("Performance_2000Q1.txt",
     delim='|',
     nastrings=vcat(TextParse.NA_STRINGS, "X"),
     header_exists=false);

The options: 1) set | as delimiter, 2) considers X as a missing value (along with predefined ones), 3) says that headers don’t exist.

It took 80 seconds though… But can probably be improved if you specify all the column types.

In the process however I encountered a very unhelpful error message (when I didn’t specify header_exists=false) so that’s something I will try and fix.

piever · February 1, 2018, 6:06pm

It seems to me there are two issues at the moment:

File starts with too many non-missing values before encountering a missing: it fails because it assumes the column does not allow missing data
File starts with too many missing values before encountering a non-missing: it fails because it infers the column is of type Missing

I’d tend to agree with @tpapp that it’s probably better to be slow but correct by default.

nalimilan · February 1, 2018, 6:31pm

Agreed, we really need to fix that failure when missing values appear to late in the data. Until we implement something more efficient, I guess we could simply catch the exception and restart parsing the file with a wider type for the offending column.

piever · February 1, 2018, 6:43pm

I think TextParse is pretty good at this and doesn’t seem to be affected by error 1) or 2). Maybe whatever strategy it’s using could be implemented by CSV as well?

Another not very subtle approach is to start with all columns allowing missing data and make them stricter at the end. Does one need to copy the data to go from Vector{Union{T, Missing}} to Vector{T} ?

nalimilan · February 1, 2018, 7:22pm

No idea. Anyway the strategies are relatively clear, we just need somebody to do the work.

In theory no copy would be needed in either direction, so it would be even better to start with a Vector{T}, which is faster, and widen it if needed. Currently a copy is made, though, but it shouldn’t be too bad in terms of performance compared with the time required to parse the file.

pasha · February 1, 2018, 8:14pm

As you know the dataframe ecosystem in Julia is going through some rework with Nullables as DataFrames 0.11 gets to a full release.

In this state I also had a number of travails in getting a padded CSV with a blank line in it. DataFrames.readtable() is soon to be deprecated but has the ignorepadding parameter I needed.

xiaodai · February 1, 2018, 9:00pm

Yeah. Adding more rows to type was something I tried but eventually it gave another error which was hard to decipher. It’s hard to give a MWE as the file is large; so downloading the data and setting up a test suite to test “big ticket” datasets like Fannie Mae and Lending Club is the way to go

xiaodai · October 24, 2018, 11:07pm

Happy to report that TextParse.jl can read the Fannie Mae data with minimal help now

using CSVFiles, FileIO
@time a = load(File(format"csv", file_path), delim = '|', type_detect_rows = 1000)

works (!!!) for the first Performance files! The same code without type_detect_rows still fails though, and it can’t auto detect the dlimiter like R’s fread. So it’s not yet able to topple data.table but it’s still a good performer now!

xiaodai · February 20, 2019, 12:12pm

Pretty excited. I can finally load the Fannie Mae data using JuliaDB

Edit actually it failed complaining about out of memory.

using Distributed
addprocs(6)

@time @everywhere using JuliaDB, Dagger

datapath = "c:/data/Performance_All\\"
ifiles = joinpath_(datapath, readdir(datapath))

colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];

@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb\\", 
    delim='|', 
    header_exists=false, 
    filenamecol = "filename", 
    chunks = 256, 
    type_detect_rows = 2000, 
    colnames = colnames, 
    indexcols=["loan_id"])

Yifan_Liu · March 1, 2019, 10:45pm

Does JuliaDB actually provide out of core functionally as it claims? I also want to explore this feature a little bit. As for now, I haven’t seen any good working example of this except for the one that uses TrueFX data.

xiaodai · March 1, 2019, 11:18pm

Yeah. Still having trouble loading CSVs

Topic		Replies	Views
[ANN] Fread.jl - read CSVs faster with the help of R's {data.table} Package Announcements performance , data , csv	6	2035	October 9, 2019
Trying to analyse Fannie Mae data with JuliaDB Data juliadb	14	1337	August 31, 2019
Reading Data Is Still Too Slow Data	35	8738	August 2, 2019
Why do you use JuliaDB? General Usage	9	2129	October 28, 2019
CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R General Usage announcement	68	8663	March 23, 2022

My experiences reading CSVs from the Fannie Mae datasets

Related topics