I am pretty excited about the Fannie Mae data being released, so I tried to load the data into Julia. Unfortunately, it’s too much work to produce an MWE, since being able to load large datasets is part of the point. But if you are interested, you should definitely download the data from the Fannie website.
I extracted the first file in the Performance dataset and tried to load it using CSV.jl, TextParse.jl, and uCSV.jl. All of them failed to load the data, even after setting delim to ‘|’. This matters for wider adoption of Julia in data work: someone who is not already invested in Julia would move on to pandas and/or R at this point, because those tools just work. So there is some catching up to do on reading CSVs.
R’s data.table::fread was quite good in this regard: I could just call fread and it read the file fine, and I didn’t even have to specify the delim, as it was auto-detected. Neat!
I got it to work in Julia by specifying all the column types; because most columns contain missing values, I needed Union{..., Missing} types, which may have hurt performance. Loading took 45 seconds vs data.table’s 20 seconds.
So it’s fair to ask: why use Julia at all? Because I think it has potential, especially with JuliaDB.jl under active development, and I think many of these tasks can be optimized to be faster than R in the future. But for now, R seems to be better at manipulating “medium” data like Fannie Mae’s.
I will summarize some of my findings here:
- Reading the CSV via RCall.jl and saving it as a Feather file is a good strategy: it’s almost as quick, you don’t need to specify the types, and it enables faster reads in the future by just loading the Feather file
- If you specify all the column types, your chances of reading the file successfully increase, and it is slightly faster than option 1
Further details below for those who are interested.
Code to read the data in Julia
using DataFrames, CSV, Missings
dirpath = "d:/data/fannie_mae/"
filepath = joinpath(dirpath, "Performance_2000Q1.txt")
const types = [
String, Union{String, Missing}, Union{String, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{String, Missing}, Union{String, Missing},
Union{String, Missing}, Union{String, Missing}, Union{String, Missing}, Union{String, Missing}, Union{String, Missing},
Union{String, Missing}, Union{String, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{String, Missing}, Union{Float64, Missing},
Union{String, Missing}]
@time perf = CSV.read(filepath, delim='|', header = false, types = types) # 45 seconds
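Spelling out all 31 Union types by hand is tedious. As a sketch, the same vector can be built with a comprehension; the Float64 column indices below are simply read off the explicit list above:

```julia
# Numeric columns, per the explicit type list above; column 1 is the only
# column guaranteed non-missing, and every remaining column holds strings.
floatcols = Set([4:8; 18:28; 30])

types2 = [i == 1 ? String :
          i in floatcols ? Union{Float64, Missing} :
          Union{String, Missing} for i in 1:31]
```

This should match the hand-written types vector element for element.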
Reading using RCall.jl and saving as Feather
Interestingly, I can read the CSV using R’s data.table::fread, save the data as Feather, and read the saved Feather file back into Julia in about 50 seconds. This seems to be the better strategy, as writing a Feather file in Julia takes longer than in R:
@time Feather.write("julia_feather.feather", perf) # 27 seconds
using RCall
using Feather
function rfread(path)
    # Use R to parse the CSV with data.table::fread and write it out as Feather
    R"""
    feather::write_feather(data.table::fread($path), 'tmp.feather')
    gc()
    """
    # Read the Feather file back into Julia
    Feather.read("tmp.feather")
end
@time perf = rfread(filepath) # 50 seconds
@time Feather.read("tmp.feather") # 9 seconds only
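Putting the two findings together, here is a small caching helper (a sketch of my own; the convention of putting the `.feather` cache next to the source file is an assumption): it converts via R only when no cached Feather file exists yet, and hits the much faster Feather read on every subsequent call.

```julia
using RCall, Feather

# Read `path` through a Feather cache: on the first call, use R's
# data.table::fread to parse the CSV and feather::write_feather to cache it;
# later calls skip R entirely and just read the Feather file.
function cached_read(path, cache = path * ".feather")
    if !isfile(cache)
        R"""
        feather::write_feather(data.table::fread($path), $cache)
        gc()
        """
    end
    Feather.read(cache)
end
```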