Trying to analyse Fannie Mae data with JuliaDB

juliadb
#1

I am doing my best to analyse the Fannie Mae data with JuliaDB. However, I have run into a couple of problems.

The first step is to download the Fannie Mae data files and unzip them all.

Next I tried to load the data using loadtable, but it didn't work.
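For reference, the naive call looked roughly like this (a sketch reconstructed from memory, so the exact options are approximate):

using JuliaDB

datapath = "c:/data/Performance_All/"
jll = loadtable(
    joinpath.(datapath, readdir(datapath)),   # all raw pipe-delimited files at once
    delim = '|',
    header_exists = false)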

So I split the data into smaller chunks (around 250 MB each, so an 11 GB file ends up as roughly 44 chunks). I generated the commands to do that as follows.

using Distributed, Statistics
addprocs(6)                        # start 6 worker processes

@time @everywhere using JuliaDB, Dagger

datapath = "c:/data/Performance_All/"
ifiles = joinpath.(datapath, readdir(datapath))

# size of each raw file in bytes
fsz = (x -> stat(x).size).(ifiles)

# number of ~250 MB chunks each file needs to be split into
nchunks = Int.(ceil.(fsz ./ (250 * 1024 * 1024)))

##############################################################
# run the split code in bash BEFORE proceeding
###############################################################

for (infile, outfile, n) in zip(ifiles, "c:/data/Performance_All_split/" .* readdir(datapath), nchunks)
    # split each file into n roughly equal, line-based chunks
    run(`split $infile -n l/$n $outfile`)
end

# example: split one quarterly file into 3 line-based chunks, with numeric suffixes
run(`split "c:/data/Performance_All/Performance_2000Q1.txt" -n l/3 -d "c:/data/Performance_All_split/Performance_2000Q1.txt"`)

###############################################################
# Specify the types of columns
###############################################################

const fmtypes = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]

colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];
#datapath = "C:/data/Performance_All_split"
datapath = "C:/data/ok"
ifiles = joinpath.(datapath, readdir(datapath))

Loading the data is fine, at about 0.5 seconds:

@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb\\",
    delim='|',
    header_exists=false,
    filenamecol = "filename",
    chunks = length(ifiles),
    #type_detect_rows = 20000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols=["loan_id"]);

But displaying the table takes very long:

jll # this takes literally hours!!!

jll[1] # this also takes hours (actually I don't know; I stopped it after about 20 minutes)

Edit: the issue is with the display, as loading the data seems fine.

So I am a bit stuck at the moment.
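As a stop-gap (just a sketch of a workaround, nothing official), I can load a single split file into a plain in-memory table and display that, to confirm the parsing options are right while the distributed display problem gets sorted out:

jll_small = loadtable(
    ifiles[1:1],                   # just the first split file
    delim = '|',
    header_exists = false,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols = ["loan_id"]);

jll_small   # a small, single-chunk table displays quickly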

#2

Raise an issue at https://github.com/JuliaComputing/JuliaDB.jl/issues.

#3

I did that 3 days ago: https://github.com/JuliaComputing/JuliaDB.jl/issues/259

#4

The final comment in that issue is:

Just tagged a new release with the fix. I’m surprised the deprecated syntax survived this long/was missed by femtocleaner.

The issue you raise here is not the same as the issue you filed there. That issue was “Error loading Fannie Mae data”. This issue, if I read it correctly, is about the time required to display a large table, not about an error in loading it.

#5

Hmmm, I might have gotten them mixed up. Will look into it.