I am doing my best to analyse the Fannie Mae data with JuliaDB. However, I have run into problems along the way.
The first step is to download the Fannie Mae data and unzip it all.
Next I tried to load the data using loadtable, but it didn't work.
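For reference, a minimal version of the call I first tried on the unsplit files looked roughly like this (a sketch from memory; ifiles is defined below, and the exact options may have differed):

@time jll = loadtable(ifiles, delim='|', header_exists=false)

That did not work on the full-size files, which is what prompted the splitting below.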
So I split the data into smaller chunks (around 250MB each, so an 11GB file ends up as roughly 44 chunks). I generated the commands to do that:
using Distributed, Statistics
addprocs(6)                        # add 6 worker processes
@time @everywhere using JuliaDB, Dagger
datapath = "c:/data/Performance_All/"
ifiles = joinpath.(datapath, readdir(datapath))
fsz = (x -> stat(x).size).(ifiles)               # file sizes in bytes
nchunks = ceil.(Int, fsz ./ (250 * 1024 * 1024)) # number of ~250MB chunks per file
###############################################################
# run the split code in bash BEFORE proceeding
# (requires GNU split on the PATH)
###############################################################
for (infile, outfile, n) in zip(ifiles, "c:/data/Performance_All_split/" .* readdir(datapath), nchunks)
    run(`split $infile -n l/$n $outfile`)   # `in` is a reserved word in Julia, hence `infile`
end
# example of the generated command for a single file:
run(`split "c:/data/Performance_All/Performance_2000Q1.txt" -n l/3 -d "c:/data/Performance_All_split/Performance_2000Q1.txt"`)
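(The l/ in -n l/$n tells GNU split to break only on line boundaries, so no record is cut in half; a plain -n $n would split mid-line.)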
###############################################################
# Specify the types of columns
###############################################################
const fmtypes = [
String, Union{String, Missing}, Union{String, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{String, Missing}, Union{String, Missing},
Union{String, Missing}, Union{String, Missing}, Union{String, Missing}, Union{String, Missing}, Union{String, Missing},
Union{String, Missing}, Union{String, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing},
Union{Float64, Missing}, Union{Float64, Missing}, Union{Float64, Missing}, Union{String, Missing}, Union{Float64, Missing},
Union{String, Missing}]
colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
"months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
"mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
"pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
"o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
"transfer_flg"];
#datapath = "C:/data/Performance_All_split"
datapath = "C:/data/ok"   # a directory with only a few of the split files, for testing
ifiles = joinpath.(datapath, readdir(datapath))
Loading the data is fine, taking about 0.5 seconds:
@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb",   # save the parsed table in binary form
    delim = '|',
    header_exists = false,
    filenamecol = "filename",     # record which split file each row came from
    chunks = length(ifiles),      # one chunk per split file
    #type_detect_rows = 20000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols = ["loan_id"]);
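Because output is specified, a later session can reload the saved binary table directly instead of re-parsing the text files (a sketch, assuming the path above):

jll = load("c:/data/fm.jldb")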
But displaying the table takes extremely long:
jll     # this takes literally hours!!!
jll[1]  # this also takes hours (actually I don't know, I killed it after about 20 minutes)
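For comparison, here is how I would check whether display is also slow for a purely local, single-file table (a sketch reusing the parsers above; I have not timed this):

small = loadtable(ifiles[1],
    delim = '|', header_exists = false,
    colnames = colnames, colparsers = fmtypes,
    indexcols = ["loan_id"])
small     # display the single-file table
small[1]  # look up the first row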
Edit: the issue is with the display, as loading the data seems fine.
So I am a bit stuck at the moment.