Trying to analyse Fannie Mae data with JuliaDB

I am doing my best to analyse Fannie Mae data with JuliaDB, but I have run into problems.

The first step is to download the Fannie Mae data files and unzip them all.

I then tried to load the data using loadtable, but it didn't work.

So I split the data into smaller chunks (around 250 MB each, so an 11 GB file ends up in roughly 44 chunks). I generated the commands to do that:

using Distributed, Statistics
addprocs(6)  # add 6 worker processes

@time @everywhere using JuliaDB, Dagger

datapath = "c:/data/Performance_All/"
ifiles = joinpath.(datapath, readdir(datapath))

# file sizes in bytes
fsz = (x -> stat(x).size).(ifiles)

# number of ~250 MB chunks needed per file
nchunks = Int.(ceil.(fsz ./ (250 * 1024 * 1024)))

##############################################################
# run the split code in bash BEFORE proceeding
###############################################################

# note: `in` is a reserved word in Julia, so the loop variable is named `infile`
for (infile, outfile, n) in zip(ifiles, "c:/data/Performance_All_split/" .* readdir(datapath), nchunks)
    run(`split $infile -n l/$n $outfile`)
end

run(`split "c:/data/Performance_All/Performance_2000Q1.txt" -n l/3 -d "c:/data/Performance_All_split/Performance_2000Q1.txt"`)
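For reference, here is a minimal, self-contained sketch of what split -n l/N does, assuming GNU coreutils split (on Windows this would need Git Bash, MSYS, or WSL; the sample file names are made up for illustration):

```shell
# Create a small sample file and split it into 3 chunks on line
# boundaries (-n l/3), with numeric suffixes (-d), as in the commands above.
printf 'line%s\n' 1 2 3 4 5 6 > sample.txt
split -n l/3 -d sample.txt sample_chunk_
ls sample_chunk_*                 # sample_chunk_00 sample_chunk_01 sample_chunk_02
cat sample_chunk_0* > rejoined.txt
cmp sample.txt rejoined.txt       # the chunks concatenate back to the original
```

The l/N mode matters here: it distributes whole lines across the chunks, so no record in the pipe-delimited performance files gets cut in half.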

###############################################################
# Specify the types of columns
###############################################################

const fmtypes = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]

colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];
#datapath = "C:/data/Performance_All_split"
datapath = "C:/data/ok"
ifiles = joinpath.(datapath, readdir(datapath))

Loading the data is fine, at about 0.5 seconds:

@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb/",
    delim = '|',
    header_exists = false,
    filenamecol = "filename",
    chunks = length(ifiles),
    #type_detect_rows = 20000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols = ["loan_id"]);

But displaying the table takes a very long time:

jll # this takes literally hours!!!

jll[1] # this also takes hours (actually I don't know I shut it off after about 20mins)

Edit: the issue is with the display, as loading the data seems fine.

So I am a bit stuck at the moment.


Raise an issue at https://github.com/JuliaComputing/JuliaDB.jl/issues.

I did that 3 days ago: https://github.com/JuliaComputing/JuliaDB.jl/issues/259

The final comment in that issue is:

Just tagged a new release with the fix. I’m surprised the deprecated syntax survived this long/was missed by femtocleaner.

The issue you raise here is not the same as the issue you filed there. That issue was “Error loading Fannie Mae data”. This issue, if I read it correctly, is about the time required to display a large table, not about an error in loading it.

Hmmm, I might have gotten them mixed up. Will look into it.

I have updated this to a very small, self-contained MWE. I can't get JuliaDB working at all. The data is from the Fannie Mae dataset.

using JuliaDB, Dagger

##############################################################
# Download & Extract data
###############################################################

# `;` at the start of a Julia REPL line drops into shell mode
;wget https://raw.githubusercontent.com/xiaodaigh/JuliaDB.jl/master/ok.csv

##############################################################
# Specify the types of columns
###############################################################

fmtypes = [
    Int64,                      String,                     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]


# colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
#     "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
#     "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
#     "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
#     "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
#     "transfer_flg"];

@time jll = loadtable(
    "c:/data/perf-jld/ok.csv",
    output = "c:/data/fm.jldb/",
    delim=',',
    header_exists=true,
    #filenamecol = "filename",
    #chunks = length(ifiles),
    #type_detect_rows = 20_000,
    # colnames = colnames,
    colparsers = fmtypes,
    indexcols=["Column1"]);

I’m pretty shocked that the Julia community, which is usually pretty helpful, has left this person struggling for 6 months. Could someone from JuliaDB help this person please?


Hopefully, it gets easier. It used to be that the Fannie Mae data was behind a registration wall (free, but registration required). Now it’s available via wget from Rapids.ai (see https://docs.rapids.ai/datasets/mortgage-data), so it’s easier to make an MWE.

But the datasets are huge, and being able to handle huge datasets is the point of JuliaDB, so I think having a set of working code for Fannie Mae would be a great resource for the community.

I have submitted a few issues, so hopefully the developers will get to them at some point: https://github.com/JuliaComputing/JuliaDB.jl/issues/created_by/xiaodaigh

I admit I am not able to commit to digging through the code and debugging it, as I have other OSS commitments and a full-time (non-Julia) job.

The JuliaDB website clearly shows that it works for one set of data (which is 74 GB), so it works for something. But it’s not mature enough to have guaranteed support for other datasets.

I see that JuliaDB isn’t one of the curated packages in JuliaPro, so I assume Julia Computing isn’t putting resources into building it up at the moment. Source: https://juliacomputing.com/products/juliapro


We were hoping for some commercial traction around JuliaDB but that hasn’t materialized so until someone wants to pay for it, it’s very much on a back burner. It was updated for 1.0 and should work on recent Julias though. If anyone else wants to drive maintenance of the project, updates and fixes would certainly be welcomed.

Does that mean all the other officially supported packages are financially supported somehow?

If a Julia Computing customer who is paying for support is using a package then maintaining it is prioritized and any issues they specifically encounter are fixed right away. If no paying customer needs something fixed then it happens on a best effort basis—if and when someone has time to do it. If you’re not a paying customer then you can use it or not but there’s no maintenance guarantees for non-customers. If you or @xiaodai or @anon67531922 want to purchase JuliaSure, we would be happy to get this issue—and any others you might have—sorted out ASAP.


Wish I were in a position to do something like that. It may be worth trying to sell it to some NIH projects right now. The current director really pushes big data collection, which results in a completely reinvented database system for each project. With Julia’s ability to use AbstractArray interfaces for so many diverse data types, it presents a real opportunity for a centralized database API.

We do want to return to JuliaDB in the future, but we have to focus on what people are actually paying us for. I do still think it has great promise but it doesn’t match the usability of DataFrames right now.

The only working example of JuliaDB on large datasets is the TrueFX example:

However, I found its workflow problematic. Instead of loading so many small CSV files into JuliaDB, it is much easier and faster to use a bash command to combine these CSV files into one large CSV file and then load it into JuliaDB.
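A minimal sketch of that combining step, assuming each chunk repeats a header row (the file names here are hypothetical):

```shell
# Build two tiny CSV chunks to stand in for the real files.
printf 'id,val\n1,a\n2,b\n' > part1.csv
printf 'id,val\n3,c\n'      > part2.csv

# Keep the header from the first file only, then append data rows.
head -n 1 part1.csv > combined.csv
for f in part1.csv part2.csv; do
    tail -n +2 "$f" >> combined.csv
done
cat combined.csv
```

For the raw Fannie Mae performance files, which are pipe-delimited and have no header row (hence header_exists=false above), a plain `cat part* > combined.txt` is enough.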

However, when I try to load the combined large csv file, JuliaDB simply hangs and stops responding.

Yeah. JuliaDB chunks correspond to the number of files, so one big file would be impossible for now unless you have enough RAM to load it.