Trying to analyse Fannie Mae data with JuliaDB

I am doing my best to analyse Fannie Mae data with JuliaDB, but I have run into problems.

The first step is to download the Fannie Mae data files and unzip them all.

I then tried to load the data using loadtable, but it didn't work.

So I split the data into smaller chunks (around 250 MB each, so an 11 GB file ends up in roughly 44 chunks). I generated the commands to do that:

using Distributed, Statistics
addprocs(6)  # add 6 worker processes

@time @everywhere using JuliaDB, Dagger

datapath = "c:/data/Performance_All/"
ifiles = joinpath.(datapath, readdir(datapath))

# file sizes in bytes
fsz = (x -> stat(x).size).(ifiles)

# number of ~250 MB chunks needed per file
nchunks = Int.(ceil.(fsz ./ (250 * 1024 * 1024)))

##############################################################
# run the split code in bash BEFORE proceeding
###############################################################

# note: `in` is a reserved word in Julia, so the loop variable is named `infile`
for (infile, outfile, n) in zip(ifiles, "c:/data/Performance_All_split/" .* readdir(datapath), nchunks)
    run(`split $infile -n l/$n $outfile`)
end

run(`split "c:/data/Performance_All/Performance_2000Q1.txt" -n l/3 -d "c:/data/Performance_All_split/Performance_2000Q1.txt"`)
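For reference, here is a minimal, self-contained sketch of what split -n l/N does, assuming GNU coreutils split (on Windows this would need Git Bash, MSYS, or WSL; the sample file names are made up for illustration):

```shell
# Create a small sample file and split it into 3 chunks on line
# boundaries (-n l/3), with numeric suffixes (-d), as in the commands above.
printf 'line%s\n' 1 2 3 4 5 6 > sample.txt
split -n l/3 -d sample.txt sample_chunk_
ls sample_chunk_*                 # sample_chunk_00 sample_chunk_01 sample_chunk_02
cat sample_chunk_0* > rejoined.txt
cmp sample.txt rejoined.txt       # the chunks concatenate back to the original
```

The l/N mode matters here: it distributes whole lines across the chunks, so no record in the pipe-delimited performance files gets cut in half.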

###############################################################
# Specify the types of columns
###############################################################

const fmtypes = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]

colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];
#datapath = "C:/data/Performance_All_split"
datapath = "C:/data/ok"
ifiles = joinpath.(datapath, readdir(datapath))

Loading the data is fine, at about 0.5 seconds:

@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb/",
    delim = '|',
    header_exists = false,
    filenamecol = "filename",
    chunks = length(ifiles),
    #type_detect_rows = 20000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols = ["loan_id"]);

But displaying the table takes a very long time:

jll # this takes literally hours!!!

jll[1] # this also takes hours (actually I don't know I shut it off after about 20mins)

Edit: the issue is with the display, as loading the data seems fine.

So I am a bit stuck at the moment.


Raise an issue at https://github.com/JuliaComputing/JuliaDB.jl/issues.

I did that 3 days ago: https://github.com/JuliaComputing/JuliaDB.jl/issues/259

The final comment in that issue is:

Just tagged a new release with the fix. I’m surprised the deprecated syntax survived this long/was missed by femtocleaner.

The issue you raise here is not the same as the issue you filed there. That issue was “Error loading Fannie Mae data”. This issue, if I read it correctly, is about the time required to display a large table, not about an error in loading it.

Hmmm, I might have gotten them mixed up. Will look into it.

I have updated this to a very small, self-contained MWE. I can't get JuliaDB working at all. The data is from the Fannie Mae dataset.

using JuliaDB, Dagger

##############################################################
# Download & Extract data
###############################################################

# `;` at the start of a Julia REPL line drops into shell mode
;wget https://raw.githubusercontent.com/xiaodaigh/JuliaDB.jl/master/ok.csv

##############################################################
# Specify the types of columns
###############################################################

fmtypes = [
    Int64,                      String,                     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]


# colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
#     "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
#     "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
#     "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
#     "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
#     "transfer_flg"];

@time jll = loadtable(
    "c:/data/perf-jld/ok.csv",
    output = "c:/data/fm.jldb/",
    delim=',',
    header_exists=true,
    #filenamecol = "filename",
    #chunks = length(ifiles),
    #type_detect_rows = 20_000,
    # colnames = colnames,
    colparsers = fmtypes,
    indexcols=["Column1"]);

I’m pretty shocked that the Julia community, which is usually pretty helpful, has left this person struggling for 6 months. Could someone from JuliaDB help this person please?


Hopefully, it gets easier. It used to be that the Fannie Mae data was behind a registration wall (free, but registration required). Now it’s available via wget from Rapids.ai (see https://docs.rapids.ai/datasets/mortgage-data), so it’s easier to make an MWE.

But the datasets are huge, and being able to handle huge datasets is the point of JuliaDB, so I think having a set of working code for Fannie Mae would be a great resource for the community.

I have submitted a few issues, so hopefully the developers will get to them at some point: https://github.com/JuliaComputing/JuliaDB.jl/issues/created_by/xiaodaigh

I admit I am not able to commit to digging through the code and debugging it, as I have other OSS commitments and a full-time (non-Julia) job.

The JuliaDB website clearly shows that it works for one set of data (which is 74 GB), so it works for something. But it’s not mature enough to have guaranteed support for other datasets.

I see that JuliaDB isn’t one of the curated packages in JuliaPro, so I assume Julia Computing isn’t putting resources into building it up at the moment. Source: https://juliacomputing.com/products/juliapro


We were hoping for some commercial traction around JuliaDB but that hasn’t materialized so until someone wants to pay for it, it’s very much on a back burner. It was updated for 1.0 and should work on recent Julias though. If anyone else wants to drive maintenance of the project, updates and fixes would certainly be welcomed.

Does that mean all the other officially supported packages are financially supported somehow?

If a Julia Computing customer who is paying for support is using a package then maintaining it is prioritized and any issues they specifically encounter are fixed right away. If no paying customer needs something fixed then it happens on a best effort basis—if and when someone has time to do it. If you’re not a paying customer then you can use it or not but there’s no maintenance guarantees for non-customers. If you or @xiaodai or @anon67531922 want to purchase JuliaSure, we would be happy to get this issue—and any others you might have—sorted out ASAP.


Wish I were in a position to do something like that. It may be worth trying to sell it to some NIH projects right now. The current director really pushes big data collection, which results in a completely reinvented database system for each project. With Julia’s ability to use AbstractArray interfaces for so many diverse data types, it presents a real opportunity for a centralized database API.

We do want to return to JuliaDB in the future, but we have to focus on what people are actually paying us for. I do still think it has great promise but it doesn’t match the usability of DataFrames right now.

The only working example of JuliaDB on large datasets is the TrueFX example:

However, I found its workflow problematic. Instead of loading so many small CSV files into JuliaDB, it is much easier and faster to use a bash command to combine these CSV files into one large CSV file and then load it into JuliaDB.
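A minimal sketch of that combining step, assuming each chunk repeats a header row (the file names here are hypothetical):

```shell
# Build two tiny CSV chunks to stand in for the real files.
printf 'id,val\n1,a\n2,b\n' > part1.csv
printf 'id,val\n3,c\n'      > part2.csv

# Keep the header from the first file only, then append data rows.
head -n 1 part1.csv > combined.csv
for f in part1.csv part2.csv; do
    tail -n +2 "$f" >> combined.csv
done
cat combined.csv
```

For the raw Fannie Mae performance files, which are pipe-delimited and have no header row (hence header_exists=false above), a plain `cat part* > combined.txt` is enough.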

However, when I try to load the combined large csv file, JuliaDB simply hangs and stops responding.

Yeah. JuliaDB chunks correspond to the number of files, so one big file would be impossible for now unless you have enough RAM to load it.