If you set fread(path, nThread = 1) I get ~4 s, while I get ~6 s with TableReader… so it's not that much worse.
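For reference, a sketch of driving that single-threaded fread call from Julia via RCall.jl (this assumes R with data.table is installed and the CSV is in the working directory):

using RCall

# data.table::fread with a single thread; sep/header match the file layout
@time R"""
library(data.table)
a <- fread("Performance_2016Q1.csv", sep = "|", header = FALSE, nThread = 1L)
"""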
I wouldn't be surprised if it's relatively easy to process chunks in threads with TableReader as well!
Last time I checked, TableReader failed on Windows.
Wow! Now TableReader is by far the best performer! But it's still a far cry from R's data.table::fread at 2.5 seconds. Perhaps adding multithreading is the key then.
# download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")
# run(`unzip -o ok.zip`)
using TableReader
path = "Performance_2016Q1.csv"
@time a = readcsv(path, delim = '|', hasheader = false); # 12~15 seconds
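On the multithreading point: parallel parsing would also require starting Julia itself with more than one thread. A quick check (general Julia usage, not specific to any of these packages):

# Launch Julia with e.g. JULIA_NUM_THREADS=4 set in the environment, then:
julia> Threads.nthreads()
4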
I tried fread and it was unbelievably fast! But I think parallel parsing (when Julia incorporates the parallel task runtime into its core) and other minor improvements will close the gap and make TableReader.jl more competitive in this CSV parser race.
Thank you for your great package. I am going to teach your package for sure!
Going back to the question of why Feather.read is faster than rget: by default the .feather file is memory-mapped, so the overhead is very low and there is no copying of the data in memory.
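A minimal sketch of the call being timed (file path assumed):

using Feather

# Feather.read memory-maps the .feather file, so columns are not
# copied into RAM up front.
@time df = Feather.read("Performance_2016Q1.feather")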
When I tried to reproduce the feather::write_feather... sequence in R, I got a warning about the bit64 package not being available, which means that 64-bit integers are displayed as weird-looking floating-point numbers. I believe this is why reading the file into Julia produces unusual values in the V1 column:
julia> a = Feather.read("/home/bates/Performance_2016Q1.feather")
6520505×31 DataFrames.DataFrame. Omitted printing of 16 columns
│ Row     │ V1           │ V2         │ V3     │ V4      │ V5        │ V6    │ V7    │ V8     │ V9      │ V10   │ V11    │ V12    │ V13     │ V14    │ V15    │
│         │ Float64      │ String     │ String │ Float64 │ Float64⍰  │ Int32 │ Int32 │ Int32⍰ │ String  │ Int32 │ String │ String │ Int32⍰  │ String │ String │
├─────────┼──────────────┼────────────┼────────┼─────────┼───────────┼───────┼───────┼────────┼─────────┼───────┼────────┼────────┼─────────┼────────┼────────┤
│ 1       │ 4.94068e-313 │ 02/01/2016 │ OTHER  │ 3.75    │ missing   │ 1     │ 359   │ 359    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 2       │ 4.94068e-313 │ 03/01/2016 │        │ 3.75    │ missing   │ 2     │ 358   │ 357    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 3       │ 4.94068e-313 │ 04/01/2016 │        │ 3.75    │ missing   │ 3     │ 357   │ 356    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 4       │ 4.94068e-313 │ 05/01/2016 │        │ 3.75    │ missing   │ 4     │ 356   │ 355    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 5       │ 4.94068e-313 │ 06/01/2016 │        │ 3.75    │ missing   │ 5     │ 355   │ 354    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 6       │ 4.94068e-313 │ 07/01/2016 │        │ 3.75    │ missing   │ 6     │ 354   │ 353    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 7       │ 4.94068e-313 │ 08/01/2016 │        │ 3.75    │ 64208.1   │ 7     │ 353   │ 352    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 8       │ 4.94068e-313 │ 09/01/2016 │        │ 3.75    │ 64107.8   │ 8     │ 352   │ 351    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 9       │ 4.94068e-313 │ 10/01/2016 │        │ 3.75    │ 64006.4   │ 9     │ 351   │ 350    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
⋮
│ 6520496 │ 4.94062e-312 │ 09/01/2016 │        │ 3.5     │ 2.38727e5 │ 5     │ 355   │ 345    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520497 │ 4.94062e-312 │ 10/01/2016 │        │ 3.5     │ 2.37849e5 │ 6     │ 354   │ 343    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520498 │ 4.94062e-312 │ 11/01/2016 │        │ 3.5     │ 2.36968e5 │ 7     │ 353   │ 341    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520499 │ 4.94062e-312 │ 12/01/2016 │        │ 3.5     │ 2.36087e5 │ 8     │ 352   │ 339    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520500 │ 4.94062e-312 │ 01/01/2017 │        │ 3.5     │ 2.35204e5 │ 9     │ 351   │ 336    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520501 │ 4.94062e-312 │ 02/01/2017 │        │ 3.5     │ 2.34317e5 │ 10    │ 350   │ 334    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520502 │ 4.94062e-312 │ 03/01/2017 │        │ 3.5     │ 2.33429e5 │ 11    │ 349   │ 332    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520503 │ 4.94062e-312 │ 04/01/2017 │        │ 3.5     │ 2.32537e5 │ 12    │ 348   │ 330    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520504 │ 4.94062e-312 │ 05/01/2017 │        │ 3.5     │ 2.31643e5 │ 13    │ 347   │ 328    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520505 │ 4.94062e-312 │ 06/01/2017 │        │ 3.5     │ 2.30747e5 │ 14    │ 346   │ 326    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
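Those tiny V1 values are consistent with 64-bit integer loan IDs having their bit patterns reinterpreted as Float64. A quick sanity check with a hypothetical 12-digit ID:

# Reinterpreting a 12-digit Int64 as Float64 yields a subnormal ≈ 4.94e-313,
# just like the V1 column above (the ID value here is made up).
reinterpret(Float64, Int64(100000000000))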
Also, julia takes an error exit if, for example, I try describe(a).
Can you post an issue in DataFrames for the describe error? We have tons of try-catch statements there to try to prevent any error like that.
When you say “julia takes an error exit”, you mean it crashes, right? Then it's probably not a bug in describe but in Feather.jl.
Yes, Julia crashes. I believe the problem is in Feather.read, or perhaps the feather file written by the R package is corrupt.
Actually, when I use the data I supply the full list of column types; the read speed in data.table is not that different in that case. The first column should in fact be read as a string, according to the official tutorial on Fannie Mae's website.
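With CSV.jl you can do the same; a sketch of forcing the first column to parse as a string via the types keyword (path and column index assumed):

using CSV

# Force column 1 (the loan ID) to be read as String, per the Fannie Mae tutorial
@time a = CSV.read(path, delim='|', header=0, types=Dict(1 => String))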
I tried TableReader on a slightly larger file from the Fannie Mae dataset and it took three times longer, even though the file is only 200 MB larger.
This has now been fixed, and I am happy to report that CSV.jl is now even faster than TableReader.jl on this dataset, with minimal “hand-holding”, i.e. I don't need to specify much except the delim and header:
using CSV
@time a = CSV.read(path, delim='|', header=0)
CSV.jl is also only about twice as slow as R's data.table::fread, even though fread is multi-threaded.
Note that you don't even need to specify delim='|' any more, as it's detected automatically.
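In other words, this sketch should now give the same result (same path as above):

using CSV

# No delim keyword: CSV.jl sniffs the '|' separator itself
@time a = CSV.read(path, header=0)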
I thought JuliaDB was designed for out-of-core tasks.
In the end… did you do any benchmarks with different packages or options?
I am pretty excited! CSV.jl has gotten to the point where it beats using RCall.jl and data.table::fread hands down!
I can read the Fannie Mae data a lot faster now! I can read the smallest file from Fannie Mae, which is about 500 MB in size, using CSV.read on Julia 1.3 in about 5 seconds (11 s including compilation), but RCall.jl and fread take upwards of 20 seconds (on Julia 1.2, as RCall.jl isn't working for me on 1.3). However, pure fread is just under 3 s, so it is still faster than CSV.jl, but it's at a point where I wouldn't reach for R just because of speed!
I can also load a 7 GB dataset in about 320 s with CSV.jl with threaded=true; however, it's only 42 seconds with fread. So for large datasets there is still a performance gap.
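For reference, the threaded invocation looks like this (file name hypothetical; Julia must be started with multiple threads):

using CSV

# threaded=true turns on CSV.jl's multi-threaded parsing
@time big = CSV.read("large_7gb_file.csv", threaded=true)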
I tested reading every file of Fannie Mae's data (from the Fannie Mae Single-Family Loan Performance Data page; you need to register to download, but this is one of the best open data sources).
These are the timings I got from CSV.read vs data.table::fread on my computer with 64 GB RAM. I only recorded the timings once, but the point here is not absolute precision.
I have to say Julia CSV parsing is much better now than before, thanks to @quinnj's great contributions! Even though data.table::fread is still better for the Fannie Mae case, I think it's gotten to the point where I wouldn't reach for R straight away. The type inference and auto-detection of the delimiter are pretty awesome in CSV.jl at this time.
To put things into perspective, let's consider one of the most popular CSV readers in the R-sphere, readr::read_csv. It can't detect the delimiter (duh, they would say, CSV means comma-separated). Also, it is roughly 2x slower than CSV.jl! Every time I see a post/tweet recommending readr::read_csv, I die a little. Clearly, data.table::fread is the one to beat! So I've been spamming the various CSV readers in the Julia-verse with my comments about Fannie Mae and fread. In Python, pandas.read_csv is fairly competent, but it can't detect the delimiter correctly and is slower than CSV.jl. Another new thing in Python-land is pyarrow, which is quite fast, but if you convert the pyarrow.Table to a pandas DataFrame then it's still slower than CSV.read, which reads the data and converts it to a DataFrame.
Curiously, I can get TextParse.jl to read the CSVs but not JuliaDB.jl.
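For reference, a minimal TextParse.jl sketch for this file (assuming csvread's usual keyword for headerless files):

using TextParse

# csvread returns a tuple of column vectors plus the column names
cols, colnames = csvread("Performance_2016Q1.csv", '|', header_exists = false)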
Wow. This no longer works… Gotta report a bug.
Are the axes inverted? Shouldn't the Y-axis be time and the X-axis file size?
You need to use capital letters for the format specifier, and delim is not a keyword argument; load(File(format"CSV", file_path), '|', type_detect_rows = 1000) is the correct syntax here.
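Putting that together as a runnable sketch (assuming CSVFiles.jl registers the CSV loader and DataFrames as the sink):

using CSVFiles, FileIO, DataFrames

# format"CSV" is case-sensitive, and the delimiter is positional, not a keyword
df = load(File(format"CSV", "Performance_2016Q1.csv"), '|', type_detect_rows = 1000) |> DataFrame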
Yeah