My experiences reading CSVs from the Fannie Mae datasets

If I set fread(path, nThread = 1) I get ~4s, while I get ~6s with TableReader… so not that much worse :wink:
I wouldn't be surprised if it's relatively easy to process chunks in threads with TableReader as well!
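For anyone wanting to reproduce the comparison, here is a minimal sketch via RCall (assuming the unzipped Performance_2016Q1.csv is in the working directory; first runs include compilation time):

using RCall, TableReader
path = "Performance_2016Q1.csv"
# pin fread to a single thread so the comparison is like-for-like
@time rcopy(R"data.table::fread($path, sep = '|', header = FALSE, nThread = 1L)")
@time readcsv(path, delim = '|', hasheader = false)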

Last time I checked, TableReader failed on Windows.

Wow! Now TableReader is by far the best performer! But it's still a far cry from R's data.table::fread at 2.5 seconds. Perhaps adding multithreading is the key, then.

# download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")
# run(`unzip -o ok.zip`)
using TableReader
path = "Performance_2016Q1.csv"
@time a = readcsv(path, delim = '|', hasheader = false); # 12~15 seconds

I tried fread and it was unbelievably fast! But I think parallel parsing (when Julia incorporates the parallel task runtime into its core) and other minor improvements will close the gap and make TableReader.jl more competitive in this CSV parser race.
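As a rough illustration of what task-level parallelism could look like today, here is a hedged sketch that parallelizes over whole files rather than chunks (the quarterly file names are hypothetical, and this assumes readcsv is safe to call from multiple threads):

using TableReader
paths = ["Performance_2016Q$(q).csv" for q in 1:4]  # hypothetical quarterly files
tables = Vector{Any}(undef, length(paths))
Threads.@threads for i in eachindex(paths)
    # each thread parses one whole file; true chunk-level parallelism
    # would need support inside the parser itself
    tables[i] = readcsv(paths[i], delim = '|', hasheader = false)
end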


Thank you for your great package. I am going to teach your package for sure!


Going back to the question of why Feather.read is faster than @rget: by default the .feather file is memory-mapped, so the overhead is very low and there is no copying of the data in memory.
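To see why memory-mapping is cheap, here is the low-level idea in base Julia (a sketch only; Feather.jl handles this internally):

using Mmap
io = open("/home/bates/Performance_2016Q1.feather")
bytes = Mmap.mmap(io)  # a Vector{UInt8} backed by the file on disk;
                       # pages are faulted in lazily, so nothing is
                       # copied up front
close(io)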

When I tried to reproduce the feather::write_feather... sequence in R, I got a warning about not having the bit64 package available, which means that 64-bit integers are displayed as weird-looking floating-point numbers. I believe this is why reading the file into Julia produces unusual values in the V1 column (a sketch of the R-side sequence is shown after the output below).

julia> a = Feather.read("/home/bates/Performance_2016Q1.feather")
6520505×31 DataFrames.DataFrame. Omitted printing of 16 columns
│ Row     │ V1           │ V2         │ V3     │ V4      │ V5        │ V6    │ V7    │ V8     │ V9      │ V10   │ V11    │ V12    │ V13     │ V14    │ V15    │
│         │ Float64      │ String     │ String │ Float64 │ Float64⍰  │ Int32 │ Int32 │ Int32⍰ │ String  │ Int32 │ String │ String │ Int32⍰  │ String │ String │
├─────────┼──────────────┼────────────┼────────┼─────────┼───────────┼───────┼───────┼────────┼─────────┼───────┼────────┼────────┼─────────┼────────┼────────┤
│ 1       │ 4.94068e-313 │ 02/01/2016 │ OTHER  │ 3.75    │ missing   │ 1     │ 359   │ 359    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 2       │ 4.94068e-313 │ 03/01/2016 │        │ 3.75    │ missing   │ 2     │ 358   │ 357    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 3       │ 4.94068e-313 │ 04/01/2016 │        │ 3.75    │ missing   │ 3     │ 357   │ 356    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 4       │ 4.94068e-313 │ 05/01/2016 │        │ 3.75    │ missing   │ 4     │ 356   │ 355    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 5       │ 4.94068e-313 │ 06/01/2016 │        │ 3.75    │ missing   │ 5     │ 355   │ 354    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 6       │ 4.94068e-313 │ 07/01/2016 │        │ 3.75    │ missing   │ 6     │ 354   │ 353    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 7       │ 4.94068e-313 │ 08/01/2016 │        │ 3.75    │ 64208.1   │ 7     │ 353   │ 352    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 8       │ 4.94068e-313 │ 09/01/2016 │        │ 3.75    │ 64107.8   │ 8     │ 352   │ 351    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 9       │ 4.94068e-313 │ 10/01/2016 │        │ 3.75    │ 64006.4   │ 9     │ 351   │ 350    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
⋮
│ 6520496 │ 4.94062e-312 │ 09/01/2016 │        │ 3.5     │ 2.38727e5 │ 5     │ 355   │ 345    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520497 │ 4.94062e-312 │ 10/01/2016 │        │ 3.5     │ 2.37849e5 │ 6     │ 354   │ 343    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520498 │ 4.94062e-312 │ 11/01/2016 │        │ 3.5     │ 2.36968e5 │ 7     │ 353   │ 341    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520499 │ 4.94062e-312 │ 12/01/2016 │        │ 3.5     │ 2.36087e5 │ 8     │ 352   │ 339    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520500 │ 4.94062e-312 │ 01/01/2017 │        │ 3.5     │ 2.35204e5 │ 9     │ 351   │ 336    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520501 │ 4.94062e-312 │ 02/01/2017 │        │ 3.5     │ 2.34317e5 │ 10    │ 350   │ 334    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520502 │ 4.94062e-312 │ 03/01/2017 │        │ 3.5     │ 2.33429e5 │ 11    │ 349   │ 332    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520503 │ 4.94062e-312 │ 04/01/2017 │        │ 3.5     │ 2.32537e5 │ 12    │ 348   │ 330    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520504 │ 4.94062e-312 │ 05/01/2017 │        │ 3.5     │ 2.31643e5 │ 13    │ 347   │ 328    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520505 │ 4.94062e-312 │ 06/01/2017 │        │ 3.5     │ 2.30747e5 │ 14    │ 346   │ 326    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │

Also, Julia takes an error exit if, for example, I try describe(a).
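For reference, here is the R-side sequence I was reproducing, sketched via RCall with bit64 loaded (the fread arguments and file names here are assumptions):

using RCall
R"""
library(bit64)       # without bit64, fread's integer64 columns display as garbage floats
library(data.table)
library(feather)
dt <- fread("Performance_2016Q1.csv", sep = "|", header = FALSE)
write_feather(dt, "Performance_2016Q1.feather")
"""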

Can you post an issue in DataFrames for the describe error? We have tons of try-catch statements there to try to prevent any error like that.

When you say "Julia takes an error exit", you mean it crashes, right? Then it's probably not a bug in describe but in Feather.jl.

Yes, Julia crashes. I believe the problem is in Feather.read, or perhaps the feather file written by the R package is corrupt.

Actually, when I use the data I supply the full list of column types; the read speed in data.table is not that different in that case. Also, the first column should be read as a string, according to the official tutorial on Fannie Mae's website.
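For example, with CSV.jl you can force the first column to be parsed as a string via the types keyword while leaving the rest to inference (a sketch; the full Fannie Mae schema has 31 columns):

using CSV
# force V1 (the loan identifier) to String; other columns are still inferred
a = CSV.read(path, delim = '|', header = 0, types = Dict(1 => String))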

I tried TableReader on a slightly larger file from the Fannie Mae dataset and it took 3 times longer, even though the file is only 200 MB larger. :sob:

This has now been fixed and I am happy to report that CSV.jl is even faster than TableReader.jl on this dataset, with minimal "hand-holding", i.e. I don't need to specify much beyond the delim and header:

using CSV
@time a = CSV.read(path, delim='|', header=0)

CSV.jl is also only about twice as slow as R's data.table::fread, even though fread is multi-threaded.


Note that you don't even need to specify the delim='|' any more, as it's detected automatically :smile:
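So the call reduces to this (the delimiter is sniffed from the first few rows):

using CSV
@time a = CSV.read(path, header=0)  # '|' is detected automatically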


I thought JuliaDB was designed for out-of-core tasks.

In the end… did you run any benchmarks with different packages or options?

I am pretty excited! CSV.jl has gotten to the point where it beats using RCall.jl and data.table::fread hands down!

I can read the Fannie Mae data a lot faster now! I can read the smallest file from Fannie Mae, which is about 500 MB in size, using CSV.read on Julia 1.3 in about 5 seconds (11s including compilation), but RCall.jl with fread takes upwards of 20 seconds (on Julia 1.2, as RCall.jl isn't working for me on 1.3).

However, pure fread is just under 3s, so it is still faster than CSV.jl, but it's at the point where I wouldn't reach for R just because of speed!

I can also load a 7 GB dataset in about 320s with CSV.jl with threaded=true; however, it's only 42 seconds with fread. So for large datasets there is still a performance gap.
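For reference, the threaded call is just the following (the file name here is a placeholder; threaded=true is the multi-threading switch in the CSV.jl version I used):

using CSV
# spawns parsing tasks across the available Julia threads
@time big = CSV.read("large_7gb_file.csv", delim='|', header=0, threaded=true)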


I tested reading every file of Fannie Mae's data (from the Fannie Mae Single-Family Loan Performance Data page; you need to register to download, but this is one of the best open data sources).

These are the timings I got from CSV.read vs data.table::fread on my computer with 64 GB of RAM. I only recorded the timings once, but the point here is not absolute precision.

[figure: CSV.read vs data.table::fread timings across the Fannie Mae files, plotted against file size]

I have to say Julia CSV parsing is way better now vs before! Thanks to @quinnj's great contributions! Even though data.table::fread is still better for the Fannie Mae case, I think it's gotten to the point where I wouldn't reach for R straight away. The type inference and automatic delimiter detection in CSV.jl are pretty awesome at this point.

To put things into perspective, let's consider one of the most popular CSV readers in the R-sphere, readr::read_csv. It can't detect the delimiter (duh, they would say, CSV means comma-separated), and it is roughly 2x slower than CSV.jl! Every time I see a post/tweet recommending readr::read_csv, I die a little. Clearly, data.table::fread is the one to beat! So I've been spamming the various CSV readers in the Julia-verse with my comments about Fannie Mae and fread.

In Python, pandas.read_csv is fairly competent but can't detect the delimiter correctly, and it is slower than CSV.jl. Another new thing in Python-land is pyarrow, which is quite fast, but if you convert the pyarrow.Table to a pandas DataFrame it's still slower than CSV.read, which both reads the data and converts it to a DataFrame.

Curiously, I can get TextParse.jl to read the CSVs but not JuliaDB.jl.


Wow. This no longer works… Gotta report a bug.

Are the axes inverted? Shouldn't the Y-axis be time and the X-axis file size?


You need to use capital letters for the format specifier, and delim is not a keyword argument. load(File(format"CSV", file_path), '|', type_detect_rows = 1000) is the correct syntax here.
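Putting it together, a self-contained sketch (assuming CSVFiles.jl with FileIO, and converting to a DataFrame at the end; the file path is a placeholder):

using CSVFiles, FileIO, DataFrames
file_path = "Performance_2016Q1.csv"
# positional '|' is the delimiter; type_detect_rows sets how many rows
# are sampled for type inference
df = DataFrame(load(File(format"CSV", file_path), '|', type_detect_rows = 1000))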


Yeah