If you set fread(path, nThread = 1) I get ~4 s, while I get ~6 s with TableReader… so it's not that much worse.
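For reference, a sketch of driving that single-threaded fread call from Julia via RCall.jl (this assumes R with data.table is installed and the CSV is in the working directory):

using RCall

# data.table::fread with a single thread; sep/header match the file layout
@time R"""
library(data.table)
a <- fread("Performance_2016Q1.csv", sep = "|", header = FALSE, nThread = 1L)
"""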
I wouldn't be surprised if it's relatively easy to process chunks in threads with TableReader as well!
Last time I checked, TableReader failed on Windows.
Wow! Now TableReader is by far the best performer! But it's still a far cry from R's data.table::fread at 2.5 seconds. Perhaps adding multithreading is the key then.
# download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")
# run(`unzip -o ok.zip`)
using TableReader
path = "Performance_2016Q1.csv"
@time a = readcsv(path, delim = '|', hasheader = false); # 12~15 seconds
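On the multithreading point: parallel parsing would also require starting Julia itself with more than one thread. A quick check (general Julia usage, not specific to any of these packages):

# Launch Julia with e.g. JULIA_NUM_THREADS=4 set in the environment, then:
julia> Threads.nthreads()
4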
I tried fread and it was unbelievably fast! But I think parallel parsing (when Julia incorporates the parallel task runtime into its core) and other minor improvements will close the gap and make TableReader.jl more competitive in this CSV parser race.
Thank you for your great package. I am going to teach your package for sure!
Going back to the question of why Feather.read is faster than rget: by default the .feather file is memory-mapped, so the overhead is very low and there is no copying of the data in memory.
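A minimal sketch of the call being timed (file path assumed):

using Feather

# Feather.read memory-maps the .feather file, so columns are not
# copied into RAM up front.
@time df = Feather.read("Performance_2016Q1.feather")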
When I tried to reproduce the feather::write_feather... sequence in R, I got a warning about the bit64 package not being available, which means that 64-bit integers are displayed as weird-looking floating-point numbers. I believe this is why reading the file into Julia produces unusual values in the V1 column:
julia> a = Feather.read("/home/bates/Performance_2016Q1.feather")
6520505×31 DataFrames.DataFrame. Omitted printing of 16 columns
│ Row     │ V1           │ V2         │ V3     │ V4      │ V5        │ V6    │ V7    │ V8     │ V9      │ V10   │ V11    │ V12    │ V13     │ V14    │ V15    │
│         │ Float64      │ String     │ String │ Float64 │ Float64⍰  │ Int32 │ Int32 │ Int32⍰ │ String  │ Int32 │ String │ String │ Int32⍰  │ String │ String │
├─────────┼──────────────┼────────────┼────────┼─────────┼───────────┼───────┼───────┼────────┼─────────┼───────┼────────┼────────┼─────────┼────────┼────────┤
│ 1       │ 4.94068e-313 │ 02/01/2016 │ OTHER  │ 3.75    │ missing   │ 1     │ 359   │ 359    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 2       │ 4.94068e-313 │ 03/01/2016 │        │ 3.75    │ missing   │ 2     │ 358   │ 357    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 3       │ 4.94068e-313 │ 04/01/2016 │        │ 3.75    │ missing   │ 3     │ 357   │ 356    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 4       │ 4.94068e-313 │ 05/01/2016 │        │ 3.75    │ missing   │ 4     │ 356   │ 355    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 5       │ 4.94068e-313 │ 06/01/2016 │        │ 3.75    │ missing   │ 5     │ 355   │ 354    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 6       │ 4.94068e-313 │ 07/01/2016 │        │ 3.75    │ missing   │ 6     │ 354   │ 353    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 7       │ 4.94068e-313 │ 08/01/2016 │        │ 3.75    │ 64208.1   │ 7     │ 353   │ 352    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 8       │ 4.94068e-313 │ 09/01/2016 │        │ 3.75    │ 64107.8   │ 8     │ 352   │ 351    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
│ 9       │ 4.94068e-313 │ 10/01/2016 │        │ 3.75    │ 64006.4   │ 9     │ 351   │ 350    │ 01/2046 │ 12260 │ 0      │ N      │ missing │        │        │
⋮
│ 6520496 │ 4.94062e-312 │ 09/01/2016 │        │ 3.5     │ 2.38727e5 │ 5     │ 355   │ 345    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520497 │ 4.94062e-312 │ 10/01/2016 │        │ 3.5     │ 2.37849e5 │ 6     │ 354   │ 343    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520498 │ 4.94062e-312 │ 11/01/2016 │        │ 3.5     │ 2.36968e5 │ 7     │ 353   │ 341    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520499 │ 4.94062e-312 │ 12/01/2016 │        │ 3.5     │ 2.36087e5 │ 8     │ 352   │ 339    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520500 │ 4.94062e-312 │ 01/01/2017 │        │ 3.5     │ 2.35204e5 │ 9     │ 351   │ 336    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520501 │ 4.94062e-312 │ 02/01/2017 │        │ 3.5     │ 2.34317e5 │ 10    │ 350   │ 334    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520502 │ 4.94062e-312 │ 03/01/2017 │        │ 3.5     │ 2.33429e5 │ 11    │ 349   │ 332    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520503 │ 4.94062e-312 │ 04/01/2017 │        │ 3.5     │ 2.32537e5 │ 12    │ 348   │ 330    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520504 │ 4.94062e-312 │ 05/01/2017 │        │ 3.5     │ 2.31643e5 │ 13    │ 347   │ 328    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
│ 6520505 │ 4.94062e-312 │ 06/01/2017 │        │ 3.5     │ 2.30747e5 │ 14    │ 346   │ 326    │ 04/2046 │ 42200 │ 0      │ N      │ missing │        │        │
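Those tiny V1 values are consistent with 64-bit integer loan IDs having their bit patterns reinterpreted as Float64. A quick sanity check with a hypothetical 12-digit ID:

# Reinterpreting a 12-digit Int64 as Float64 yields a subnormal ≈ 4.94e-313,
# just like the V1 column above (the ID value here is made up).
reinterpret(Float64, Int64(100000000000))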
Also, julia takes an error exit if, for example, I try describe(a).
Can you post an issue in DataFrames for the describe error? We have tons of try-catch statements there to try to prevent any error like that.
When you say “julia takes an error exit”, you mean it crashes, right? Then it's probably not a bug in describe but in Feather.jl.
Yes, Julia crashes. I believe the problem is in Feather.read, or perhaps the feather file written by the R package is corrupt.
Actually, when I use the data I supply the full list of column types; the read speed in data.table is not that different in that case. The first column should in fact be read as a string, according to the official tutorial on Fannie Mae's website.
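With CSV.jl you can do the same; a sketch of forcing the first column to parse as a string via the types keyword (path and column index assumed):

using CSV

# Force column 1 (the loan ID) to be read as String, per the Fannie Mae tutorial
@time a = CSV.read(path, delim='|', header=0, types=Dict(1 => String))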
I tried TableReader on a slightly larger file from the Fannie Mae dataset and it took three times longer, even though the file is only 200 MB larger.
This has now been fixed, and I am happy to report that CSV.jl is now even faster than TableReader.jl on this dataset, with minimal “hand-holding”, i.e. I don't need to specify much except the delim and header:
using CSV
@time a = CSV.read(path, delim='|', header=0)
CSV.jl is also only about twice as slow as R's data.table::fread, even though fread is multi-threaded.
Note that you don't even need to specify delim='|' any more, as it's detected automatically.
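In other words, this sketch should now give the same result (same path as above):

using CSV

# No delim keyword: CSV.jl sniffs the '|' separator itself
@time a = CSV.read(path, header=0)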
I thought JuliaDB was designed for out-of-core tasks.
In the end… did you do any benchmarks with different packages or options?
I am pretty excited! CSV.jl has gotten to the point where it beats using RCall.jl and data.table::fread hands down!
I can read the Fannie Mae data a lot faster now! I can read the smallest file from Fannie Mae, which is about 500 MB in size, using CSV.read on Julia 1.3 in about 5 seconds (11 s including compilation), but RCall.jl and fread take upwards of 20 seconds (on Julia 1.2, as RCall.jl isn't working for me on 1.3). However, pure fread is just under 3 s, so it is still faster than CSV.jl, but it's at a point where I wouldn't reach for R just because of speed!
I can also load a 7 GB dataset in about 320 s with CSV.jl with threaded=true; however, it's only 42 seconds with fread. So for large datasets there is still a performance gap.
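For reference, the threaded invocation looks like this (file name hypothetical; Julia must be started with multiple threads):

using CSV

# threaded=true turns on CSV.jl's multi-threaded parsing
@time big = CSV.read("large_7gb_file.csv", threaded=true)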
I tested reading every file of Fannie Mae's data (from the Fannie Mae Single-Family Loan Performance Data page; you need to register to download, but this is one of the best open data sources).
These are the timings I got from CSV.read vs data.table::fread on my computer with 64 GB RAM. I only recorded the timings once, but the point here is not absolute precision.
I have to say Julia CSV parsing is much better now than before, thanks to @quinnj's great contributions! Even though data.table::fread is still better for the Fannie Mae case, I think it's gotten to the point where I wouldn't reach for R straight away. The type inference and auto-detection of the delimiter are pretty awesome in CSV.jl at this time.
To put things into perspective, let's consider one of the most popular CSV readers in the R-sphere, readr::read_csv. It can't detect the delimiter (duh, they would say, CSV means comma-separated). Also, it is roughly 2x slower than CSV.jl! Every time I see a post/tweet recommending readr::read_csv, I die a little. Clearly, data.table::fread is the one to beat! So I've been spamming the various CSV readers in the Julia-verse with my comments about Fannie Mae and fread. In Python, pandas.read_csv is fairly competent, but it can't detect the delimiter correctly and is slower than CSV.jl. Another new thing in Python-land is pyarrow, which is quite fast, but if you convert the pyarrow.Table to a pandas DataFrame then it's still slower than CSV.read, which reads the data and converts it to a DataFrame.
Curiously, I can get TextParse.jl to read the CSVs but not JuliaDB.jl.
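For reference, a minimal TextParse.jl sketch for this file (assuming csvread's usual keyword for headerless files):

using TextParse

# csvread returns a tuple of column vectors plus the column names
cols, colnames = csvread("Performance_2016Q1.csv", '|', header_exists = false)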
Wow. This no longer works… Gotta report a bug.
Are the axes inverted? Shouldn't the Y-axis be time and the X-axis file size?
You need to use capital letters for the format specifier, and delim is not a keyword argument; load(File(format"CSV", file_path), '|', type_detect_rows = 1000) is the correct syntax here.
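Putting that together as a runnable sketch (assuming CSVFiles.jl registers the CSV loader and DataFrames as the sink):

using CSVFiles, FileIO, DataFrames

# format"CSV" is case-sensitive, and the delimiter is positional, not a keyword
df = load(File(format"CSV", "Performance_2016Q1.csv"), '|', type_detect_rows = 1000) |> DataFrame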
Yeah