[ANN] Fread.jl - read CSVs faster with the help of R's {data.table}

xiaodai · October 9, 2019, 1:21pm

Update
You should really be using CSV.jl because it performs quite well. I now use Fread.jl only for converting data (from parquet to feather, etc.) and not for reading CSVs, as CSV.jl is actually really good.

Original content

Let’s be honest, all we care about is the speed of CSV reading. I think it was one of @jeff.bezanson’s quotes.

@quinnj has been putting in some great work to make CSV.jl pretty awesome! It’s getting pretty close to being absolutely awesome! See Refactor internals to allow better memory efficiency by quinnj · Pull Request #510 · JuliaData/CSV.jl · GitHub

However, there is no beating 10 years of fine-tuned awesomeness by the {data.table} crew!

So I am bringing the {data.table} awesomeness to Julia via Fread.jl.

using Fread
df = fread("path/to/your/file.csv")

Want to use {data.table}'s other arguments? You need to pass them explicitly as keyword arguments (arg = value). E.g.

using Fread
df = fread("path/to/your/file.csv", sep = "|", nrows = 5000)

It should be faster for reading large CSVs than all native Julia CSV readers at the moment (including CSV.jl#jq/mem3 as of 20191010 on Julia 1.3-rc3).

Here are two benchmarks

[Benchmark plots: txtplot_read, csvplot_read]

xiaodai · October 9, 2019, 2:20pm

One more benchmark

affans · October 9, 2019, 2:54pm

Just from a quick browse of the code, it seems like you use R’s fread to read the table and write it as a feather file, and then use Julia’s native Feather package to read it back in.

Just wondering if this back and forth is still faster than CSV.jl and if so, why? What makes the CSV.jl package much slower in this case?
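
The round trip described above can be sketched roughly as follows. This is not Fread.jl's actual source, only a minimal illustration assuming RCall.jl and Feather.jl are installed in Julia, and the R packages data.table and feather are installed in R. The function name `fread_via_feather` is hypothetical.

```julia
using RCall, Feather

# Hypothetical helper: let R parse the CSV, hand the result to Julia via a feather file.
function fread_via_feather(csv_path::AbstractString)
    # Temporary file that carries the table from R to Julia.
    feather_path = tempname() * ".feather"

    # R does the CSV parsing with data.table::fread and writes a feather file.
    # `$name` interpolates the Julia values into the R call.
    R"feather::write_feather(data.table::fread($csv_path), $feather_path)"

    # Julia opens the feather file; column data is memory-mapped rather than copied.
    return Feather.read(feather_path)
end
```

Calling `fread_via_feather("path/to/your/file.csv")` would then return a table backed by the temporary feather file.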

davidanthoff · October 9, 2019, 4:02pm

I think one potential issue here is that Feather.read doesn’t actually load the data from disc, it just reads the meta-data, and then the data will get loaded from disc when you actually access values. Not sure how these benchmarks were run, but potentially they didn’t include the reading of the data from disc from the feather files back into memory.
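
To illustrate the concern, here is a small sketch that uses only Julia's standard `Mmap` library as a stand-in for a feather file (it does not use Feather.jl itself, and the file layout is made up for the example). Creating the mapping is nearly free; the bytes are only read from disc when the values are touched, so a fair benchmark would need to time that second step too.

```julia
using Mmap

# Write n Float64 values to a temporary binary file (stand-in for one column on disc).
n = 1_000_000
path = tempname()
open(path, "w") do io
    write(io, collect(1.0:n))
end

# "Reading" by memory-mapping: returns almost immediately, no data has been loaded yet.
t_map = @elapsed mapped = open(path) do io
    Mmap.mmap(io, Vector{Float64}, (n,))
end

# Touching every value forces the operating system to actually page the data in.
t_touch = @elapsed total = sum(mapped)

println("time to map:   ", t_map, " s")
println("time to touch: ", t_touch, " s")
```

The exact timings depend on the machine and on whether the file is already in the operating system's cache, so treat the printed numbers as indicative only.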

xiaodai · October 9, 2019, 9:07pm

Oh yes, the data is just memory-mapped for feather. I did try running a group-by afterwards, though, and the performance is decent.

xiaodai · October 9, 2019, 9:14pm

Firstly, reading from feather doesn’t actually read the data; it just memory-maps it. Secondly, fread is very mature (10 years of development), so it’s much faster. If Julia had Arrow support, the data could be accessed more quickly without going through the feather step.

xiaodai · October 9, 2019, 9:35pm

If anyone finds an issue with the benchmark in any way, feel free to:

Try it on your own data
Suggest a way to eliminate the issues

Happy to incorporate. I think Fread.jl will die once Julia has a world-beating CSV reader, which @quinnj is working on.
