[ANN] Fread.jl - read CSVs faster with the help of R's {data.table}

Update
You should really be using CSV.jl because it performs quite well. I now only use Fread.jl for converting data between formats (e.g. from Parquet to Feather), not for reading CSVs, as CSV.jl is actually really good.

Original content

Let’s be honest, all we care about is the speed of CSV reading. I think it was one of @jeff.bezanson’s quotes.

@quinnj has been putting in some great work to make CSV.jl pretty awesome! It’s getting pretty close to being absolutely awesome! See Refactor internals to allow better memory efficiency by quinnj · Pull Request #510 · JuliaData/CSV.jl · GitHub

However, there is no beating 10 years of fine-tuned awesomeness by the {data.table} crew!

So I am bringing the {data.table} awesomeness to Julia via Fread.jl

using Fread
df = fread("path/to/your/file.csv")

Want to use {data.table}'s other arguments? You need to pass them explicitly by name, as arg = value. E.g.

using Fread
df = fread("path/to/your/file.csv", sep = "|", nrows = 5000)

It should be faster for reading large CSVs than all native Julia CSV readers at the moment (including CSV.jl#jq/mem3 as of 20191010 on Julia 1.3-rc3).

Here are two benchmarks

[Benchmark plots: txtplot_read, csvplot_read]


One more benchmark

[Benchmark plot image]


Just from a quick browse of the code, it seems like you use R’s fread to read the table and write it as a feather file, and then use Julia’s native Feather package to read it back in.
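For anyone following along, here is a minimal sketch of that round trip as I understand it. It is not Fread.jl's actual source: the function name `fread_via_feather` is made up for illustration, and it assumes RCall.jl and Feather.jl are installed in Julia and that the {data.table} and {feather} packages are installed in R.

```julia
using RCall    # assumption: R plus the {data.table} and {feather} packages are installed
using Feather  # Julia's native Feather reader

# Hypothetical helper, not Fread.jl's real implementation.
function fread_via_feather(csv_path::AbstractString)
    # Intermediate file that R writes and Julia reads back.
    feather_path = tempname() * ".feather"

    # R parses the CSV with data.table::fread and writes it out as Feather.
    # RCall interpolates the Julia strings into the R call.
    R"feather::write_feather(data.table::fread($csv_path), $feather_path)"

    # Feather.read memory-maps the file, so the columns are not copied here.
    return Feather.read(feather_path)
end
```

Something like `df = fread_via_feather("path/to/your/file.csv")` would then return a DataFrame backed by the temporary Feather file.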

Just wondering if this back and forth is still faster than CSV.jl and if so, why? What makes the CSV.jl package much slower in this case?

I think one potential issue here is that Feather.read doesn’t actually load the data from disc, it just reads the meta-data, and then the data will get loaded from disc when you actually access values. Not sure how these benchmarks were run, but potentially they didn’t include the reading of the data from disc from the feather files back into memory.
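A rough way to check this on a given file is to time the lazy read against a full load. This is only a sketch: `time_feather_load` is a made-up helper name, it assumes a Feather file already exists at the given path, and the first call will also include compilation time, so it is worth running twice.

```julia
using Feather

# Hypothetical helper: returns timings in seconds for the lazy and the full load.
function time_feather_load(feather_path::AbstractString)
    # Feather.read memory-maps the file, so this mostly measures reading metadata.
    lazy_seconds = @elapsed Feather.read(feather_path)

    # Feather.materialize copies every column into ordinary Julia arrays,
    # so this includes actually pulling the data into memory.
    full_seconds = @elapsed Feather.materialize(feather_path)

    return (lazy = lazy_seconds, full = full_seconds)
end
```

If the `full` timing is much larger than the `lazy` one, a benchmark that stops at `Feather.read` is leaving out most of the loading cost.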


Oh yes, the data is just memory-mapped for Feather. I did try running a group-by afterwards, though, and the performance is decent.

Firstly, reading from Feather doesn’t actually read the data. It just maps it. Secondly, fread is very mature (10 years of development), so it’s much faster. If Julia had Arrow support, the data could be accessed more quickly without going through the Feather step.


If anyone finds an issue with the benchmark in any way, feel free to:

  1. Try it on your own data
  2. Suggest a way to eliminate the issues

Happy to incorporate any suggestions. I think Fread.jl will die once Julia has a world-beating CSV reader, which @quinnj is working on.
