This is huge!
For non-Germans: it is about recent performance boosts in Julia 1.4 and refers mainly to the following blog post.
Very impressive results, especially the multi-threaded scaling.
Still… I can’t help but think the title could have been a bit more fair to R and Python. A 10-20x advantage isn’t a fair summary of the results.
Somewhat true, as it depends strongly on the use case and on whether threads are used.
I am not sure if the average data analyst, who uses R now, will see any performance increase while playing around with Julia's CSV.jl.
I can’t help thinking that this framing undersells Julia — the real story here is that pure Julia CSV reading beats highly optimized C libraries that are popular to call from Python and R.
(Though this is explained in the second paragraph.)
This is emphasized in the German article (I don’t know if you can read it, so I’m just mentioning it here).
I can’t tell the R and Julia colours apart.
The Fannie Mae Performance dataset contains the largest files. I wish that had been tested instead of just the Acquisition files.
I wish absolute timings had been reported somewhere.
What is great is that the differences become apparent even for smaller thread counts. Sure, some people might be able to spin up 16+ threads, but most consumers will not have these high performance machines, and instead will have 4/6/8 threads.
Also ditto that the color scheme chosen for those plots is absolutely terrible.
To be fair, sometimes fread is still faster than CSV.jl on my machine (6 cores). But CSV.jl has gotten to the point where I think it’s viable to do data manipulation in Julia instead.
This post was great! The one thing I wish it had was some memory analysis. I assume Julia uses more, but I would love to be wrong.
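A quick, rough way to check this on the Julia side (a sketch using a tiny generated file, so the file name and sizes are stand-ins; the real benchmark files would of course behave differently):

```julia
using CSV, DataFrames

# Generate a small stand-in CSV file
write("mem_test.csv", "a,b\n" * join(("$(i),x$(i)" for i in 1:1000), "\n"))

df = CSV.File("mem_test.csv") |> DataFrame  # warm-up read, includes compilation
# @allocated reports bytes allocated by a second, already-compiled read
bytes = @allocated CSV.File("mem_test.csv") |> DataFrame
println("allocated ≈ $(bytes) bytes")
# Base.summarysize estimates the resident size of the resulting DataFrame
println("DataFrame size ≈ $(Base.summarysize(df)) bytes")
```

Note that `@allocated` measures transient allocation during parsing, not peak resident memory, so it only gives a lower-bound feel for the comparison.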
Did the benchmark measure reading the csv into a DataFrame, i.e.
CSV.File("test_data.csv") |> DataFrame?
However, the first time executing this command takes quite long for me due to compilation time.
I am not sure in which cases recompilation is required; probably when the column types of the CSV files differ?
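One way to probe this empirically (a sketch; the file names are hypothetical and the tiny files are generated on the fly, so only the relative timings are meaningful):

```julia
using CSV, DataFrames

# Two tiny files with different column types
write("ints.csv", "a,b\n1,2\n3,4\n")
write("mixed.csv", "a,b\n1,x\n2,y\n")

@time CSV.File("ints.csv") |> DataFrame   # first read: pays compilation cost
@time CSV.File("ints.csv") |> DataFrame   # same schema again: fast
@time CSV.File("mixed.csv") |> DataFrame  # different column types: watch whether fresh compilation shows up
```

If the third read is noticeably slower than the second, that would suggest some per-schema specialization is happening.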
The main branch of CSV works quite well for me.
Most benchmarks I’ve seen before showed that data.table was much faster.
Has the situation changed that much, or did they just use different tests?
My test case is:

using CSV, DataFrames, PyCall, BenchmarkTools
pd = pyimport("pandas")
download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv", "test_data.csv")
@time df = CSV.File("test_data.csv") |> DataFrame  # including compilation
@btime df = CSV.File("test_data.csv") |> DataFrame
@btime pydf = pd.read_csv("test_data.csv")
The first CSV read takes 22 s for me (on a fairly weak machine), but each consecutive read only 400 ms.
For comparison, Pandas takes 2s.
I tested on Julia 1.4.1 using the most recent versions of CSV.jl and DataFrames.jl.
I am still under the impression that fread is faster (in terms of “feel”). Let me explain. First, loading library(data.table) is much faster than using CSV, DataFrames. That makes sense, since one loads compiled libraries while the other has to either precompile or compile at first run.
The second reason I think fread is faster is simply the printing of the results back in the console. I don’t know why, but loading a big dataframe in the REPL is often “laggy” and “choppy” for me, while fread reads and prints almost instantly. Maybe I am just being pedantic.
I like that one of the comments, if my German is correct, says that he doesn’t care about comparisons to R and Python and wants to see comparisons between Julia and FORTRAN. I wonder how many data analysts really use FORTRAN knowingly and on purpose?
This is very true. I wish we could talk up the fact that you can do this in Julia all the way. In my experience, people are never convinced with that argument as much as they are convinced by benchmark numbers.
I always think @jeff.bezanson’s line from a while ago sums it up: come for the performance, stay for the experience.
People are always moving the bar. Of course it’s a silly comment since the R and Python CSV parsers are written in C and there’s no reason Fortran would be faster. At this point, it’s entirely possible that CSV.jl is the fastest overall CSV parser in existence.
This is a great first step. Hopefully dataframes will be as fast as data.table in the future.
I think the state-of-the-art CSV reader in Python these days is pyarrow, not pandas. It is also way faster than fread from R. I believe the team that originally created pandas long ago moved all their efforts over to the arrow/pyarrow project. I’ve been running a fairly comprehensive CSV benchmarking comparison for a couple of years here, and the parallel version of pyarrow is the thing to beat these days.
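For anyone who wants to include pyarrow in a comparison like the snippet upthread, it can be called the same way via PyCall (a sketch; it assumes pyarrow is installed in the linked Python environment and reuses the test_data.csv file downloaded above):

```julia
using PyCall

pa_csv = pyimport("pyarrow.csv")          # pyarrow's CSV module
table = pa_csv.read_csv("test_data.csv")  # multithreaded by default
df = table.to_pandas()                    # convert to a pandas DataFrame if needed
```

That keeps all three readers (CSV.jl, pandas, pyarrow) timeable from the same Julia session with @btime.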