This is huge!
For non-German speakers: it is about the recent performance boosts in Julia 1.4 and refers mainly to the following blog post:
CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R - JuliaHub (blog url updated)
Very impressive results, especially the multi-threaded scaling.
Still… I can’t help but think the title could have been a bit more fair to R and Python. A 10-20x advantage isn’t a fair summary of the results.
That is somewhat true, as it depends strongly on the use case and on how many threads are used.
I am not sure whether the average data analyst who uses R today will see any performance increase while playing around with Julia's CSV.jl.
I can’t help thinking that this framing undersells Julia — the real story here is that pure Julia CSV reading beats highly optimized C libraries that are popular to call from Python and R.
(Though this is explained in the second paragraph.)
This is emphasized in the German article (I don't know if you can read it, so I am just telling you).
I can’t tell the R and Julia colours apart.
The Fannie Mae perf dataset contains the largest files. I wish that was tested instead of just testing the Acquisition files.
I wish absolute timings had been reported somewhere.
What is great is that the differences become apparent even for smaller thread counts. Sure, some people might be able to spin up 16+ threads, but most consumers will not have these high performance machines, and instead will have 4/6/8 threads.
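For anyone unsure how many threads their own session actually has, here is a minimal sketch using only Julia's standard library. It assumes the thread count was set before startup (for example through the JULIA_NUM_THREADS environment variable); whether CSV.jl makes use of all of them depends on the CSV.jl version and the file, which this sketch does not check.

```julia
using Base.Threads

# Threads available to this Julia session (1 by default unless configured at startup)
session_threads = nthreads()

# Logical CPU threads reported by the operating system
hardware_threads = Sys.CPU_THREADS

println("Julia session threads: ", session_threads)
println("Logical CPU threads:   ", hardware_threads)
```

If the first number is 1, multi-threaded parsing cannot help, no matter how many cores the machine has.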
Also ditto that the color scheme chosen for those plots is absolutely terrible.
To be fair, fread is sometimes still faster than CSV.jl on my machine (6 cores). But CSV.jl has gotten to the point where I think it’s viable to do data manipulation in Julia instead.
This post was great! The one thing I wish it had was some memory analysis. I assume Julia uses more, but I would love to be wrong.
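One rough way to check the memory side yourself, as a sketch rather than a rigorous analysis: @allocated reports the bytes allocated during a call (including temporaries that are freed afterwards), and Base.summarysize estimates the size of the resulting object. The generated file, its columns, and the helper name read_df are made up for illustration and are not part of the benchmark in the blog post.

```julia
using CSV, DataFrames

# Write a small synthetic CSV file (hypothetical data, for illustration only)
path = tempname() * ".csv"
open(path, "w") do io
    println(io, "id,value,label")
    for i in 1:10_000
        println(io, i, ",", i * 0.5, ",row", i)
    end
end

# Hypothetical helper: read the file into a DataFrame
read_df(p) = CSV.File(p) |> DataFrame

# Warm-up call so that compilation is not counted in the measurement
df = read_df(path)

# Bytes allocated while reading (temporaries included)
bytes_allocated = @allocated read_df(path)

# Approximate size of the DataFrame that stays in memory
bytes_resident = Base.summarysize(df)

println("allocated while reading: ", bytes_allocated, " bytes")
println("size of the DataFrame:   ", bytes_resident, " bytes")
```

Comparing these numbers with the file size on disk gives a first impression; a fair comparison against fread or pandas would need the same measurement on their side.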
Looks great!
Did the benchmark measure reading the CSV into a DataFrame, i.e. CSV.File("test_data.csv") |> DataFrame?
However, the first time executing this command takes quite long for me due to compilation time.
I am not sure in which cases recompilation is required, probably if the column (types) of the csv files are different?
The main branch of CSV works quite well for me.
Most benchmarks I’ve seen before showed that data.table was much faster.
Has the situation changed that much? Or did they just use different tests?
My test case is
using CSV, DataFrames, PyCall, BenchmarkTools
pd = pyimport("pandas")
download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv",
"test_data.csv")
@time df = CSV.File("test_data.csv") |> DataFrame # including compilation
@btime df = CSV.File("test_data.csv") |> DataFrame
@btime pydf = pd.read_csv("test_data.csv")
The first CSV read takes 22 s for me (on a quite weak machine), but each subsequent read takes only 400 ms.
For comparison, Pandas takes 2s.
I tested on Julia 1.4.1 using the most recent versions of CSV.jl and DataFrames.jl.
I am still under the impression that fread is faster (in terms of “feel”). Let me explain. First, loading library(data.table) is much faster than using CSV, DataFrames. That makes sense, since one loads compiled libraries while the other has to precompile (or compile at first run).
The second reason I think fread is faster is simply the printing of the results back in the console. I don’t know why, but loading a big data frame in the REPL is often “laggy” and “choppy” for me, whereas fread reads and prints almost instantly. Maybe I am just being pedantic.
I like that one of the comments, if my German is correct, says that he doesn’t care about comparisons to R and Python and wants to see comparisons between Julia and FORTRAN. I wonder how many data analysts really use FORTRAN knowingly and on purpose?
This is very true. I wish we could talk up the fact that you can do this in Julia all the way. In my experience, people are never convinced with that argument as much as they are convinced by benchmark numbers.
I always think @jeff.bezanson’s line from a while ago sums it up: Come for the performance, stay for the experience.
People are always moving the bar. Of course it’s a silly comment since the R and Python CSV parsers are written in C and there’s no reason Fortran would be faster. At this point, it’s entirely possible that CSV.jl is the fastest overall CSV parser in existence.
This is a great first step. Hopefully dataframes will be as fast as data.table in the future.
I think the state-of-the-art CSV reader in Python these days is pyarrow, not pandas. It is also way faster than fread from R. I believe the team that originally created pandas moved all their efforts over to the arrow/pyarrow project long ago. I’ve been running a fairly comprehensive CSV benchmarking comparison for a couple of years here, and the parallel version of pyarrow is the thing to beat these days.