CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

Another point is that we don’t have to sacrifice expressiveness for speed in Julia. Higher-level functional programming in particular: map/reduce, parallel for loops, automatic differentiation, and macros that let you write domain-specific languages within Julia… that’s all stuff you really want. It’s a disaster in C or C++.

I’m interested in MCMC/sampling techniques, and I sure as heck am not going to write high performance samplers in R directly. And I sure as heck am not going to write them in C/C++ either (insert grumpy cat face here).

8 Likes

Yesterday, I mentioned on Twitter that every time I use R or Python, I see multiple dispatch and can’t use it. Yes, the two-language problem for speed doesn’t have to be an issue, but it’s hard not to notice libraries with awful C/C++/Java APIs instead of the dynamic language’s native syntax. Julia doesn’t have this issue.

4 Likes

Incidentally, MCMC samplers are pretty trivial to write given a log-likelihood (and derivatives as appropriate). E.g., NUTS proper takes <500 lines of C++ code in Stan, most of which is boilerplate.

For practical Bayesian inference, the tricky part is making the likelihood (and, again, its derivatives) evaluate fast. This is where Julia shines: your log-likelihood is just another function, mapping some inputs to a real number. You can AD it, debug it, unit test its building blocks, benchmark and optimize it… all within Julia.

Also, since it is just another function, you have the entire universe of Julia packages at your disposal. Want to use ODEs in your model? No problem. A custom distribution? Just code it and make it play nice with AD. With a few simple rules in mind, you can code in the style of exploratory research and still be within a factor of 2–5 of heavily optimized code (which you can write in Julia gradually as needed).
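As a minimal sketch of the “your log-likelihood is just another function” point (assuming ForwardDiff.jl is installed; the model and all names here are illustrative, not from any particular package):

```julia
using ForwardDiff

# Log-likelihood of iid Normal(μ, σ) data; θ = [μ, log σ] so the domain is unconstrained.
function loglik(θ, x)
    μ, logσ = θ
    σ = exp(logσ)
    return -length(x) * (logσ + 0.5 * log(2π)) - sum(abs2, x .- μ) / (2σ^2)
end

x = randn(1000)                                  # some observed data
θ = [0.0, 0.0]                                   # current parameter values
∇ = ForwardDiff.gradient(t -> loglik(t, x), θ)   # gradient via automatic differentiation
```

Because `loglik` is an ordinary function, the same definition can be debugged, benchmarked, or handed to any gradient-based sampler without a separate modeling language.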

10 Likes

Yes, the ability to write models directly in a performant language is huge. But I’m also interested in doing things like parallel tempering, surrogate sampling, and adaptive strategies. The idea being that information about the problem can be used to adapt the sampling methodology… and then the adapted method can be used to pull the final sample. I’ve experienced far too many instances where models don’t move from their initial conditions, or where out of, say, 4 chains only 2 of them sample well, etc., and I find this out after hours or overnight runs, when proper diagnostics and adaptation would have fixed the problem in minutes. I’m of the opinion that there is a lot we can do to waste less of the analyst’s time, and Julia enables such research while retaining serious levels of performance.

2 Likes

I’m an electrical engineer working in industry and I’ve spent the last couple of weeks working on a simulation in Julia. It’s not an open-source problem and I don’t have other people to write anything for me, but I’ve been able to go from needing 150 s of time and 16 GB of RAM to simulate 2 seconds of my system, to now needing 80 ms to compute 10 s of system time (plotting is now slower than the actual simulation!). That’s about 4 orders of magnitude difference. To get from point A to point B required both an expressive language that I could iterate quickly in and a language that was efficient. It’s awesome that they were the same one.

Much of that improvement was accomplished by changing my approach to the problem (think state-space averaging vs brute force), but doing the brute force method first was pretty much essential in enabling me to take the following steps (and believing that they were valid).

I think there’s a limit to how much I can juggle in my mind at one time. I know C, but the thought of writing this in C makes me shudder and frankly I just don’t have the time or patience for it. When I learned Python 20 years ago I was thrilled because it let me attack problems that I couldn’t have attacked as quickly, it let me think at a higher level. I learned to write code which I would throw away because I could write a better one even faster having written the first. Fast cycles. But Julia is more expressive (I mean both clearer and more concise) and faster in execution. It’s more complicated than Python but less complicated than Python plus the universe of stuff that tries to make Python faster than Python.

35 Likes

This is a great way to put it!

10 Likes

Most of the time when I’m reading a huge dataset, I read the top 10 rows first (to trigger compilation), then read the huge dataset. This is a huge win for Julia.
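For reference, that warm-up trick might look like this (a sketch assuming CSV.jl and DataFrames.jl, and CSV.jl’s `limit` keyword for reading only the first rows; `huge.csv` is a placeholder path):

```julia
using CSV, DataFrames

CSV.read("huge.csv", DataFrame; limit=10)   # small read triggers compilation
df = CSV.read("huge.csv", DataFrame)        # full read, now running compiled code
```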

1 Like

Just to clarify, since people seem to often be under misapprehensions about this, compilation takes a fixed amount of time, no matter how big a problem you’re doing. So in this case, triggering compilation first and then loading the whole data set is unnecessary unless you’re timing the huge dataset loading and want to omit compilation overhead.

This is why we generally exclude compilation time from benchmarks: the point of benchmarking is to measure a small, quick task that’s representative of a bigger, more complex task, so that we can extrapolate how fast the large, slow task would be. For a quick task, compilation overhead can be a large portion, but since it’s a fixed overhead, it doesn’t grow as the problem gets bigger. Of course, if the compilation overhead is on the order of 10 s and that’s also how long it takes to load your data set, then yes, that’s significant; but if you’re loading a really big data set that takes minutes to load, the compilation is still only going to take 10 s.
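The fixed-overhead point is easy to see with a toy function (a sketch; the exact timings are illustrative):

```julia
f(v) = sum(abs2, v)     # any function: the first call compiles a specialization

@time f(rand(10))       # includes one-time compilation
@time f(rand(10))       # already compiled: microseconds
@time f(rand(10^8))     # already compiled: time scales with the data, not compilation
```

The third call reuses the specialization compiled for `Vector{Float64}`, so its cost is pure work on the data, no matter how large the input.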

11 Likes

The problem with this is that for something like CSV reading, compile time scales with the number of columns, so if the dataset grows in both dimensions, it’s conceivable that bigger problems have a similar percentage of compile time.

CSV does not specialize on the number of columns (anymore), so that’s not the case as far as I’m aware.

7 Likes

Other than a few missing features, there are projects in both languages that are already there: Python, R.

7 Likes

Found this old thread on stackoverflow: reading csv in Julia is slow compared to Python - Stack Overflow

Maybe it makes sense to add a link to the JuliaComputing post there, because 6 years later the situation is totally different.

5 Likes

This seems like a point in Stefan’s favor if those packages are just wrappers for a Julia package. Julia is doing great if other languages are wrapping it rather than the other way around.

9 Likes

I suspect that post was humorous :blush:

5 Likes

Probably, I had a good laugh when I saw @ChrisRackauckas in both repos!

3 Likes

What would such a vision be, I wonder? I know R quite well, and its ecosystem is enormously extensive and of very high quality. I look forward to the Julia ecosystem reaching a similarly high level as R’s in the near future. I’m not just thinking about execution speed, because, as I’ve indicated, a data analyst looks at the entire ecosystem before moving to a new system like Julia.

But to be clear, I see the potential of Julia and fully support it! See here! It would be great to have more than a replacement for R (or Python) with Julia! :grinning:

You know it when you see it :grin:

3 Likes

My recollection of the history there is that Chris was trying to publish a paper about DifferentialEquations.jl.
It got rejected because “it is in an obscure language no one will use”, and in a weekend of pique, Chris coded up the Python and R wrappers.
So that one reviewer probably did more to advance the cause of DE solving in R and Python than many R or Python programmers working in the area ever have.

35 Likes

data.table::fread is a fast CSV reader in R, but not the fastest. vroom is even faster at reading CSVs, and I’d guess it should also be much faster than CSV.jl.

https://github.com/r-lib/vroom

Why? I almost feel like what any benchmark shows depends on the threads, shape, and types of the test files.

1 Like