CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

danielw2904 · September 4, 2020, 11:53pm

It’s hard to compare vroom, since

vroom doesn’t stop to actually read all of your data, it simply indexes where each record is located so it can be read later.
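The lazy-indexing idea in that quote can be illustrated with a minimal Python sketch. This is not vroom's implementation; the helper names `build_line_index` and `read_record` are hypothetical, and the parser is deliberately naive (it assumes no quoted fields or embedded newlines). It only shows why the up-front pass is cheap: the scan records where each line starts, and parsing happens later, one record at a time.

```python
import io


def build_line_index(f):
    """Scan a binary file once, recording the byte offset where each line starts."""
    offsets = []
    pos = f.tell()
    for line in iter(f.readline, b""):
        offsets.append(pos)
        pos += len(line)
    return offsets


def read_record(f, offsets, i):
    """Seek to line i and parse only that line (naive split, no quoted fields)."""
    f.seek(offsets[i])
    return f.readline().decode("utf-8").rstrip("\r\n").split(",")


# In-memory stand-in for a CSV file on disk (illustrative data only).
data = io.BytesIO(b"id,name\n1,alpha\n2,beta\n3,gamma\n")

index = build_line_index(data)     # cheap pass: offsets only, no field parsing
row = read_record(data, index, 2)  # parse a single record on demand
print(index, row)
```

A benchmark that times only the indexing pass measures different work than one that times a reader materializing every column, which is why the two are hard to compare.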

xiaodai · September 5, 2020, 1:26am

Except it does not work half the time. Lots of limitations are not mentioned. I tried it on Fannie Mae and it was slow.

aschmu · September 7, 2020, 7:03pm

I understand the idea, but from a user’s perspective removing compilation is kind of cheating. In practice I usually load the data only once per session, so this fixed cost is something I actually experience. It’s pretty common as a data scientist to load a 500k x 40 CSV, do a bunch of data processing, try a few models, and stop the Julia session, then rinse and repeat the next day. The next time I go through this process I’m still paying the 10-15s load time. Granted, it’s really not horrible, but CSV.read doesn’t feel as fast in practice as pandas or data.table for small/medium sized datasets. I’m sure it blows both out of the water for huge CSVs, but that’s not necessarily the majority of use cases.

pdeffebach · September 7, 2020, 7:11pm

I disagree. If you aren’t running the script from start to finish consistently throughout your work session, you aren’t working in a way that guarantees reproducible results.

When I’m working, I run main() all the time! Fast CSV reading after precompilation makes my life a lot easier.

DNF · September 7, 2020, 7:40pm

If that is the case, what is the problem? I mean, I could understand that it would be annoying if you paid that compilation price every 5-10 minutes, but once a day? 0.05% of your workday?
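As a rough check of that percentage, here is the arithmetic as a small Python sketch. The 8-hour workday is an assumption (the thread does not state one), and `startup_share` is a hypothetical helper name; the 10-15 s range is the load time quoted earlier in the thread.

```python
def startup_share(startup_seconds, session_hours):
    """Fraction of a session spent on a one-off startup cost."""
    return startup_seconds / (session_hours * 3600)


# Assumption: one 8-hour workday, with the load cost paid once.
low = startup_share(10, 8)
high = startup_share(15, 8)
print(f"{low:.3%} to {high:.3%}")
```

Under that assumption the 15 s case comes to about 0.05% of the day, consistent with the figure in the post.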

xiaodai · September 8, 2020, 6:36am

For me, the first thing I do is load it via CSV and then save it into a more efficient format like JDF.jl. That’s my 1_import_data.jl. From step 2 onward I load using JDF.jl, which should be faster. I saw some good benchmarks for JLD2.jl as well, but I have not tested it on a large dataset yet.
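The import-once pattern can be sketched in Python as well. This is only an analogy: it uses the standard-library `pickle` module as a stand-in for a binary format and says nothing about the JDF.jl or JLD2.jl APIs. The file names and the helpers `import_csv` and `load_cached` are hypothetical.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def import_csv(csv_path, cache_path):
    """Step 1: parse the CSV once and store the rows in a binary cache."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(cache_path, "wb") as f:
        pickle.dump(rows, f)
    return rows


def load_cached(cache_path):
    """Step 2 onward: skip CSV parsing and load the cached rows directly."""
    with open(cache_path, "rb") as f:
        return pickle.load(f)


# Tiny illustrative dataset written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "data.csv"
cache_path = workdir / "data.pkl"
csv_path.write_text("id,value\n1,3.5\n2,4.0\n")

original = import_csv(csv_path, cache_path)
reloaded = load_cached(cache_path)
print(reloaded)
```

With this split, the CSV parsing cost is paid only in the import step, and later steps read the cached copy.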

Gunter_Faes · September 9, 2020, 9:40am

@aschmu, this way of working is not unusual for a data scientist. I usually load the data as a data frame at the beginning of the session and, if I can afford it, make a working copy of the original data. I then perform the analyses on this working copy and can go back to the original data if necessary. So reading the CSV file is usually only necessary once, and a few seconds either way do not matter. It almost doesn’t matter whether I import the CSV file with R, Python or Julia.

Balinus · December 2, 2020, 3:51pm

I think that one way of looking at it is that if you only do interactive data analysis, it does not matter much which language you use. Your idling time (thinking between commands, etc…) is far higher than computing time.

Where it does matter is when your computations and workflow are defined and you need to launch the whole process: that’s where Julia shines… assuming your calculations are substantial and not simply a 1-minute calculation.

Ideally, I like to do both in the same language; it’s much more productive.

stevengj · March 23, 2022, 1:48pm

5 posts were split to a new topic: Updated CSV reader benchmarks?
