TextParse.jl is fast again

davidanthoff · October 23, 2018, 3:23am

Some of you might have noticed that TextParse.jl (and CSVFiles.jl, which is a small wrapper around TextParse.jl) saw some major performance regressions initially on julia 1.0.

I just fixed these, and both packages are now back to the kind of performance we saw on julia 0.6 (which was pretty good). If you had given up on either package since moving to julia 1.0, I encourage you to give them another try. They should be very usable again.

As part of that work I also wrote a benchmark that compares various CSV reading packages on julia, Python and R. The high-level summary is that R’s fread beats pretty much everything, but other than that TextParse.jl is (and was on julia 0.6) looking pretty good.

Here are the detailed results:

[Figure: benchmark results for CSV reading packages on julia, Python and R]

I used the latest tagged version of every package tested. A “; 0.6” in the package name means the run was done on julia 0.6. I ran every benchmark five times. The bar shows the best of those five runs, and all five runs are shown as ticks. The files have different types of data in their columns: files with “mixed” in the name have one column each of float, int, string, categorical string (a string column that only ever has a few different values) and datetime. Files with “uniform” in the name have 20 columns, all with the same data type. “short” in the filename signals that floating point numbers have no more than 6 digits.
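For anyone who wants to try the same methodology on their own data, here is a minimal sketch in Python using only the standard library. It is not the actual benchmark code: the function names (`make_mixed_csv`, `read_csv`, `time_runs`) are hypothetical, and the standard-library `csv` reader is only a stand-in for whichever package you want to measure. It builds a small “mixed” file, times five runs, and reports the best one.

```python
import csv
import io
import random
import time


def make_mixed_csv(n_rows, seed=0):
    """Build an in-memory CSV with one float, int, string,
    categorical and datetime column (hypothetical layout)."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["float", "int", "string", "categorical", "datetime"])
    for i in range(n_rows):
        writer.writerow([
            f"{rng.random():.6f}",
            rng.randint(0, 1_000_000),
            f"s{rng.randint(0, 10**9)}",
            rng.choice(["a", "b", "c"]),  # only a few distinct values
            f"2018-10-{1 + i % 28:02d}T12:00:00",
        ])
    return buf.getvalue()


def read_csv(text):
    """Stand-in reader; a real benchmark would call the package under test."""
    return list(csv.reader(io.StringIO(text)))


def time_runs(func, arg, repeats=5):
    """Return the wall-clock time of each run, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return timings


data = make_mixed_csv(10_000)
runs = time_runs(read_csv, data)
best = min(runs)  # the bar in the plot; the individual runs are the ticks
print(f"best of {len(runs)}: {best:.4f}s")
```

Reporting the minimum rather than the mean is a common choice for this kind of measurement, since noise from the rest of the system can only make a run slower.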

xiaodai · October 23, 2018, 5:46am

A side note is that fread is very fast at most everything that it does!

jtackm · October 23, 2018, 7:40am

Thanks a lot for your effort and these extensive benchmarks. I was wondering about the number of columns per data set, it would be great to see how each approach scales as a function of that. Last time I checked, most julia solutions (except DataFrame’s deprecated readtable and Base’s readdlm) were struggling with high-dimensional data sets (>1k columns).
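A rough way to check column scaling for any reader is to hold the total number of cells fixed and vary only the number of columns. The sketch below does that in Python with the standard library. The helper names (`make_uniform_csv`, `parse_floats`, `best_time`) and the column counts are made up for illustration, and the `csv` module again stands in for the package you actually care about.

```python
import csv
import io
import random
import time


def make_uniform_csv(n_rows, n_cols, seed=0):
    """Build an in-memory CSV where every column holds floats."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"col{j}" for j in range(n_cols)])
    for _ in range(n_rows):
        writer.writerow([f"{rng.random():.6f}" for _ in range(n_cols)])
    return buf.getvalue()


def parse_floats(text):
    """Stand-in reader: parse every cell after the header as a float."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip header
    return [[float(x) for x in row] for row in reader]


def best_time(func, arg, repeats=5):
    """Return the fastest of `repeats` wall-clock timings, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


n_cells = 100_000  # same amount of data for every shape
results = {}
for n_cols in (20, 200, 2000):
    text = make_uniform_csv(n_cells // n_cols, n_cols)
    results[n_cols] = best_time(parse_floats, text)
    print(f"{n_cols:>5} columns: {results[n_cols]:.4f}s")
```

If a reader scales well, the timings should stay roughly flat across shapes. A sharp increase for the wide files would point to per-column overhead.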

xiaodai · October 23, 2018, 8:26am

I have asked before, but I wonder whether Julia can eventually match fread.

Tamas_Papp · October 23, 2018, 8:46am

There is no theoretical reason it couldn’t, someone just needs to put in the micro-optimization work.

I think that fixing major performance regressions is important, so I am happy that TextParse.jl is competitive again, but I am not sure that it is super-important to beat data.table::fread once we are within 5x. Data reading is usually the least expensive part of a nontrivial computation, and one would not read all the data for large files (when this matters the most) into memory anyway.
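To illustrate the point about not loading large files into memory, here is a small Python sketch that streams a CSV row by row and keeps only a running total. The function name `streaming_mean` and the sample data are invented for the example. It is an illustration of the general idea, not something taken from any of the packages discussed here.

```python
import csv
import io


def streaming_mean(lines, column):
    """Compute the mean of one column without holding all rows in memory."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:  # rows are parsed one at a time
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Any iterable of lines works, including an open file handle.
sample = io.StringIO("x,y\n1.0,10\n2.0,20\n3.0,30\n")
print(streaming_mean(sample, "x"))  # prints 2.0
```

With this pattern, memory use stays constant however long the file is, so the parser's raw throughput matters less than it does in a read-everything benchmark.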

jtackm · October 23, 2018, 9:04am

And to add to that, TextParse.jl already looks quite comparable to fread for larger numbers of rows. If one has to read many small csv files (which is also an important use case), TextParse.jl may be slower for now, but coincidentally this is where CSV.jl seems to do very well.

matthieu · October 23, 2018, 2:23pm

Looks great! Out of curiosity, can you expand a bit on why you perform better than CSV for large datasets?

davidanthoff · October 23, 2018, 4:13pm

Agreed, a PR that adds runs with say 100, 1k and 10k columns would be great!

I entirely agree! My goal with the recent work was not to write the fastest CSV parser, I mainly just wanted to get TextParse.jl back into a usable state, so I did some very targeted optimizations so that it got back to the excellent performance it had on julia 0.6 (thanks to @shashi!). There are actually a lot of places where one could do more, but for now I thought we should release a version that is back to the old performance.

No idea. TextParse.jl was just always really fast; I think @shashi just did some awesome work with it. Also keep in mind that CSV.jl (the current version) is really a young package. Even though the package itself has been around for a long time, I believe the current version is essentially a recent complete rewrite. I would guess that this is simply a case where it takes time to mature a package.

jtackm · October 23, 2018, 10:10pm

Sure, I’ll have a look

davidanthoff · October 29, 2018, 4:11pm

Thanks to a PR from @jtackm we now also have results for files with 200 columns:

[Figure: benchmark results including files with 200 columns]

tbeason · October 29, 2018, 4:27pm

Do you have any ideas as to why there seems to be substantial overhead between CSVFiles and TextParse? Sometimes the time difference looks negligible, but typically I see CSVFiles lagging behind, even though it is TextParse under the hood.

jtackm · October 29, 2018, 6:24pm

I’m curious as well, I saw in one additional high-dimensional test (5K rows, 6K cols) that differences are even more pronounced (TextParse < 5s, pandas < 10s, CSVFiles > 700s). But since TextParse is already there, I think that should be fixable.

davidanthoff · October 29, 2018, 6:40pm

Yes, I do. I’ve had some PRs lingering around that address that for many, many months. Most of them are merged now. The only thing left to do is to merge this, and then there is no real overhead left from using CSVFiles.jl, i.e. you get the same performance that raw TextParse.jl gives you.

davidanthoff · October 29, 2018, 6:41pm

That is actually very good news: TextParse seems to perform very well even with a couple thousand columns!

mbauman · October 30, 2018, 8:38pm

7 posts were split to a new topic: Tables.jl vs TableTraits.jl (was TextParse.jl is fast again)
