A living post on the speed of Julia vs R for data manipulation tasks

Update Please also see a new post

https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

Original post

https://www.codementor.io/zhuojiadai/speed-of-data-manipulation-in-julia-vs-r-cd7praapv

Comments are welcome, and please let me know if my Julia or R code isn't optimal.

4 Likes

Nice resource! A data manipulation benchmark is a good idea, and it shows (not too surprisingly) that Julia still has room for improvement.

I think a lot of R users (at least in my field) would rely on the “tidyverse” for most data reshaping. Something comparing not only speed, but also ease of use (and readability…) may also be useful.

2 Likes

Yeah. I use the tidyverse as well, but when speed is important (i.e. in most of my use cases) data.table becomes the go-to tool. I did some other benchmarks and found that the fastest way to read a CSV is via RCall using the data.table package, instead of using CSV.jl. Hopefully DataStreams.jl can change that.
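For concreteness, here is a minimal sketch of the RCall approach, assuming R with the data.table package is installed; the file path is illustrative (a temporary CSV is written first so the snippet is self-contained):

```julia
using RCall, DataFrames

# Write a small CSV so the example is self-contained (path is illustrative).
path = tempname() * ".csv"
write(path, "id,v\n1,0.5\n2,1.5\n")

# Read it with data.table's fread via RCall; rcopy converts the
# resulting R data.table into a Julia DataFrame.
df = rcopy(R"data.table::fread($path)")
```

The round trip through R adds a conversion cost, so this only wins when fread's parsing speed dominates.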

1 Like

It won’t. What will change it is using a completely different architecture in DataFrames. Have you read the discussions? You might want to mention these, or I think maybe they might be benchmarkable on master? The interesting development to track “live” would be how it’s doing on master.

Some comments.

using the DataFrames packages which is officially supported by Julia Computing.

Is this really the case? It’s included in JuliaPro but I don’t think I know anyone working on DataFrames from Julia Computing. Julia Computing tends to work with IterableTables (which would be an interesting candidate for benchmarking).

A side note about time taken to generate synthetic testing data

The issue (which isn't explained in the post; it's probably worth giving a technical explanation) is due to type instabilities related to missing values. R has many optimizations for them; the current DataFrames design does not. It is thus a problem of inferability on access, so this isn't a surprise.
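A minimal illustration of the inferability problem (written with Julia's current `missing`, which plays the role the NA/Null types of that era did): a column whose element type is a Union is not concrete, so the compiler cannot infer a concrete type on each access. On the Julia of this thread that forced dynamic dispatch in the loop; current Julia optimizes small Unions, so the gap is smaller now.

```julia
# Sum a vector, skipping missing entries.
function colsum(v)
    s = 0.0
    for x in v
        x === missing || (s += x)
    end
    return s
end

v  = rand(10^6)                                   # Vector{Float64}: inferable
vm = convert(Vector{Union{Float64, Missing}}, v)  # Union eltype: not concrete

# Compare `@time colsum(v)` with `@time colsum(vm)` to see the access cost.
```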

The first time it loads in 125 seconds, but it also builds a cache so that the next load is faster; indeed, the second time it was loaded it took about 13 seconds.

This is confusing. Does the first time include JIT compilation? You used @time there but didn't show that you ran it twice. I would expect compilation to take a few seconds, so I don't know what accounts for the full 100 seconds.

But this loading seems unusually fast, and the real test should be how quickly it can then do data manipulation.

I believe that JuliaDB tries to operate out-of-core when it can? You might want to check that.

It would be nice to see some more complex operations benchmarked as well, though of course only after the missing-data change, since I don't think anyone expects DataFrames to be fast before it. But once this is updated to use the post-change DataFrames and covers some more complex operations, it will be a nice resource for devs to know what to tackle. Is there a literature on common benchmark problems? This might be worth a repo.

Yeah. See “List of supported Julia packages” at the bottom of this page

This is not an issue I understand well yet, as I am new to Julia, but I will continue to research it.

I think it's hard to separate out compilation time in this case, as the first run also builds a "cache" of the data to enable faster loading. In any case, 125 seconds is indicative of performance.

Could you compile it on a different set of data first? Compilation would probably only take about 5 seconds at most, but it would be good to make that clear.

It doesn’t change the conclusion though and also doesn’t change the sense of the magnitude of the time it takes. Once I understand it more I will probably briefly explain it in the post.

Oh yeah, of course, I wouldn't expect it to change the conclusion at all either. I'm just interested in a benchmark exploration of the details. I think it would be good to set up a repo with many more detailed test cases, to really see how DataFrames with Union{T,Null} does and how much of an improvement it really is. I'd be willing to lend a hand. I'd assume part of the difference is type stability, but a comprehensive benchmark suite would show that specific algorithms could use improvements even after that's fixed.

1 Like

https://github.com/xiaodaigh/data_manipulation_benchmarks

I set up the GitHub repo here. It would be awesome to have your contribution!

2 Likes

Thanks! I gotta find out what’s going on with the new data stack too, and what better way to do it than by benchmarks :smile:.

1 Like

DataFrames from git master could potentially be faster even on Julia 0.6, as the constructor no longer converts columns to DataArray by default, which means that the type instability due to missing values isn’t present. Yet, I’ve done some testing and it seems we’re still about 3 times slower than data.table (which is quite honorable I must say).

I think that's because the grouping functions currently do many inefficient operations. Note that by cannot really be fast for this, since it doesn't know that you're going to use a single column: it needs to extract all columns. It's also a very flexible function, which allows you to return a full DataFrame for each group. In theory, aggregate would be a better choice for a grouped mean (closer to what data.table does), since it's a simpler function that is passed only the columns it needs, but currently it isn't faster (see this PR for possible improvements). We might also want to add something for the simple case of computing a summary statistic over a single column, as this can be done much more efficiently than more complex operations (you don't even need to allocate anything).
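For reference, a grouped mean of the kind discussed above looks like this. The data is illustrative, and the snippet uses the current DataFrames API, where the `by` of this thread's era has since been folded into `groupby`/`combine`, which only touches the columns it needs:

```julia
using DataFrames, Statistics

df = DataFrame(id1 = repeat(1:10, 100), v1 = rand(1000))

# by-era style was roughly: by(df, :id1, d -> mean(d[:v1]))
# Current API: group on :id1, reduce only the :v1 column per group.
res = combine(groupby(df, :id1), :v1 => mean => :v1_mean)
```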

If you want to help, it would be very useful to translate the benchmarks developed by data.table, dplyr and/or Pandas to Julia, and include them in DataFrames using the PkgBenchmark system. Indeed, we haven't done any systematic benchmarking so far, so we never know whether improving the timings for one use case might make them worse for others.
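A minimal sketch of what such a suite could look like under the PkgBenchmark convention (a `benchmark/benchmarks.jl` file defining a `SUITE`; the group names and data sizes here are illustrative):

```julia
using BenchmarkTools, DataFrames

# PkgBenchmark looks for a top-level BenchmarkGroup named SUITE.
const SUITE = BenchmarkGroup()
SUITE["groupby"] = BenchmarkGroup()

df = DataFrame(id1 = repeat(1:100, 1000), v1 = rand(100_000))

# One benchmark per operation we want to track across DataFrames versions.
SUITE["groupby"]["sum_by_id1"] =
    @benchmarkable combine(groupby($df, :id1), :v1 => sum)
```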

5 Likes

PkgBenchmark looks like the way to go.

I wonder if using RCall to benchmark R code is OK? I can still use benchmarking tools in R to benchmark the R code inside an R"" block and use Julia to aggregate the results.

I have included code to test aggregate, so once it's improved it will show up:

aggr2(dt) = aggregate(dt[:,[:id1,:v1]], :id1, sum)
@benchmark aggr2(dt)
1 Like

I’m not sure, but I would expect performance to be identical when using RCall. However, note that we shouldn’t include data.table benchmarks in DataFrames: the PkgBenchmark tests should only catch changes in DataFrames performance. Then we can compare the results manually with data.table.

If you want to use PkgBenchmark.jl, please try out the master branch and give feedback. The API has changed substantially since the tagged version, and it needs real-life testing from people other than me before it can be tagged.

2 Likes

I submitted a similar benchmark as a PR to Query.jl; it tests Query.jl (for different data source types) vs DataFramesMeta.jl vs R's data.table.

https://github.com/davidanthoff/Query.jl/pull/154

3 Likes

@floswald maybe we can consolidate your work into this github repo

Yes, by all means. I think this would make sense if we are able to use PkgBenchmark.jl, so it can easily be automated and better maintained in the future. Also, I wouldn't limit it to Julia vs R, but keep the door open for other comparisons later on: Python, of course, and folks in my field use this thing called Stata, so that could be added at some point as well.
I'll submit the same PR to your repo.

I agree, speed is quite important in data processing. Plus, once you are fluent with data.table, it actually feels neat and even more readable than the tidyverse.

On the contrary, DataFrames.jl syntax is a bit complex, flooded with special characters (!, _, __, :). It reminds me of the old days when I was developing with kdb/q.