Julia performs poorly on group-by benchmarks

These shouldn’t have any effect; in any case, the times taken for both the first and the second run are reported.

Yes, but only on the plots. The number reported is the maximum of the two. We could consider convincing Jan Górecki to report the minimum :smile:.

Unfortunately, CategoricalArrays come with a bunch of features that are not always welcome (for instance, no natural ordering). I don’t think users should have to work with CategoricalArray if they only want efficient grouping. I hope DataFrames’ support for PooledArrays gets better in the future!
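To illustrate the ordering point, a small sketch with CategoricalArrays (a reasonably recent API is assumed):

using CategoricalArrays

v = categorical(["low", "high"])
# v[1] < v[2] throws: unordered categorical values cannot be compared.
vo = categorical(["low", "high"], ordered = true)
levels!(vo, ["low", "high"])  # the order must be declared explicitly
vo[1] < vo[2]                 # true, by level order rather than string order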

2 Likes

Hello,
I just published results after switching to categoricals=0.05; the exact differences from the previous run (categoricals=false) can be examined with the following R query:

fs = function(x) factor(x, levels=unique(x))  # factor with levels in order of appearance
d = data.table::fread("https://h2oai.github.io/db-benchmark/time.csv")
d[task=="groupby" & solution=="juliadf"
  ][batch==1544072228, batch:=1                     # the categoricals=false run
    ][batch%in%c(1544286223, 1544333533), batch:=2  # the categoricals=0.05 runs
      ][batch%in%c(1,2)
        ][order(batch), tail(.SD, 2L), by=.(data=fs(data), question=fs(question), run), .SDcols=c("batch","time_sec")
          ][, data.table::dcast(.SD, data+question+run ~ batch, value.var="time_sec")
            ][, diff:=sprintf("%2.2f%%", ((`2`-`1`)/`1`)*100)  # change of new config (2) relative to old (1)
              ][, print(.SD, topn=1e3)] -> nul

Here is the head of the results, for the first dataset; a negative diff means a speed-up:

               data              question run         1         2    diff
  1: G1_1e7_1e2_0_0         sum v1 by id1   1     4.542     4.274  -5.90%
  2: G1_1e7_1e2_0_0         sum v1 by id1   2     0.791     0.278 -64.85%
  3: G1_1e7_1e2_0_0     sum v1 by id1:id2   1     1.854     2.777  49.78%
  4: G1_1e7_1e2_0_0     sum v1 by id1:id2   2     1.776     2.900  63.29%
  5: G1_1e7_1e2_0_0 sum v1 mean v3 by id3   1     3.811     4.019   5.46%
  6: G1_1e7_1e2_0_0 sum v1 mean v3 by id3   2     3.234     3.254   0.62%
  7: G1_1e7_1e2_0_0     mean v1:v3 by id4   1     1.807     1.857   2.77%
  8: G1_1e7_1e2_0_0     mean v1:v3 by id4   2     1.000     1.004   0.40%
  9: G1_1e7_1e2_0_0      sum v1:v3 by id6   1     3.301     3.455   4.67%
 10: G1_1e7_1e2_0_0      sum v1:v3 by id6   2     2.934     3.078   4.91%

Three "Benchplot"s you can see on the report page are not everything we have there. Please scroll down, below the plot there is a table (and links to more benchplots). You should be able to get some useful information from there also. For example when data have k factor equal to 10, question3 and question5 are terribly slow. This cannot be easily noticed when looking at default benchplot that presents k=100. Please note that table shows only first run timing, so it may suffer due to compiling query. You can get second run timings using raw timings data from https://h2oai.github.io/db-benchmark/time.csv

Unless you look at the k factor equal to 2: there Julia fails with ERROR: LoadError: OutOfMemoryError() when attempting to execute the first question, and another time the Julia process is killed by the OS while doing the second question. This case is hard across the board: dplyr OOMs on question 3, and data.table manages to finish all questions successfully, but the R system calls to ps -o RSS used to measure memory hit OOM after q3.

If we put the minimum value there, we would need to align it on the X axis using a different value; otherwise it would overlap the longer bar.

Of course strings could be handled more efficiently behind the scenes, but we want to present the fastest way, so categoricals are preferred, even if preparing the data adds overhead. Additionally, see my comments in group on factor column(s) #20.

As we are unable to guarantee that a solution won’t use categoricals internally for character variables, the best option seems to be using the fastest approach for each solution, which will probably be a categorical type.

I hope categoricals will be fixed soon so the comparison becomes fair, because right now, as pointed out by @nalimilan, other tools use categoricals while Julia (at least partially) uses strings. Additionally, adding user defined functions #57 will highlight the advantages of Julia over the other solutions.

4 Likes

Thanks @jangorecki. To sum up the results, I see one large improvement and two large regressions. The regressions are for grouping on two columns, where we don’t use the optimized method yet (or only in part). I wouldn’t expect a regression, though, so I’ll have to investigate.

BTW, looking at the benchmarks locally, a large part (or even the majority) of the time is spent computing the sum/mean for each group. I suspect we need to add a special method for sum/mean to match the performance of other languages.

Is it possible to pre-compile those by() expressions without touching the input data, but, for example, knowing only the data types of the input? If so, we should do it to avoid the compilation penalty during the benchmark. The input of the benchmark is the data, not its schema.

A hacky way is to pass in just one row of data so that it compiles.

Using a row of the input data is not acceptable, but maybe we can build a single-row (or, even better, zero-row) dataframe of the same types.
If there is an open issue for functionality to pre-compile a function call, please link it so I can subscribe and track its status.

You can call precompile with the types of the arguments, but I’m not sure how to specify keyword arguments. More importantly, if some code paths cannot be inferred statically (which is the case for some calls, I think, since we don’t care about type stability in the wrapper, only in the inner loop), precompile will miss them AFAICT. In general the simplest approach is to use a toy data frame as @xiaodai suggests (one row is enough; if it’s empty, not all code will be precompiled).
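For what it’s worth, a minimal warm-up sketch along those lines, assuming the by API discussed in this thread (the column names here just mirror the benchmark’s schema):

using DataFrames

q1(d) = sum(d.v1)                            # the per-group reduction, defined once
warmup = DataFrame(id1 = ["id1"], v1 = [1])  # one row with the same column types as the real data
by(warmup, :id1, q1)                         # compiles the grouped-sum code path

# by(df, :id1, q1)  # the timed run on the real data now skips most compilation

Note that the warm-up only helps if the very same function object is reused for the timed run; defining a fresh anonymous function would trigger specialization (and thus compilation) again.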

That said, I’m not sure it’s completely fair to exclude the compilation time from benchmark results at this point, given that we don’t store generated code for future sessions (except if you include it in the sysimage, but that’s not the typical situation). As long as the timings for both the first and the second run are shown, I think it’s fine.

This is exactly what I think is best - to show both timings as bars and as numbers.

Added to Include timings at the end of each bar #31.

1 Like

How could the mean and sum computation be optimized? Doesn’t Julia already have very optimized mean/sum implementations? Maybe related, here is an extract from the data.table FAQ (I am not sure what Julia does):

Base R’s mean accumulates in long double precision where the machine supports it (and most do) and then does a second pass through the entire vector to adjust mean(x) by typically a very small amount to ensure that sum(x-mean(x)) is minimized. What do these packages do?
Both data.table and dplyr currently do a single pass sum(x)/length(x) where the sum is done in long double precision. They match base in terms of long double but don’t do the second pass, currently. Feedback welcome.
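For reference, a sketch in Julia of the two-pass mean that the FAQ describes (plain code, not what any of these packages actually implements):

# First pass: ordinary mean. Second pass: re-center so that
# sum(x .- m) is (nearly) zero, reducing accumulated rounding error.
function mean2pass(x)
    n = length(x)
    m = sum(x) / n
    return m + sum(xi - m for xi in x) / n
end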

Julia’s sum is fast, but the most efficient way of computing by-group sums isn’t to call sum on the subset of rows for each group repeatedly as we currently do. Instead, we should directly allocate a vector with one entry per group, and add values to the corresponding entry. That way we can do a single pass over the data with an optimal memory access pattern.
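A minimal sketch of that access pattern, assuming the rows have already been mapped to integer group codes in 1:ngroups:

# Single pass over the data: one accumulator per group, sequential reads of x.
function groupsum(x::AbstractVector, groups::AbstractVector{<:Integer}, ngroups::Integer)
    res = zeros(eltype(x), ngroups)
    @inbounds for i in eachindex(x, groups)
        res[groups[i]] += x[i]
    end
    return res
end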

5 Likes

Most operations fall into this paradigm, which I call the reduction paradigm: max, min, sum, and mean all fit it. But things like median do not.
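The same single-pass loop generalizes to any such reduction by swapping the operator and its identity element; a sketch (the name and signature are hypothetical, not an existing API):

# `op` is the binary reduction, `init` its identity (0 for +, typemin(eltype(x)) for max).
function groupreduce_sketch(op, x, groups, ngroups; init)
    res = fill(init, ngroups)
    @inbounds for i in eachindex(x, groups)
        res[groups[i]] = op(res[groups[i]], x[i])
    end
    return res
end

# groupreduce_sketch(max, x, groups, ngroups; init = typemin(eltype(x)))

Median does not fit because it cannot be updated one element at a time; it needs all of a group’s values at once.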

It could actually be worth exposing this functionality to the user if it gets implemented to speed up sum: in JuliaDB it’s called groupreduce and it supports operations from OnlineStats as well.
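A usage sketch, assuming JuliaDB’s documented groupreduce(f, t, by; select) form:

using JuliaDB

t = table((id = [1, 1, 2], v = [10, 20, 30]))
groupreduce(+, t, :id; select = :v)   # one summed entry per id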

2 Likes

There are two options:

  • One entry per group.
  • As many entries per group as the original number of elements, that is, replicate the computed value.

Both things are useful. R’s data.table can do both: if you append (:=) the value to the original table as a new column, you get the second approach.
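A sketch of the second approach in Julia (a hypothetical helper, just to show the pattern): compute one value per group, then replicate it back to the rows:

function groupsum_per_row(x, groups, ngroups)
    agg = zeros(eltype(x), ngroups)
    for i in eachindex(x, groups)
        agg[groups[i]] += x[i]       # first: one entry per group
    end
    return [agg[g] for g in groups]  # then: replicate to each original row
end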

1 Like

What you are describing is how to return the already-computed aggregates; what nalimilan describes is how to compute them.

1 Like

DataFramesMeta does this with transform on a grouped DataFrame. But a fast syntax for this with DataFrames would be nice.

Yes, this applies to all reductions, and it would make sense to expose this to users in some way (in addition to special-casing sum and other common reductions).

2 Likes

Any suggestions for off-memory storage of DataFrames with fast reading? I need to reduce timings, and re-reading from CSV is sub-optimal. The full script of reading the data and running the queries for N=1e9 and K=10 takes 21 hours (swapping might be involved).
I tried JLD2 and Feather, but neither works at this size. Serialize and deserialize are not recommended, but will they keep working over time if I keep updating Julia packages but not Julia itself?
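In case it helps, a minimal sketch of the Serialization stdlib route on a reasonably recent Julia (with the caveat that the format is only guaranteed readable by the same Julia version that wrote it, and the stored object also depends on the DataFrames version in use, so treat it as a cache, not long-term storage):

using Serialization, DataFrames

df = DataFrame(id1 = ["a", "b"], v1 = [1, 2])  # stand-in for the real data
serialize("df.jls", df)       # binary snapshot on disk
df2 = deserialize("df.jls")   # fast re-read in a later session, same versions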