How is the data ecosystem right now for large datasets?

That's good to know. The timing for DataFrames is 18s.

Another data point:

julia> N = 100_000_000
100000000

julia> A = rand(N);

julia> B = rand(1:100, N);

julia> @time t = IndexedTable(Columns(B=B), A)
  4.206044 seconds (129 allocations: 2.235 GiB, 4.40% gc time)
B   │
────┼──────────
1   │ 0.683076
1   │ 0.846248
1   │ 0.825938
1   │ 0.641028
1   │ 0.980836
1   │ 0.79498
1   │ 0.608793
1   │ 0.216588
1   │ 0.969373
1   │ 0.710852
    ⋮
100 │ 0.0101481
100 │ 0.528986
100 │ 0.858588
100 │ 0.924309
100 │ 0.713443
100 │ 0.74248
100 │ 0.531245
100 │ 0.502521
100 │ 0.781031
100 │ 0.63794

julia> @btime aggregate_vec(mean, t)
  246.059 ms (225 allocations: 762.95 MiB)
B   │
────┼─────────
1   │ 0.499787
2   │ 0.500066
3   │ 0.500087
4   │ 0.499988
5   │ 0.500028
    ⋮

IndexedTables is fast if you recognize which columns of your data can be indexed. You can use JuliaDB.jl to assist you with loading data and doing these things in parallel. The parallel version has the same API as IndexedTables.


Is that the same operation as the groupwise mean and join above? Just trying to wrap my head around the interface.

Yes, aggregate_vec will combine rows with the same indices using a Vector → scalar function like mean.
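For intuition, here is what that aggregation computes, sketched with plain arrays and a Dict. The function name is hypothetical and this is not how IndexedTables implements it (it works over the sorted index instead), just the semantics:

```julia
# Hedged sketch of aggregate_vec(mean, t): group values of A by the
# index column B and reduce each group with a Vector -> scalar function.
function grouped_mean(B::Vector{Int}, A::Vector{Float64})
    sums   = Dict{Int,Float64}()
    counts = Dict{Int,Int}()
    for (b, a) in zip(B, A)
        sums[b]   = get(sums, b, 0.0) + a
        counts[b] = get(counts, b, 0) + 1
    end
    # one mean per distinct index value
    Dict(b => sums[b] / counts[b] for b in keys(sums))
end

grouped_mean([1, 1, 2], [0.2, 0.4, 0.9])  # Dict mapping 1 and 2 to their means
```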

IndexedTable's join/broadcast operations require that the tables have unique indices. Hence we can achieve the same effect by creating a table with a rowid column which is just 1:n…


julia> @time t = IndexedTable(Columns(rowid=[1:N;], B=B), A)
  0.523119 seconds (154 allocations: 762.947 MiB, 17.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.861924
2          11 │ 0.745541
3          20 │ 0.326802
4          6  │ 0.480064
5          39 │ 0.391775
6          74 │ 0.113307
7          88 │ 0.928499
8          16 │ 0.468637
9          91 │ 0.651933
10         94 │ 0.550056
              ⋮
99999991   56 │ 0.501691
99999992   79 │ 0.0800012
99999993   70 │ 0.120557
99999994   42 │ 0.169668
99999995   4  │ 0.820699
99999996   84 │ 0.00919058
99999997   89 │ 0.454209
99999998   81 │ 0.395971
99999999   21 │ 0.0274831
100000000  20 │ 0.688663

julia> @time means = reducedim_vec(mean, t, :rowid)
  4.985885 seconds (388 allocations: 2.980 GiB, 8.86% gc time)
B   │
────┼─────────
1   │ 0.499943
2   │ 0.499725
3   │ 0.499708
4   │ 0.499893
5   │ 0.499893
6   │ 0.500239
7   │ 0.500111
8   │ 0.499465
9   │ 0.500193
10  │ 0.500168
    ⋮
91  │ 0.500453
92  │ 0.500233
93  │ 0.50035
94  │ 0.499753
95  │ 0.500005
96  │ 0.499817
97  │ 0.499691
98  │ 0.499374
99  │ 0.50056
100 │ 0.499549

julia> @time broadcast(-, t, means)
 22.724496 seconds (253 allocations: 3.748 GiB, 1.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.361472
2          11 │ 0.245748
3          20 │ -0.17298
4          6  │ -0.0201752
5          39 │ -0.107821
6          74 │ -0.386973
7          88 │ 0.428944
8          16 │ -0.0314757
9          91 │ 0.151481
10         94 │ 0.0503028
              ⋮
99999991   56 │ 0.00210082
99999992   79 │ -0.419816
99999993   70 │ -0.379443
99999994   42 │ -0.330163
99999995   4  │ 0.320806
99999996   84 │ -0.490412
99999997   89 │ -0.0460049
99999998   81 │ -0.103974
99999999   21 │ -0.472331
100000000  20 │ 0.188881

Currently there's no way to add a new column in broadcast – but that's something we should definitely add.


With some work on DataTables, I was able to cut the time required to run @aaowens's groupby example from above in half. This is not completely ready yet, but see this pull request. It illustrates that it shouldn't be too hard to reach a performance level reasonably close to Pandas.

I ran the groupby and join example in R's data.table.

For the code

> N = 100000000
> A = runif(N)
> B = sample(1:10, N, replace = T)
> DT = data.table(A = A, B = B)
> D2 = DT[,mean(A), by = B]
> out = merge(DT, D2)

The groupby mean takes 1.3s, which is a bit faster than Pandas. The join takes 5.8s, which is a bit over 2x faster than Pandas.

I have no idea if these timings scale to a more complicated example. It would be useful to have benchmarks on a real dataset.
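For side-by-side comparison, roughly the same two operations in DataFrames.jl would look like the sketch below. The `by`/`join` signatures match the DataFrames API of the time and may differ in newer releases, and the `meanA` column name is my own choice:

```julia
using DataFrames

# same setup as the R example: 1e8 rows, 10 groups
N = 100_000_000
df = DataFrame(A = rand(N), B = rand(1:10, N))

# groupwise mean, analogous to DT[, mean(A), by = B]
d2 = by(df, :B, d -> DataFrame(meanA = mean(d[:A])))

# join the group means back onto the full table, analogous to merge(DT, D2)
out = join(df, d2, on = :B)
```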


Are there any underlying language constraints or tradeoffs that mean Julia shouldn't be able to match the speed of data.table? (Disclaimer: this question springs from deep ignorance of the underlying particulars.)

Not that I know of. Matthew Dowle (and the other data.table contributors) is really good at identifying fast algorithms and writing fast code. Their low-level code is in C, but I don't see why Julia can't hit the same speeds.


Let me stress again how much I appreciate you doing this!

How do you measure performance in R? I would be interested to also see the dplyr numbers, but I'm not sure how to measure performance properly in R.

In general, maybe we should put up a repo with a bunch of scripts, each a performance test like the one described here, that runs for, say, Pandas, R data.table, R dplyr, DataTables.jl, DataFrames.jl, and Query.jl? It should be relatively easy to do that with PyCall and RCall, right?

I can certainly say that such a repo would help me focus my performance work on Query.jl.
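To make the idea concrete, here is a hedged sketch of what one such cross-language script could look like. The harness is hypothetical; it assumes RCall, PyCall, and BenchmarkTools are installed, and on older PyCall versions the property syntax would be `pd[:DataFrame]` instead of `pd.DataFrame`:

```julia
using RCall, PyCall, BenchmarkTools

n = 10^6  # smaller n for a quick smoke test

# time the groupby in R's data.table via RCall
R"library(data.table)"
R"DT <- data.table(A = runif($n), B = sample(1:10, $n, replace = TRUE))"
@btime R"DT[, mean(A), by = B]"

# time the same groupby in pandas via PyCall
pd = pyimport("pandas")
pdf = pd.DataFrame(Dict("A" => rand(n), "B" => rand(1:10, n)))
@btime $pdf.groupby("B")["A"].mean()
```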


I used

> system.time(DT[,mean(A), by = B])
   user  system elapsed 
  1.176   0.080   1.256 

I am not very familiar with R, but this seems to work, and repeated timings give approximately the same number.

Has anyone actually got the DataTables example to work when the columns are NullableArrays? I can get by(dt, :B, d -> mean(d[:A])) to work when the columns are normal arrays, but with NullableArrays I always just get ERROR: MethodError: no method matching /(::Nullable{Float64}, ::Int64).

Yes, you need to do mean(dropnull(d[:A])).
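The error makes sense if you trace it: mean ends up dividing a Nullable{Float64} sum by an Int length, and /(::Nullable, ::Int) is not defined. A small illustration (assumes the NullableArrays package is installed):

```julia
using NullableArrays

# a NullableArray built from values plus a "which entries are null" mask
x = NullableArray([1.0, 2.0, 3.0, 0.0], [false, false, false, true])

# mean(x) would fail with the same MethodError, because sum(x) is a
# Nullable{Float64} and dividing it by length(x)::Int has no method.

# dropnull returns a plain Vector{Float64} of the non-null entries
mean(dropnull(x))  # 2.0
```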


Alright, I did some performance work on the Query.jl side of things for this case. I found lots of silly little things that caused performance problems; fixing them should help performance in general :slight_smile: Here is what I get right now in terms of timing for the example from above, using normal Arrays, i.e. not NullableArrays.

  • DataTables.jl (using PR #76): 5.4s
  • Query.jl: 5.2s
  • Pandas.jl: 2.7s

The work I did is not yet merged, I got these numbers by using #124 and #40.


Just an update to note that with another pull request, I was able to reduce the time needed by DataTables for this operation to 20s, so reasonably close to Pandas, and about as fast as DataFrames (my machine appears to be slightly slower than yours).


Hi Expanding Man. Hello from a former HEP person also.
"idea that all data should be only ints and floats" - Well, at some remove all data is integer - binary integers anyway. I shall get my own coat, and yes, I know the door is over there.
Seriously though - don't you need datestamps on events? I suppose that you only really need run number / event number pairs to uniquely identify an event.

Of course you will always be stuck with having to deal with dates and strings in some way, but you are still free to choose how you represent this data in memory or on disk. For example, in Julia, dates and times are represented by integers (try DateTime().instant.periods.value), and indeed, at the end of the day everything is an integer, but the usual approach is to keep them wrapped in DateTime objects when they sit in a dataframe. This is the approach I'm starting to question. Perhaps instead of storing these in a Vector{DateTime} we should store them as a Vector{Int} with metadata that tells us to convert to DateTime only when appropriate.

This might sound silly (and I'm certainly not committed to this idea, I've just been tossing it around), but when one considers that ultimately all the data has to go into some sort of analysis that only understands integers and floats anyway, one wonders whether DateTime is appropriate as a wrapper for stored data or whether it is merely an interface for presenting data to humans. Similar arguments can be made for strings since these almost always represent objects that can be mapped to the integers.
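To make the idea concrete, here is a hedged sketch (all names invented for illustration) of an Int-backed date column that only materializes DateTime objects at the boundary where a human reads them:

```julia
using Dates  # Base.Dates on older Julia versions

# Hypothetical Int-backed date column: raw millisecond instants plus
# the knowledge of how to interpret them.
struct EpochMillis
    values::Vector{Int64}
end

# store a DateTime as its raw integer instant
encode(dt::DateTime) = Dates.value(dt)

# reconstruct the DateTime only when presenting to a human
decode(col::EpochMillis, i::Int) = DateTime(Dates.UTM(col.values[i]))

col = EpochMillis(encode.(DateTime(2017, 6, 1) .+ Dates.Day.(0:2)))
decode(col, 2)  # 2017-06-02T00:00:00
```

Analysis code would then operate directly on col.values as ordinary Int64s, which is exactly the "only ints and floats" representation discussed above.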

As an example, you're talking about datestamps for events: how would we have dealt with time if we encountered it in HEP? It's a float (we may have to use integers here because of precision issues). There'd never be any question about it because everyone knows that time is represented by real numbers. Perhaps it would behoove us not to forget this fact even when someone tells us a date.