That's good to know. The timing for DataFrames is 18s.
Another data point:
julia> using IndexedTables, BenchmarkTools

julia> N = 100_000_000
100000000
julia> A = rand(N);
julia> B = rand(1:100, N);
julia> @time t = IndexedTable(Columns(B=B), A)
4.206044 seconds (129 allocations: 2.235 GiB, 4.40% gc time)
B   │
────┼──────────
1   │ 0.683076
1   │ 0.846248
1   │ 0.825938
1   │ 0.641028
1   │ 0.980836
1   │ 0.79498
1   │ 0.608793
1   │ 0.216588
1   │ 0.969373
1   │ 0.710852
    ⋮
100 │ 0.0101481
100 │ 0.528986
100 │ 0.858588
100 │ 0.924309
100 │ 0.713443
100 │ 0.74248
100 │ 0.531245
100 │ 0.502521
100 │ 0.781031
100 │ 0.63794
julia> @btime aggregate_vec(mean, t)
246.059 ms (225 allocations: 762.95 MiB)
B │
──┼─────────
1 │ 0.499787
2 │ 0.500066
3 │ 0.500087
4 │ 0.499988
5 │ 0.500028
⋮
IndexedTables is fast if you recognize which columns of your data can be indexed. You can use JuliaDB.jl to help with loading data and running these operations in parallel; the parallel version has the same API as IndexedTables.
Is that the same operation as the groupwise mean and join above? Just trying to wrap my head around the interface.
Yes, aggregate_vec will combine rows with the same indices using a Vector → scalar function like mean.
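As a toy illustration of what that means (using the same constructors as the transcript above; the small table and its values are just made up for the example):

julia> small = IndexedTable(Columns(B = [1, 1, 2, 2, 2]), [1.0, 3.0, 10.0, 20.0, 30.0]);

julia> aggregate_vec(mean, small)   # B = 1 -> mean([1.0, 3.0]) = 2.0; B = 2 -> 20.0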
IndexedTable's join/broadcast operations require that the tables have unique indices. Hence we can achieve the same effect by creating a table with a rowid column which is just 1:n …
julia> @time t = IndexedTable(Columns(rowid=[1:N;], B=B), A)
0.523119 seconds (154 allocations: 762.947 MiB, 17.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.861924
2          11 │ 0.745541
3          20 │ 0.326802
4          6  │ 0.480064
5          39 │ 0.391775
6          74 │ 0.113307
7          88 │ 0.928499
8          16 │ 0.468637
9          91 │ 0.651933
10         94 │ 0.550056
           ⋮
99999991   56 │ 0.501691
99999992   79 │ 0.0800012
99999993   70 │ 0.120557
99999994   42 │ 0.169668
99999995   4  │ 0.820699
99999996   84 │ 0.00919058
99999997   89 │ 0.454209
99999998   81 │ 0.395971
99999999   21 │ 0.0274831
100000000  20 │ 0.688663
julia> @time means = reducedim_vec(mean, t, :rowid)
4.985885 seconds (388 allocations: 2.980 GiB, 8.86% gc time)
B   │
────┼─────────
1   │ 0.499943
2   │ 0.499725
3   │ 0.499708
4   │ 0.499893
5   │ 0.499893
6   │ 0.500239
7   │ 0.500111
8   │ 0.499465
9   │ 0.500193
10  │ 0.500168
    ⋮
91  │ 0.500453
92  │ 0.500233
93  │ 0.50035
94  │ 0.499753
95  │ 0.500005
96  │ 0.499817
97  │ 0.499691
98  │ 0.499374
99  │ 0.50056
100 │ 0.499549
julia> @time broadcast(-, t, means)
22.724496 seconds (253 allocations: 3.748 GiB, 1.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.361472
2          11 │ 0.245748
3          20 │ -0.17298
4          6  │ -0.0201752
5          39 │ -0.107821
6          74 │ -0.386973
7          88 │ 0.428944
8          16 │ -0.0314757
9          91 │ 0.151481
10         94 │ 0.0503028
           ⋮
99999991   56 │ 0.00210082
99999992   79 │ -0.419816
99999993   70 │ -0.379443
99999994   42 │ -0.330163
99999995   4  │ 0.320806
99999996   84 │ -0.490412
99999997   89 │ -0.0460049
99999998   81 │ -0.103974
99999999   21 │ -0.472331
100000000  20 │ 0.188881
Currently there's no way to add a new column in broadcast, but that's something we should definitely add.
With some work on DataTables, I was able to cut down the time required to run @aaowens's groupby example from above by half. This is not completely ready yet, but see this pull request. This illustrates that it shouldn't be too hard to reach a performance level reasonably close to Pandas.
I ran the groupby and join example in R's data.table.
For the code
> library(data.table)
> N = 100000000
> A = runif(N)
> B = sample(1:10, N, replace = T)
> DT = data.table(A = A, B = B)
> D2 = DT[,mean(A), by = B]
> out = merge(DT, D2)
The groupby mean takes 1.3s, which is a bit faster than Pandas. The join takes 5.8s, which is a bit over 2x faster than Pandas.
I have no idea if these timings scale to a more complicated example. It would be useful to have benchmarks on a real dataset.
Are there any underlying language constraints or tradeoffs that mean Julia shouldn't be able to match the speed of data.table? (Disclaimer: this question springs from deep ignorance about the underlying particulars.)
Not that I know of. Matthew Dowle and the other data.table contributors are really good at identifying fast algorithms and writing fast code. Their low-level code is in C, but I don't see why Julia can't hit the same speeds.
Let me stress again how much I appreciate you doing this!
How do you measure performance in R? I would also be interested to see the dplyr numbers, but I'm not sure how to measure it properly.
In general, maybe we should put up a repo with a bunch of scripts, where each script is a performance test like the one described here and runs it against, say, Pandas, R data.table, R dplyr, DataTables.jl, DataFrames.jl, and Query.jl? It should be relatively easy to do that with PyCall and RCall, right?
I can certainly say that such a repo would help me focus my performance work for Query.jl.
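A minimal sketch of what one such script could look like, driving Pandas through PyCall and data.table through RCall from the same Julia process (the dot-call syntax and @rput assume reasonably recent PyCall/RCall versions, and that pandas and data.table are installed):

using PyCall, RCall

N = 10_000_000                      # smaller N for a quick smoke test
A = rand(N)
B = rand(1:100, N)

# Pandas via PyCall
pd = pyimport("pandas")
df = pd.DataFrame(Dict("A" => A, "B" => B))
t_pandas = @elapsed df.groupby("B").mean()

# data.table via RCall
@rput A B
R"library(data.table); DT <- data.table(A = A, B = B)"
t_datatable = @elapsed R"DT[, mean(A), by = B]"

println("pandas: $t_pandas s, data.table: $t_datatable s")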
I used
> system.time(DT[, mean(A), by = B])
   user  system elapsed
  1.176   0.080   1.256
I am not very familiar with R, but this seems to work, and repeated timings give approximately the same number.
Has anyone actually got the DataTable example to work when the columns are NullableArrays? I can get by(dt, :B, d -> mean(d[:A])) to work when the columns are normal arrays, but with NullableArrays I always just get ERROR: MethodError: no method matching /(::Nullable{Float64}, ::Int64).
Yes, you need to do mean(dropnull(d[:A])).
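For reference, putting the two snippets together into something runnable (the table construction here is just a placeholder, and this assumes the DataTables.jl and NullableArrays.jl versions of that era):

using DataTables, NullableArrays

dt = DataTable(A = NullableArray(rand(1_000)),
               B = NullableArray(rand(1:10, 1_000)))

# dropnull strips the Nullable wrappers (and any nulls), so mean sees a plain Vector{Float64}
by(dt, :B, d -> mean(dropnull(d[:A])))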
Alright, I did some performance work on the Query.jl side of things for this case. I found lots and lots of silly little things that caused performance problems; fixing them should in general help performance a lot. Here is what I get right now in terms of timing for the example from above, using normal Arrays, i.e. not using NullableArrays.
- DataTables.jl (using PR #76): 5.4s
- Query.jl: 5.2s
- Pandas.jl: 2.7s
The work I did is not yet merged; I got these numbers by using #124 and #40.
Just an update to note that with another pull request, I was able to reduce the time needed by DataTables for this operation to 20s, so reasonably close to Pandas, and about as fast as DataFrames (my machine appears to be slightly slower than yours).
Hi Expanding Man. Hello from a former HEP person also.
"idea that all data should be only ints and floats" - Well, at some remove all data is integer - binary integers anyway. I shall get my own coat, and yes, I know the door is over there.
Seriously though - don't you need datestamps on events? I suppose that you only really need run number / event number pairs to uniquely identify an event.
Of course you will always be stuck with having to deal with dates and strings in some way, but you are still free to choose how you represent this data in memory or on disk. For example, in Julia, dates and times are represented by integers (do DateTime().instant.periods.value) and indeed, at the end of the day everything is an integer, but the usual approach is to keep them wrapped in DateTime objects when they sit in a dataframe. This is the approach I'm starting to question. Perhaps instead of storing these in a Vector{DateTime}, we should store them as a Vector{Int} with metadata that tells it to convert to DateTime only when appropriate.
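To make that concrete, here is a small illustrative round trip (the variable names are mine; Dates.UTM is the unexported helper that wraps a raw millisecond count back into an instant):

using Dates           # Base.Dates on Julia 0.6

t = DateTime(2017, 6, 1, 12, 30)
v = t.instant.periods.value          # the underlying Int64 millisecond count
t2 = DateTime(Dates.UTM(v))          # wrap the raw Int back into a DateTime
@assert t == t2

# The idea in question: store the raw Ints, re-wrap only for presentation.
raw   = [DateTime(2017, 6, d).instant.periods.value for d in 1:5]   # Vector{Int64}
dates = [DateTime(Dates.UTM(v)) for v in raw]                       # back to DateTimes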
This might sound silly (and I'm certainly not committed to this idea, I've just been tossing it around), but when one considers that ultimately all the data has to go into some sort of analysis that only understands integers and floats anyway, one wonders whether DateTime is appropriate as a wrapper for stored data or whether it is merely an interface for presenting data to humans. Similar arguments can be made for strings, since these almost always represent objects that can be mapped to the integers.
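A toy sketch of the string case (plain Julia, names and values are mine): keep the distinct strings as a pool in metadata and store only integer codes in the column.

strs  = ["up", "down", "up", "strange", "up"]
pool  = unique(strs)                               # String metadata
code  = Dict(s => i for (i, s) in enumerate(pool))
codes = [code[s] for s in strs]                    # Vector{Int} actually stored
decoded = pool[codes]                              # strings only when presenting
@assert decoded == strs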
As an example, you're talking about datestamps for events: how would we have dealt with time if we encountered it in HEP? It's a float (we may have to use integers here because of precision issues). There'd never be any question about it, because everyone knows that time is represented by real numbers. Perhaps it would behoove us not to forget this fact even when someone tells us a date.