How is the data ecosystem right now for large datasets?

That's good to know. The timing for DataFrames is 18s.

Another data point:

julia> N = 100_000_000
100000000

julia> A = rand(N);

julia> B = rand(1:100, N);

julia> @time t = IndexedTable(Columns(B=B), A)
  4.206044 seconds (129 allocations: 2.235 GiB, 4.40% gc time)
B   │
────┼──────────
1   │ 0.683076
1   │ 0.846248
1   │ 0.825938
1   │ 0.641028
1   │ 0.980836
1   │ 0.79498
1   │ 0.608793
1   │ 0.216588
1   │ 0.969373
1   │ 0.710852
    ⋮
100 │ 0.0101481
100 │ 0.528986
100 │ 0.858588
100 │ 0.924309
100 │ 0.713443
100 │ 0.74248
100 │ 0.531245
100 │ 0.502521
100 │ 0.781031
100 │ 0.63794

julia> @btime aggregate_vec(mean, t)
  246.059 ms (225 allocations: 762.95 MiB)
B   │
────┼─────────
1   │ 0.499787
2   │ 0.500066
3   │ 0.500087
4   │ 0.499988
5   │ 0.500028
    ⋮

IndexedTables is fast if you recognize which columns of your data can be indexed. You can use JuliaDB.jl to assist you with loading data and doing these things in parallel. The parallel version has the same API as IndexedTables.


Is that the same operation as the groupwise mean and join above? Just trying to wrap my head around the interface.

Yes, aggregate_vec will combine rows with the same indices using a Vector → scalar function like mean.
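For intuition, here is what that aggregation computes, sketched with plain arrays and a Dict. The function name is hypothetical and this is not how IndexedTables implements it (it works over the sorted index instead), just the semantics:

```julia
# Hedged sketch of aggregate_vec(mean, t): group values of A by the
# index column B and reduce each group with a Vector -> scalar function.
function grouped_mean(B::Vector{Int}, A::Vector{Float64})
    sums   = Dict{Int,Float64}()
    counts = Dict{Int,Int}()
    for (b, a) in zip(B, A)
        sums[b]   = get(sums, b, 0.0) + a
        counts[b] = get(counts, b, 0) + 1
    end
    # one mean per distinct index value
    Dict(b => sums[b] / counts[b] for b in keys(sums))
end

grouped_mean([1, 1, 2], [0.2, 0.4, 0.9])  # Dict mapping 1 and 2 to their means
```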

IndexedTable's join/broadcast operations require that the tables have unique indices. Hence we can achieve the same effect by creating a table with a rowid column which is just 1:n…


julia> @time t = IndexedTable(Columns(rowid=[1:N;], B=B), A)
  0.523119 seconds (154 allocations: 762.947 MiB, 17.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.861924
2          11 │ 0.745541
3          20 │ 0.326802
4          6  │ 0.480064
5          39 │ 0.391775
6          74 │ 0.113307
7          88 │ 0.928499
8          16 │ 0.468637
9          91 │ 0.651933
10         94 │ 0.550056
              ⋮
99999991   56 │ 0.501691
99999992   79 │ 0.0800012
99999993   70 │ 0.120557
99999994   42 │ 0.169668
99999995   4  │ 0.820699
99999996   84 │ 0.00919058
99999997   89 │ 0.454209
99999998   81 │ 0.395971
99999999   21 │ 0.0274831
100000000  20 │ 0.688663

julia> @time means = reducedim_vec(mean, t, :rowid)
  4.985885 seconds (388 allocations: 2.980 GiB, 8.86% gc time)
B   │
────┼─────────
1   │ 0.499943
2   │ 0.499725
3   │ 0.499708
4   │ 0.499893
5   │ 0.499893
6   │ 0.500239
7   │ 0.500111
8   │ 0.499465
9   │ 0.500193
10  │ 0.500168
    ⋮
91  │ 0.500453
92  │ 0.500233
93  │ 0.50035
94  │ 0.499753
95  │ 0.500005
96  │ 0.499817
97  │ 0.499691
98  │ 0.499374
99  │ 0.50056
100 │ 0.499549

julia> @time broadcast(-, t, means)
 22.724496 seconds (253 allocations: 3.748 GiB, 1.36% gc time)
rowid      B  │
──────────────┼───────────
1          91 │ 0.361472
2          11 │ 0.245748
3          20 │ -0.17298
4          6  │ -0.0201752
5          39 │ -0.107821
6          74 │ -0.386973
7          88 │ 0.428944
8          16 │ -0.0314757
9          91 │ 0.151481
10         94 │ 0.0503028
              ⋮
99999991   56 │ 0.00210082
99999992   79 │ -0.419816
99999993   70 │ -0.379443
99999994   42 │ -0.330163
99999995   4  │ 0.320806
99999996   84 │ -0.490412
99999997   89 │ -0.0460049
99999998   81 │ -0.103974
99999999   21 │ -0.472331
100000000  20 │ 0.188881

Currently there's no way to add a new column in broadcast – but that's something we should definitely add.


With some work on DataTables, I was able to cut the time required to run @aaowens's groupby example from above in half. This is not completely ready yet, but see this pull request. It illustrates that it shouldn't be too hard to reach a performance level reasonably close to Pandas.

I ran the groupby and join example in R's data.table.

For the code

> N = 100000000
> A = runif(N)
> B = sample(1:10, N, replace = T)
> DT = data.table(A = A, B = B)
> D2 = DT[,mean(A), by = B]
> out = merge(DT, D2)

The groupby mean takes 1.3s, which is a bit faster than Pandas. The join takes 5.8s, which is a bit over 2x faster than Pandas.

I have no idea if these timings scale to a more complicated example. It would be useful to have benchmarks on a real dataset.
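For side-by-side comparison, roughly the same two operations in DataFrames.jl would look like the sketch below. The `by`/`join` signatures match the DataFrames API of the time and may differ in newer releases, and the `meanA` column name is my own choice:

```julia
using DataFrames

# same setup as the R example: 1e8 rows, 10 groups
N = 100_000_000
df = DataFrame(A = rand(N), B = rand(1:10, N))

# groupwise mean, analogous to DT[, mean(A), by = B]
d2 = by(df, :B, d -> DataFrame(meanA = mean(d[:A])))

# join the group means back onto the full table, analogous to merge(DT, D2)
out = join(df, d2, on = :B)
```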


Are there any underlying language constraints or tradeoffs that mean Julia shouldn't be able to match the speed of data.table? (Disclaimer: this question springs from deep ignorance of the underlying particulars.)

Not that I know of. Matthew Dowle (and the other data.table contributors) is really good at identifying fast algorithms and writing fast code. Their low-level code is in C, but I don't see why Julia can't hit the same speeds.


Let me stress again how much I appreciate you doing this!

How do you measure performance in R? I would be interested to also see the dplyr numbers, but I'm not sure how to measure performance properly in R.

In general, maybe we should put up a repo with a bunch of scripts, each a performance test like the one described here, that runs for, say, Pandas, R data.table, R dplyr, DataTables.jl, DataFrames.jl, and Query.jl? It should be relatively easy to do that with PyCall and RCall, right?

I can certainly say that such a repo would help me focus my performance work on Query.jl.
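To make the idea concrete, here is a hedged sketch of what one such cross-language script could look like. The harness is hypothetical; it assumes RCall, PyCall, and BenchmarkTools are installed, and on older PyCall versions the property syntax would be `pd[:DataFrame]` instead of `pd.DataFrame`:

```julia
using RCall, PyCall, BenchmarkTools

n = 10^6  # smaller n for a quick smoke test

# time the groupby in R's data.table via RCall
R"library(data.table)"
R"DT <- data.table(A = runif($n), B = sample(1:10, $n, replace = TRUE))"
@btime R"DT[, mean(A), by = B]"

# time the same groupby in pandas via PyCall
pd = pyimport("pandas")
pdf = pd.DataFrame(Dict("A" => rand(n), "B" => rand(1:10, n)))
@btime $pdf.groupby("B")["A"].mean()
```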


I used

> system.time(DT[,mean(A), by = B])
   user  system elapsed 
  1.176   0.080   1.256 

I am not very familiar with R, but this seems to work, and repeated timings give approximately the same number.

Has anyone actually got the DataTables example to work when the columns are NullableArrays? I can get by(dt, :B, d -> mean(d[:A])) to work when the columns are normal arrays, but with NullableArrays I always just get ERROR: MethodError: no method matching /(::Nullable{Float64}, ::Int64).

Yes, you need to do mean(dropnull(d[:A])).
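The error makes sense if you trace it: mean ends up dividing a Nullable{Float64} sum by an Int length, and /(::Nullable, ::Int) is not defined. A small illustration (assumes the NullableArrays package is installed):

```julia
using NullableArrays

# a NullableArray built from values plus a "which entries are null" mask
x = NullableArray([1.0, 2.0, 3.0, 0.0], [false, false, false, true])

# mean(x) would fail with the same MethodError, because sum(x) is a
# Nullable{Float64} and dividing it by length(x)::Int has no method.

# dropnull returns a plain Vector{Float64} of the non-null entries
mean(dropnull(x))  # 2.0
```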


Alright, I did some performance work on the Query.jl side of things for this case. I found lots of silly little things that caused performance problems; fixing them should help performance in general :slight_smile: Here is what I get right now in terms of timing for the example from above, using normal Arrays, i.e. not NullableArrays.

  • DataTables.jl (using PR #76): 5.4s
  • Query.jl: 5.2s
  • Pandas.jl: 2.7s

The work I did is not yet merged, I got these numbers by using #124 and #40.


Just an update to note that with another pull request, I was able to reduce the time needed by DataTables for this operation to 20s, so reasonably close to Pandas, and about as fast as DataFrames (my machine appears to be slightly slower than yours).


Hi Expanding Man. Hello from a former HEP person also.
"idea that all data should be only ints and floats" - Well, at some remove all data is integer - binary integers anyway. I shall get my own coat, and yes, I know the door is over there.
Seriously though - don't you need datestamps on events? I suppose that you only really need run number / event number pairs to uniquely identify an event.

Of course you will always be stuck with having to deal with dates and strings in some way, but you are still free to choose how you represent this data in memory or on disk. For example, in Julia, dates and times are represented by integers (try DateTime().instant.periods.value), and indeed, at the end of the day everything is an integer, but the usual approach is to keep them wrapped in DateTime objects when they sit in a dataframe. This is the approach I'm starting to question. Perhaps instead of storing these in a Vector{DateTime} we should store them as a Vector{Int} with metadata that tells us to convert to DateTime only when appropriate.

This might sound silly (and I'm certainly not committed to this idea, I've just been tossing it around), but when one considers that ultimately all the data has to go into some sort of analysis that only understands integers and floats anyway, one wonders whether DateTime is appropriate as a wrapper for stored data or whether it is merely an interface for presenting data to humans. Similar arguments can be made for strings since these almost always represent objects that can be mapped to the integers.
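To make the idea concrete, here is a hedged sketch (all names invented for illustration) of an Int-backed date column that only materializes DateTime objects at the boundary where a human reads them:

```julia
using Dates  # Base.Dates on older Julia versions

# Hypothetical Int-backed date column: raw millisecond instants plus
# the knowledge of how to interpret them.
struct EpochMillis
    values::Vector{Int64}
end

# store a DateTime as its raw integer instant
encode(dt::DateTime) = Dates.value(dt)

# reconstruct the DateTime only when presenting to a human
decode(col::EpochMillis, i::Int) = DateTime(Dates.UTM(col.values[i]))

col = EpochMillis(encode.(DateTime(2017, 6, 1) .+ Dates.Day.(0:2)))
decode(col, 2)  # 2017-06-02T00:00:00
```

Analysis code would then operate directly on col.values as ordinary Int64s, which is exactly the "only ints and floats" representation discussed above.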

As an example, you're talking about datestamps for events: how would we have dealt with time if we encountered it in HEP? It's a float (we may have to use integers here because of precision issues). There'd never be any question about it because everyone knows that time is represented by real numbers. Perhaps it would behoove us not to forget this fact even when someone tells us a date.