How to use aggregate_vec to create two output columnd JuliaDB.jl & IndexedTables?


#1

Please see my MVE for creating the data needed for this question

using Distributions, PooledArrays

N=Int64(2e7); K=100;

pool = [@sprintf "id%03d" k for k in 1:K]

function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end

using JuliaDB
@time DT = IndexedTable(
  Columns([1:N;]),
  Columns(
    id3 = randstrarray(pool, N),
    v1 =  rand(1:5, N),                          # int in range [1,5]
    v3 =  rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
    ));

I am trying to replicate some data.table benchmarks, in particular

system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] )
system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )

the first one creates a sum of v1 and mean of v3 by id3. Using JuliaDB/IndexedTables it would be

dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1)
dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)

but I can’t figure out how to assign a name to the output (mean) column so it’s a bit tricky to merge them together given that merging the dt1 & dt2 wihtout named columns will get rid of one of the columns, e.g.

dt3 = merge(dt1, dt2) # missing one column

Also I think the below syntax for computing multiple output in one go is a good idea, hopefully I will learn enough to do a PR

dt2 = aggregate_vec([mean, sum], DT, by =(:id3,), with =(:v3, :v3))
dt2 = aggregate_vec((mean, sum), DT, by =(:id3,), with =(:v3, :v3))

I have even less of a clue of how to do this

system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )

which is computing the mean of the 7th -9th column of each subgroup dataframe.

Any help will be appreciated I am trying to complete a comparison of Julia vs R for data manipulations here


#2

You simply need to reduce with function that returns a Tuple (or a Named Tuple if you want names) . For example:

dt1 = aggregate_vec(v -> (sum(v), mean(v)), DT, by =(:id3,), with =:v1)

or

using NamedTuples
dt1 = aggregate_vec(v -> @NT(sum = sum(v), mean = mean(v)), DT, by =(:id3,), with =:v1)

There should be examples in this section of the docs.

EDIT: sorry, I misread your question, if you want mean and sum of different columns it should be like this:

dt1 = aggregate_vec(v -> @NT(sum =sum(t->t.v1, v), mean = mean(t->t.v3, v)), DT, by =(:id3,))


#3
(
@elapsed aggregate_vec(
  v -> @NT(sum = sum(t->t.v1, v), mean=mean(t->t.v3, v)), DT, by =(:id3,))
)

((@elapsed dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1))+
(@elapsed dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)))

I tried the two ways above and the timings of the first one i ss 240 seconds vs 35.9 in the second! I am running N = 2e9/8.

To present the best possible solution for Julia, I would need to find a performant solution.


#4

I see, if you want to optimize it as much as possible, then maybe it’d be better to extract the columns using the column function (and maybe it also helps to specify we are only using :v1 and :v3:

aggregate_vec(v -> @NT(sum = sum(column(v, :v1)), mean = mean(column(v, :v3))), DT, by =(:id3,), with = (:v1, :v3))

Though I agree that this is a reasonably common operation, so maybe the syntax could maybe be simplified a bit in IndexedTables, not sure how exactly (maybe some macro to infer that we are only using columns :v1 and :v3?)


#5

It would be nice to find a common interface with DataFrames, since these are really the same operation applies to two very similar structures.


#6

This seems like very good timing, as both packages are revisiting their API:

https://github.com/JuliaComputing/JuliaDB.jl/issues/80

https://github.com/JuliaData/DataFrames.jl/issues/1256