How to use aggregate_vec to create two output columnd JuliaDB.jl & IndexedTables?

xiaodai · October 17, 2017, 10:14am

Please see my MVE for creating the data needed for this question

using Distributions, PooledArrays

N=Int64(2e7); K=100;

pool = [@sprintf "id%03d" k for k in 1:K]

function randstrarray(pool, N)
    PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end

using JuliaDB
@time DT = IndexedTable(
  Columns([1:N;]),
  Columns(
    id3 = randstrarray(pool, N),
    v1 =  rand(1:5, N),                          # int in range [1,5]
    v3 =  rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
    ));

I am trying to replicate some data.table benchmarks, in particular

system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] )
system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )

the first one creates a sum of v1 and mean of v3 by id3. Using JuliaDB/IndexedTables it would be

dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1)
dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)

but I can’t figure out how to assign a name to the output (mean) column so it’s a bit tricky to merge them together given that merging the dt1 & dt2 wihtout named columns will get rid of one of the columns, e.g.

dt3 = merge(dt1, dt2) # missing one column

Also I think the below syntax for computing multiple output in one go is a good idea, hopefully I will learn enough to do a PR

dt2 = aggregate_vec([mean, sum], DT, by =(:id3,), with =(:v3, :v3))
dt2 = aggregate_vec((mean, sum), DT, by =(:id3,), with =(:v3, :v3))

I have even less of a clue of how to do this

system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )

which is computing the mean of the 7th -9th column of each subgroup dataframe.

Any help will be appreciated I am trying to complete a comparison of Julia vs R for data manipulations here

https://github.com/xiaodaigh/data_manipulation_benchmarks

piever · October 17, 2017, 10:45am

You simply need to reduce with function that returns a Tuple (or a Named Tuple if you want names) . For example:

dt1 = aggregate_vec(v -> (sum(v), mean(v)), DT, by =(:id3,), with =:v1)

or

using NamedTuples
dt1 = aggregate_vec(v -> @NT(sum = sum(v), mean = mean(v)), DT, by =(:id3,), with =:v1)

There should be examples in this section of the docs.

EDIT: sorry, I misread your question, if you want mean and sum of different columns it should be like this:

dt1 = aggregate_vec(v -> @NT(sum =sum(t->t.v1, v), mean = mean(t->t.v3, v)), DT, by =(:id3,))

xiaodai · October 17, 2017, 11:33am

(
@elapsed aggregate_vec(
  v -> @NT(sum = sum(t->t.v1, v), mean=mean(t->t.v3, v)), DT, by =(:id3,))
)

((@elapsed dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1))+
(@elapsed dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)))

I tried the two ways above and the timings of the first one i ss 240 seconds vs 35.9 in the second! I am running N = 2e9/8.

To present the best possible solution for Julia, I would need to find a performant solution.

piever · October 17, 2017, 11:51am

I see, if you want to optimize it as much as possible, then maybe it’d be better to extract the columns using the column function (and maybe it also helps to specify we are only using :v1 and :v3:

aggregate_vec(v -> @NT(sum = sum(column(v, :v1)), mean = mean(column(v, :v3))), DT, by =(:id3,), with = (:v1, :v3))

Though I agree that this is a reasonably common operation, so maybe the syntax could maybe be simplified a bit in IndexedTables, not sure how exactly (maybe some macro to infer that we are only using columns :v1 and :v3?)

nalimilan · October 17, 2017, 4:08pm

It would be nice to find a common interface with DataFrames, since these are really the same operation applies to two very similar structures.

piever · October 17, 2017, 4:44pm

This seems like very good timing, as both packages are revisiting their API:

https://github.com/JuliaComputing/JuliaDB.jl/issues/80

https://github.com/JuliaData/DataFrames.jl/issues/1256

Topic		Replies	Views
JuliaDBMeta/JuliaDB - How to select columns dynamically/programatically General Usage juliadb	3	999	July 9, 2019
Pulling array data from IndexedTables or JuliaDB Data	1	1094	May 11, 2017
JuliaDB groupby many columns General Usage juliadb	0	361	April 9, 2021
What's julia's solution to tapply or accumarray? General Usage question , dataframes , arrays , splitapplycombine	15	2074	January 3, 2022
Specify AsTable output column names Data dataframes	2	502	December 8, 2021

How to use aggregate_vec to create two output columnd JuliaDB.jl & IndexedTables?

Related topics