Please see my MVE for creating the data needed for this question
using Distributions, PooledArrays N=Int64(2e7); K=100; pool = [@sprintf "id%03d" k for k in 1:K] function randstrarray(pool, N) PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool) end using JuliaDB @time DT = IndexedTable( Columns([1:N;]), Columns( id3 = randstrarray(pool, N), v1 = rand(1:5, N), # int in range [1,5] v3 = rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749 ));
I am trying to replicate some data.table benchmarks, in particular
system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] ) system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )
the first one creates a sum of v1 and mean of v3 by id3. Using JuliaDB/IndexedTables it would be
dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1) dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)
but I can’t figure out how to assign a name to the output (mean) column so it’s a bit tricky to merge them together given that merging the dt1 & dt2 wihtout named columns will get rid of one of the columns, e.g.
dt3 = merge(dt1, dt2) # missing one column
Also I think the below syntax for computing multiple output in one go is a good idea, hopefully I will learn enough to do a PR
dt2 = aggregate_vec([mean, sum], DT, by =(:id3,), with =(:v3, :v3)) dt2 = aggregate_vec((mean, sum), DT, by =(:id3,), with =(:v3, :v3))
I have even less of a clue of how to do this
system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )
which is computing the mean of the 7th -9th column of each subgroup dataframe.
Any help will be appreciated I am trying to complete a comparison of Julia vs R for data manipulations here