Please see my MVE for creating the data needed for this question
using Distributions, PooledArrays
N=Int64(2e7); K=100;
pool = [@sprintf "id%03d" k for k in 1:K]
function randstrarray(pool, N)
PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end
using JuliaDB
@time DT = IndexedTable(
Columns([1:N;]),
Columns(
id3 = randstrarray(pool, N),
v1 = rand(1:5, N), # int in range [1,5]
v3 = rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
));
I am trying to replicate some data.table benchmarks, in particular
system.time( DT[, list(sum(v1),mean(v3)), keyby=id3] )
system.time( DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] )
the first one creates a sum of v1 and mean of v3 by id3. Using JuliaDB/IndexedTables it would be
dt1 = aggregate_vec(sum, DT, by =(:id3,), with =:v1)
dt2 = aggregate_vec(mean, DT, by =(:id3,), with =:v3)
but I can’t figure out how to assign a name to the output (mean) column so it’s a bit tricky to merge them together given that merging the dt1 & dt2 wihtout named columns will get rid of one of the columns, e.g.
dt3 = merge(dt1, dt2) # missing one column
Also I think the below syntax for computing multiple output in one go is a good idea, hopefully I will learn enough to do a PR
dt2 = aggregate_vec([mean, sum], DT, by =(:id3,), with =(:v3, :v3))
dt2 = aggregate_vec((mean, sum), DT, by =(:id3,), with =(:v3, :v3))
I have even less of a clue of how to do this
system.time( DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] )
which is computing the mean of the 7th -9th column of each subgroup dataframe.
Any help will be appreciated I am trying to complete a comparison of Julia vs R for data manipulations here