JuliaDB: efficiently rank subgroups

jade_mackay · May 17, 2019, 2:26am

Hi, I would be grateful for some advice on how to create a column containing the ordinal rank of a score grouped by another column. For example:

using JuliaDB
using StatsBase

Nx, Ny = 4, 3
x = repeat(1:Nx,inner=Ny) # location
y = repeat(1:Ny,outer=Nx) # date
z = rand(Nx*Ny) # score
t = table((x=x,y=y,z=z))

trk = setcol(t,:rk, (:z,:x) => row -> begin
             t2 = filter(i->i.x == row.x,t)
             X = select(t2,:z)
             idx = findfirst(x->x==row.z,X)
             rk = ordinalrank(X,rev=true)[idx]
             rk
             end
             )

I would like to perform something like the above example on 1-5 million row tables, however it does not scale so well. My initial queries are:

Is there a more sensible implementation?
Should I parallelize it as per Parallel Computing · The Julia Language?
Or is it better to let JuliaDB handle parallelization by itself?
Thanks!

piever · May 19, 2019, 10:54pm

It looks like you are looking for a groupby:

# compute the rank by x and y
trk = groupby(t, (:x, :y), select = :z, flatten = true) do v
    Columns(rk = ordinalrank(v, rev=true), z=v)
end

You need to wrap things that you return in a Columns because you need to return something that iterates rows to be able to flatten correctly.

Note that in general, if the first two columns represent location and date, you may want to use them as primary keys, so t1 = reindex(t, (:x, :y)) and then you work on it. Grouping there defaults on the primary keys, so you would just need:

trk = groupby(t1, select = :z, flatten = true) do v
    Columns(rk = ordinalrank(v, rev=true), z=v)
end

To parallelize, yes, you would just start julia with say 4 processes, then do:

t1 = reindex(t, (:x, :y))
t_dist = distribute(t1, 4) # or whichever number of chunks works best
trk = groupby(t_dist, select = :z, flatten = true) do v
    Columns(rk = ordinalrank(v, rev=true), z=v)
end

but you probably don’t need to parallelize with 5 million rows.

jade_mackay · May 20, 2019, 1:04am

Thank you very much! That is exactly the kind of advice I was seeking.
In case anyone is interested in this example, the ranking I was attempting was on :x only:

trk = groupby(t, :x, select = :z, flatten = true) do v
    Columns(rk = ordinalrank(v, rev=true), z=v)
end

Topic		Replies	Views
Ranking of Dataframe? New to Julia dataframes	8	1044	July 22, 2021
Counting Occurrences in JuliaDB New to Julia juliadb	3	2028	March 11, 2019
JuliaDB groupby many columns General Usage juliadb	0	361	April 9, 2021
Most efficient way to compute number of rows in each group using JuliaDB? Data juliadb	0	580	September 1, 2019
When appllying roll rank function, how much faster Julia can be compared to Python? New to Julia	19	630	March 18, 2022

JuliaDB: efficiently rank subgroups

Related topics