 # JuliaDB: efficiently rank subgroups

Hi, I would be grateful for some advice on how to create a column containing the ordinal rank of a score grouped by another column. For example:

``````using JuliaDB
using StatsBase

Nx, Ny = 4, 3
x = repeat(1:Nx,inner=Ny) # location
y = repeat(1:Ny,outer=Nx) # date
z = rand(Nx*Ny) # score
t = table((x=x,y=y,z=z))

trk = setcol(t,:rk, (:z,:x) => row -> begin
t2 = filter(i->i.x == row.x,t)
X = select(t2,:z)
idx = findfirst(x->x==row.z,X)
rk = ordinalrank(X,rev=true)[idx]
rk
end
)
``````

I would like to perform something like the above example on 1-5 million row tables, however it does not scale so well. My initial queries are:

It looks like you are looking for a `groupby`:

``````# compute the rank by x and y
trk = groupby(t, (:x, :y), select = :z, flatten = true) do v
Columns(rk = ordinalrank(v, rev=true), z=v)
end
``````

You need to wrap things that you return in a `Columns` because you need to return something that iterates rows to be able to flatten correctly.

Note that in general, if the first two columns represent location and date, you may want to use them as primary keys, so `t1 = reindex(t, (:x, :y))` and then you work on it. Grouping there defaults on the primary keys, so you would just need:

``````trk = groupby(t1, select = :z, flatten = true) do v
Columns(rk = ordinalrank(v, rev=true), z=v)
end
``````

To parallelize, yes, you would just start julia with say 4 processes, then do:

``````t1 = reindex(t, (:x, :y))
t_dist = distribute(t1, 4) # or whichever number of chunks works best
trk = groupby(t_dist, select = :z, flatten = true) do v
Columns(rk = ordinalrank(v, rev=true), z=v)
end
``````

but you probably don’t need to parallelize with 5 million rows.

2 Likes

Thank you very much! That is exactly the kind of advice I was seeking.
In case anyone is interested in this example, the ranking I was attempting was on `:x` only:

``````trk = groupby(t, :x, select = :z, flatten = true) do v
Columns(rk = ordinalrank(v, rev=true), z=v)
end
`````` 1 Like