Groupby function?


#1

While porting GAP code, I translated the function CollectBy (that I find very useful) as:

"""
  group items of list l according to the corresponding values in list v

    julia> groupby([31,28,31,30,31,30,31,31,30,31,30,31],
           [:Jan,:Feb,:Mar,:Apr,:May,:Jun,:Jul,:Aug,:Sep,:Oct,:Nov,:Dec])
    Dict{Int64,Array{Symbol,1}} with 3 entries:
      31 => Symbol[:Jan, :Mar, :May, :Jul, :Aug, :Oct, :Dec]
      28 => Symbol[:Feb]
      30 => Symbol[:Apr, :Jun, :Sep, :Nov]

"""  
function groupby(v::AbstractVector,l::AbstractVector)
  res=Dict{eltype(v),Vector{eltype(l)}}()
  for (k,val) in zip(v,l)
    push!(get!(res,k,similar(l,0)),val) 
  end
  res
end

"""
  group items of list l according to the values taken by function f on them

    julia> groupby(iseven,1:10)
    Dict{Bool,Array{Int64,1}} with 2 entries:
      false => [1, 3, 5, 7, 9]
      true  => [2, 4, 6, 8, 10]

Note: in this version l is required to be non-empty, since I do not know how to
access the return type of a function.
"""
function groupby(f,l::AbstractVector)
  res=Dict(f(l[1])=>[l[1]]) # l should be nonempty
  for val in l[2:end] 
    push!(get!(res,f(val),similar(l,0)),val) 
  end
  res
end

I chose the name groupby because I saw in some messages that a function of that name
in some package seems to do something similar. I have several questions:

  • Is there some “standard library” where such a function can be found?
  • If not, does my implementation look good? Is it possible to do better/faster?
  • In particular, is there a good way to solve the problem of accessing a function's return type?

#2

Query.jl also provides an implementation for groupby. It should work with pretty much any data source.


#3

In Query.jl it seems groupby is a macro rather than a function. How would it be used on the examples I gave?


#4

The version using the standalone query commands would be:

1:10 |> @groupby(iseven(_)) |> collect

You need the collect at the end because @groupby returns a lazy iterator.

If you don’t want to use the pipe syntax, you’d do:

collect(@groupby(1:10, iseven(_), _))

The third argument to @groupby there is another projection function that is applied to each element before it is placed in a group, in this case just the identity function.
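
To make the roles of the two functions concrete, here is a rough plain-Julia sketch of what a key selector plus an element projection amount to. It is eager rather than lazy, the name group_with_projection is hypothetical, and it is not how Query.jl implements @groupby:

```julia
# Hypothetical helper, NOT part of Query.jl: eager grouping with a key
# selector (keyfn) and an element projection (projfn).
function group_with_projection(keyfn, projfn, xs)
    groups = Dict{Any,Vector{Any}}()
    order = Any[]                      # keys in first-seen order
    for x in xs
        k = keyfn(x)
        if !haskey(groups, k)
            groups[k] = Any[]
            push!(order, k)
        end
        push!(groups[k], projfn(x))    # the projection is applied before storing
    end
    return [k => groups[k] for k in order]
end

# identity as projection: the groups hold the original elements
by_parity = group_with_projection(iseven, identity, 1:10)

# a different projection: the groups hold the squares instead
squares_by_parity = group_with_projection(iseven, x -> x^2, 1:10)
```

With identity as the projection the groups contain the elements themselves; swapping in x -> x^2 keeps the same keys but stores the squared values.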

The LINQ style version would be:

@from i in 1:10 begin
    @group i by iseven(i) into g
    @select g
    @collect
end

Both do the same thing.


#5

In my packages I tend to perform a few transformations on a series of subgroups. Hence, I have those as Vector{Vector{Vector{T}}} where T <: Integer. Some transformations are passed the subgroups as obj[subgroup]. Would it be possible to use the iterators as a more efficient solution?


#6

Building a Dict seems to me the best way to construct the groups, though perhaps sorting could be faster.
A Dict is certainly a flexible data structure for working with the groups. I am not sure exactly what you mean;
a code example would be useful.
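
For reference, the sorting alternative I have in mind would be something like the sketch below. The name groupby_sorted is made up, it assumes the keys can be ordered with isless, and I have not benchmarked it against the Dict version:

```julia
# Hypothetical sort-based alternative to the Dict version.
# Assumes the keys in v are sortable (isless is defined for them).
function groupby_sorted(v::AbstractVector, l::AbstractVector)
    length(v) == length(l) ||
        throw(ArgumentError("v and l must have the same length"))
    # sortperm is documented to be stable, so items keep their
    # original relative order within each group
    perm = sortperm(v)
    res = Pair{eltype(v),Vector{eltype(l)}}[]
    for i in perm
        # start a new group whenever the key changes
        if isempty(res) || !isequal(first(res[end]), v[i])
            push!(res, v[i] => eltype(l)[])
        end
        push!(last(res[end]), l[i])
    end
    return res
end

days = [31, 28, 31, 30]
months = [:Jan, :Feb, :Mar, :Apr]
sorted_groups = groupby_sorted(days, months)
```

The result is a vector of key => items pairs ordered by key, rather than a Dict, so looking up a single group is no longer constant time.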


#7

You can also do that with Query.jl:

1:10 |> @groupby(iseven(_)) |> @map(_.key => collect(_)) |> Dict
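
For comparison, a plain-Julia sketch that builds the same kind of Dict without Query.jl. It delegates the function version to the vector version through map, so the key type is whatever element type map produces; on an empty input that type depends on what inference can determine, so treat this as one possible approach rather than a guaranteed answer to the return-type question:

```julia
# Vector-keyed method, essentially the one from the first post, but using
# the do-block form of get! so a new vector is only allocated for new keys.
function groupby(v::AbstractVector, l::AbstractVector)
    res = Dict{eltype(v),Vector{eltype(l)}}()
    for (k, val) in zip(v, l)
        push!(get!(() -> eltype(l)[], res, k), val)
    end
    return res
end

# Function-keyed method: compute all keys first, then reuse the method above.
# map also works on an empty l, so no l[1] access is needed.
groupby(f, l::AbstractVector) = groupby(map(f, l), l)

parity = groupby(iseven, 1:10)
nothing_to_group = groupby(iseven, Int[])
```

The cost is one temporary vector of keys, which is usually small next to the groups themselves.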

#8

See, for example, within.jl in JuliaEconometrics/EconUtils.jl. Think of R’s tapply(X, INDEX, FUN) or ddply(.data, .variables, .fun). The idea is to have an object::AbstractVecOrMat, an index, and a function. The index can either be computed once and stored, or be given as a mapping function. The function is applied to each subgroup of the data. I usually generate the index through something like

Index = map(elem -> find(isequal(elem), obj), unique(obj))

where Index::Vector{Vector{Int64}}. In some cases, for multiple dimensions, I use Index::Vector{Vector{Vector{Int64}}}. A typical function would look like:

function magic(obj::AbstractMatrix,
               ind::AbstractVector{<:AbstractVector{<:AbstractVector{<:Integer}}})
    for dimension ∈ ind
        for lvl ∈ dimension
            obj[lvl,:] = foo(obj[lvl,:])
        end
    end
end
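
A small self-contained version of the single-dimension case, as a sketch only: group_indices, apply_within! and demean are illustrative names, not functions from EconUtils.jl, and findall is the Julia 1.x name for what older versions called find.

```julia
# Illustrative helpers, not taken from EconUtils.jl.

# One vector of row indices per distinct key, in first-seen order.
# Note: this scans the keys once per distinct value.
group_indices(keys::AbstractVector) =
    [findall(isequal(k), keys) for k in unique(keys)]

# Apply f to each block of rows and write the result back in place.
function apply_within!(f, obj::AbstractMatrix,
                       index::AbstractVector{<:AbstractVector{<:Integer}})
    for rows in index
        obj[rows, :] = f(obj[rows, :])
    end
    return obj
end

labels = [:a, :b, :a, :b]
X = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]

index = group_indices(labels)

# Example transformation: subtract each column's within-group mean.
demean(block) = block .- sum(block, dims=1) ./ size(block, 1)
apply_within!(demean, X, index)
```

Here rows 1 and 3 form one group and rows 2 and 4 the other, so each column of X ends up centered within its group.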