Groupby function?

While porting GAP code, I translated the function CollectBy (that I find very useful) as:

"""
  group items of list l according to the corresponding values in list v

    julia> groupby([31,28,31,30,31,30,31,31,30,31,30,31],
           [:Jan,:Feb,:Mar,:Apr,:May,:Jun,:Jul,:Aug,:Sep,:Oct,:Nov,:Dec])
    Dict{Int64,Array{Symbol,1}} with 3 entries:
      31 => Symbol[:Jan, :Mar, :May, :Jul, :Aug, :Oct, :Dec]
      28 => Symbol[:Feb]
      30 => Symbol[:Apr, :Jun, :Sep, :Nov]

"""  
function groupby(v::AbstractVector,l::AbstractVector)
  res=Dict{eltype(v),Vector{eltype(l)}}()
  for (k,val) in zip(v,l)
    push!(get!(()->eltype(l)[],res,k),val) # allocate a new group only for unseen keys
  end
  res
end

"""
  group items of list l according to the values taken by function f on them

    julia> groupby(iseven,1:10)
    Dict{Bool,Array{Int64,1}} with 2 entries:
      false => [1, 3, 5, 7, 9]
      true  => [2, 4, 6, 8, 10]

Note: in this version l is required to be non-empty, since I do not know how to
access the return type of a function.
"""
function groupby(f,l::AbstractVector)
  res=Dict(f(l[1])=>[l[1]]) # l should be nonempty
  for val in Iterators.drop(l,1) # skip l[1] without copying the rest of l
    push!(get!(()->similar(l,0),res,f(val)),val) # allocate only for unseen keys
  end
  res
end

I chose the name groupby because I saw in some messages that a function of that name,
doing something similar, seems to exist in some package. I have several questions:

  • Is there some “standard library” where such a function can be found?
  • If not, does my implementation look good? Is it possible to do better/faster?
  • In particular, is there a good way to solve the problem of accessing a function's return type?
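
One possible way around the return-type problem is to compute all the keys first and let `map` choose their element type, so the function never has to be asked for its return type directly. The sketch below assumes this is acceptable at the cost of one temporary vector of keys; the name `groupby_keys_first` is made up for illustration. For an empty input the key type depends on type inference, so it may come out as `Any` for functions that are not inferable.

```julia
# Hypothetical helper (the name is not from the thread): compute every key
# up front, then group. An empty `l` works because `map` still returns a
# (possibly empty) vector whose eltype can be used for the Dict keys.
function groupby_keys_first(f, l::AbstractVector)
    ks = map(f, l)                              # one key per element of l
    res = Dict{eltype(ks), Vector{eltype(l)}}()
    for (k, val) in zip(ks, l)
        # the closure only runs when k is not yet present in res
        push!(get!(() -> eltype(l)[], res, k), val)
    end
    return res
end

groupby_keys_first(iseven, 1:10)   # false => [1, 3, 5, 7, 9], true => [2, 4, 6, 8, 10]
groupby_keys_first(iseven, Int[])  # empty Dict, no error
```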

Query.jl also provides an implementation for groupby. It should work with pretty much any data source.

In Query.jl it seems groupby is a macro rather than a function. How does it work in the examples I gave?

The standalone query commands version would be:

1:10 |> @groupby(iseven(_)) |> collect

You need the collect at the end because @groupby returns a lazy iterator.

If you don’t want to use the pipe syntax, you’d do:

collect(@groupby(1:10, iseven(_), _))

The third argument to @groupby there is another projection function that is applied to each element before it is placed in a group, in this case just the identity function.
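
To make the role of the two functions concrete, here is a plain-Julia sketch of grouping with a key selector and an element projection. It is only an analogue for illustration, not how Query.jl implements @groupby: the name `group_project` is hypothetical, it builds a Dict eagerly rather than returning a lazy iterator, and it uses untyped containers to stay short.

```julia
# Hypothetical plain-Julia analogue (not Query.jl's implementation):
# keyf picks the group key, elemf transforms each element before it is stored.
function group_project(keyf, elemf, itr)
    res = Dict{Any, Vector{Any}}()   # untyped containers keep the sketch simple
    for x in itr
        push!(get!(() -> Any[], res, keyf(x)), elemf(x))
    end
    return res
end

g  = group_project(iseven, identity, 1:10)  # identity projection: elements stored unchanged
sq = group_project(iseven, x -> x^2, 1:4)   # squares are stored, keyed by parity of the original
```

With `identity` as the projection, the groups contain the original elements, which corresponds to passing `_` as the third argument above.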

The LINQ style version would be:

@from i in 1:10 begin
    @group i by iseven(i) into g
    @select g
    @collect
end

Both do the same thing.

In my packages I tend to perform a few transformations on a series of subgroups, so I store the subgroup indices as Vector{Vector{Vector{T}}} where T <: Integer. Some transformations are passed the subgroups as obj[subgroup]. Would the lazy iterators be a more efficient alternative?

Building a Dict seems to me the best way to construct the groups, though perhaps sorting could be faster.
A Dict is certainly a very flexible data structure for working with the groups. I am not sure exactly what you mean;
a code example would be useful.
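
For comparison, the sorting alternative mentioned above could look roughly like the sketch below. It assumes the keys can be ordered with `isless` and that `v` and `l` have the same length and 1-based indices; the name `groupby_sorted` is hypothetical. Whether it actually beats the Dict version has not been measured here and will depend on the data.

```julia
# Hypothetical sort-based grouping: sort the indices by key, then cut the
# sorted sequence into runs of equal keys. Returns pairs ordered by key.
function groupby_sorted(v::AbstractVector, l::AbstractVector)
    perm = sortperm(v)    # stable, so original order is kept inside each group
    res = Pair{eltype(v), Vector{eltype(l)}}[]
    for i in perm
        if isempty(res) || !isequal(last(res).first, v[i])
            push!(res, v[i] => eltype(l)[])   # a new key starts a new group
        end
        push!(last(res).second, l[i])
    end
    return res
end

groupby_sorted([31, 28, 31, 30], [:Jan, :Feb, :Mar, :Apr])
# 28 => [:Feb], 30 => [:Apr], 31 => [:Jan, :Mar]
```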

You can also do that with Query.jl:

1:10 |> @groupby(iseven(_)) |> @map(_.key => collect(_)) |> Dict

See, for example, JuliaEconometrics/EconUtils.jl, within.jl. Think of R’s tapply(X, INDEX, FUN) or ddply(.data, .variables, .fun). The idea is to have an object::AbstractVecOrMat, an index, and a function. The index can either be computed once and stored or be a mapping function. The function is applied to each subgroup of the data. I usually generate the index through something like

Index = map(elem -> find(isequal(elem), obj), unique(obj))

where Index::Vector{Vector{Int64}}. In some cases, for multiple dimensions, I use Index::Vector{Vector{Vector{Int64}}}. A typical function would look like:

function magic(obj::AbstractMatrix,
               ind::AbstractVector{<:AbstractVector{<:AbstractVector{<:Integer}}})
    for dimension ∈ ind
        for lvl ∈ dimension
            obj[lvl,:] = foo(obj[lvl,:])
        end
    end
end
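
The index construction above scans obj once per distinct value. A single-pass variant using a Dict is sketched below; the name `group_indices` is hypothetical, and the groups are returned in order of first appearance so that the result should match the `map`/`unique` construction. Note also that `find` was renamed `findall` in Julia 1.0.

```julia
# Hypothetical one-pass construction of Index::Vector{Vector{Int}}.
function group_indices(obj::AbstractVector)
    positions = Dict{eltype(obj), Vector{Int}}()
    order = eltype(obj)[]            # distinct values in order of first appearance
    for (i, x) in pairs(obj)
        if !haskey(positions, x)
            positions[x] = Int[]
            push!(order, x)
        end
        push!(positions[x], i)       # record where x occurs
    end
    return [positions[x] for x in order]
end

Index = group_indices([:a, :b, :a, :c, :b])   # [[1, 3], [2, 5], [4]]
```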