[ANN] FlexiGroups.jl -- composable and general dataset group-bys

Arrange tabular or non-tabular datasets into groups according to a specified key function.
Now registered in the General registry.

The basic principle of FlexiGroups is that the result of a grouping operation is always a collection of groups, and each group is a collection of elements. Groups are typically indexed by the grouping key.
This is similar to the group(...) interface in SplitApplyCombine.jl; see the README for the differences.

This interface makes FlexiGroups compatible and composable with a wide range of collection and table types, and with generic data processing functions such as map and filter from Base.
Companion packages following similar design approaches: FlexiMaps and FlexiJoins.

The main workhorse is the group function:
group([keyf=identity], X; [restype=Dictionary]) groups the elements of X by keyf(x) and returns a mapping from each key to the list of values in that group:

julia> using FlexiGroups

julia> xs = 3 .* [1, 2, 3, 4, 5]

julia> g = group(isodd, xs)
2-element Dictionary
  true │ [3, 9, 15]
 false │ [6, 12]
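
Since the result is a plain dictionary of vectors, generic functions can be applied to it directly. A minimal sketch, assuming the default Dictionary result follows the Dictionaries.jl convention that map and filter operate on the values and keep the keys:

```julia
using FlexiGroups

xs = 3 .* [1, 2, 3, 4, 5]
g = group(isodd, xs)

# map from Base: apply a function to every group, keys are preserved
sizes = map(length, g)

# filter from Base: keep only the groups that satisfy a predicate
big = filter(v -> length(v) > 2, g)

sizes[true], sizes[false]   # group sizes for odd and even elements
collect(keys(big))          # keys of the groups that passed the filter
```

Here sizes holds the number of elements per group, and big retains only the groups with more than two elements.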

The result is an (ordered) Dictionary by default, but this can be changed to the base Dict or another dictionary type. As an alternative to dictionaries, restype=KeyedArray (from AxisKeys.jl) returns a KeyedArray whose axiskeys are the group keys.
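
As a small illustration of switching the result type, the sketch below requests a base Dict through the restype keyword from the signature above. Keep in mind that a Dict does not preserve the order in which keys were first seen.

```julia
using FlexiGroups

xs = 3 .* [1, 2, 3, 4, 5]

# Same grouping as before, but collected into a Base.Dict
gd = group(isodd, xs; restype=Dict)

gd[true]    # elements for which isodd returned true
gd[false]   # elements for which isodd returned false
```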

For details and more features/examples:

  • views,
  • margins,
  • pivot tables,
  • by-group transformations,

see the docs.
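
A quick sketch of two of these features. groupmap is the function used in the update below; groupview is assumed here to be the view-returning counterpart of group, as described in the package docs, so treat that call as an assumption rather than a reference:

```julia
using FlexiGroups

xs = 3 .* [1, 2, 3, 4, 5]

# By-group transformation: store sum(group) instead of the group itself
sums = groupmap(isodd, sum, xs)

# Views (assumed API): each group refers to elements of xs without copying
gv = groupview(isodd, xs)

sums[true], sums[false]   # per-group sums
gv[true]                  # the odd elements of xs, as a view
```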


Update: FlexiGroups 0.1.8

New and improved features in this version:

Margins: now with KeyedArrays

The addmargins function is now exported from FlexiGroups, and supports both dictionaries and keyed arrays. This is convenient for building summary tables with an arbitrary number of variables/dimensions.
Two-variable example:

julia> using FlexiGroups, AxisKeys

# some data:
julia> x = rand(1:10, 10)

# group the elements of x by isodd(x) and x % 3 == 0, count elements in each group:
julia> cnts = groupmap(
           x -> (odd=isodd(x), div3=x % 3 == 0),
           length, x;
           restype=KeyedArray, default=0
       )
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   odd ∈ 2-element Vector{Bool}
→   div3 ∈ 2-element Vector{Bool}
And data, 2×2 Matrix{Int64}:
          (false)  (true)
 (false)        3       0
  (true)        5       2

# add margins: sum of counts along each dimension, and the total count
julia> cnts_m = addmargins(cnts, combine=sum)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   odd ∈ 3-element Vector{Union{FlexiGroups.MarginKey, Bool}}
→   div3 ∈ 3-element Vector{Union{FlexiGroups.MarginKey, Bool}}
And data, 3×3 Matrix{Int64}:
          (false)  (true)  (total)
 (false)        3       0        3
  (true)        5       2        7
 (total)        8       2       10

# the result is a regular keyed array, supports common operations
# such as key-based access:
julia> cnts_m(odd=total, div3=true)
2
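
The example above uses random data, so the counts differ from run to run. A reproducible variant of the same steps, using the fixed input 1:10 (chosen here only for illustration) so that the margins can be checked by hand:

```julia
using FlexiGroups, AxisKeys

# Fixed input instead of rand, so the counts are deterministic
x = collect(1:10)

cnts = groupmap(
    v -> (odd=isodd(v), div3=v % 3 == 0),
    length, x;
    restype=KeyedArray, default=0
)

# Sum the counts along each dimension and overall
cnts_m = addmargins(cnts, combine=sum)

cnts(odd=true, div3=true)        # odd multiples of 3 in 1:10, i.e. 3 and 9
cnts_m(odd=total, div3=true)     # all multiples of 3 in 1:10, i.e. 3, 6 and 9
cnts_m(odd=total, div3=total)    # all ten elements
```

The margin entries are looked up with the same key-based syntax as regular entries, using total in place of a concrete key.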

CategoricalValues: keep all levels

When the grouping key is a CategoricalValue (from CategoricalArrays.jl), all potential levels are kept in the result:

julia> using FlexiGroups, CategoricalArrays, StructArrays

# columns a and b have the same values,
# but b is categorical with three levels (1, 2, 3)
julia> tbl = StructArray(a=[1, 2, 1, 3], b=CategoricalArray([1, 2, 1, 3]))[1:3]
3-element StructArray(::Vector{Int64}, ::CategoricalVector{Int64, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}}) with eltype NamedTuple{(:a, :b), Tuple{Int64, CategoricalValue{Int64, UInt32}}}:
 (a = 1, b = 1)
 (a = 2, b = 2)
 (a = 1, b = 1)

# group by regular values: only a=1 and a=2 are present
julia> groupmap(x -> x.a, length, tbl)
2-element Dictionaries.Dictionary{Int64, Int64}
 1 │ 2
 2 │ 1

# group by categorical values: it knows about the b=3 level
# b=3 group is empty, so the calculated length is zero:
julia> groupmap(x -> x.b, length, tbl)
3-element Dictionaries.Dictionary{Int64, Int64}
 1 │ 2
 2 │ 1
 3 │ 0