SplitApplyCombine.jl reaches version 1.0.0, with `group` enhancements

Merry Christmas everybody,

Today I have released version 1.0.0 of SplitApplyCombine.jl, with the headline feature being that grouping functions have moved to returning a dictionary from Dictionaries.jl.

On a personal note, this is a big deal for me. Let me explain. While I used the previous group function from SplitApplyCombine quite frequently to analyse and explore data, once I had the groups it was difficult to perform further analysis.

julia> using SplitApplyCombine, Statistics

julia> group(iseven, 1:10)
Dict{Bool,Array{Int64,1}} with 2 entries:
  false => [1, 3, 5, 7, 9]
  true  => [2, 4, 6, 8, 10]

julia> mean.(group(iseven, 1:10))
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
 [1] broadcastable(::Dict{Bool,Array{Int64,1}}) at ./broadcast.jl:618
 [2] broadcasted(::Function, ::Dict{Bool,Array{Int64,1}}) at ./broadcast.jl:1166
 [3] top-level scope at none:0

While one can construct a new Dict containing what I want, doing so was inconvenient enough that I added a bunch of convenience functions like groupsum and groupcount (well, there are performance reasons for these, too). In fact, groupcount may be one of my most used functions when exploring data:

julia> groupcount(iseven, 1:10)
Dict{Bool,Int64} with 2 entries:
  false => 5
  true  => 5
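
For reference, the "construct a new Dict" workaround mentioned above can be sketched with nothing but Base and Statistics. This is only an illustration of the manual approach: the grouping loop is a stand-in for the old group result, not the package's implementation, and the variable names are my own.

```julia
using Statistics

# Group 1:10 by parity by hand, using a plain Base Dict.
# (Stand-in for the pre-1.0 `group(iseven, 1:10)` result.)
groups = Dict{Bool,Vector{Int}}()
for x in 1:10
    push!(get!(() -> Int[], groups, iseven(x)), x)
end

# Broadcasting `mean.` over a Dict is an error, so build the result explicitly.
means = Dict(k => mean(v) for (k, v) in groups)

println(means[false])  # mean of the odd numbers
println(means[true])   # mean of the even numbers
```

It works, but it is a lot of ceremony for something as simple as a per-group mean.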

In any case, the difficulties were enough to spark my interest in the dictionary interface and ultimately led to Dictionaries.jl, which was a bunch of work. But at the end of the day it was motivated by my desire to find e.g. the mean of groups, so without further ado, this is what you get with SplitApplyCombine 1.0.0:

julia> group(iseven, 1:10)
2-element Dictionaries.HashDictionary{Bool,Array{Int64,1}}
 false │ [1, 3, 5, 7, 9]
  true │ [2, 4, 6, 8, 10]

julia> mean.(group(iseven, 1:10))
2-element Dictionaries.HashDictionary{Bool,Float64}
 false │ 5.0
  true │ 6.0

Hurray! Merry Christmas, me! :slight_smile:

I also addressed another usability concern. Sometimes you determine the grouping keys via a function, as in group(by::Function, a), but at other times you already know the groups. A common case is grouping data by a column of a dataframe, so let’s look at that.

In the spirit of the day, I found a CSV file on the internet called Christmas.csv. It’s about Christmas-themed advertisements, and here is a sample:

shell> head -5 Christmas.csv
Name,Brand,Decade,Image.src,Find more here
7up (1948),7up,1940,http://file.vintageadbrowser.com/0clbm89x7h7efw.jpg,http://www.vintageadbrowser.com/xmas-ads-1940s/14
7-Up Soda Bottle Santa Hand Christmas (1964),7up,1960,http://file.vintageadbrowser.com/lzfxsdpk7tz7kp.jpg,http://www.vintageadbrowser.com/xmas-ads-1960s
"A.H. Grebe & Company’s Radio – It is our sincere hope that the gifts you make this Christmas may bring to the little worlds into which they go, something of the joy and happiness t (1927)",A.H. Grebe & Company’s Radio,1920,http://file.vintageadbrowser.com/rczr6m2331lbn6.jpg,http://www.vintageadbrowser.com/xmas-ads-1920s/2
Christmas Tree Art A&p Coffee (1958),A&P,1950,http://file.vintageadbrowser.com/hf83k5lw0icn9h.jpg,http://www.vintageadbrowser.com/xmas-ads-1950s/9

We can read it into the REPL - it’s a bit wide to see all the columns of all the rows, but here is the first row.

julia> using SplitApplyCombine, CSV, Statistics

julia> df = CSV.read("Christmas.csv"); df[1, :]
│ Row │ Name       │ Brand  │ Decade │ Image.src                                           │ Find more here                                    │
│     │ String     │ String │ Int64  │ String                                              │ String                                            │
│ 1   │ 7up (1948) │ 7up    │ 1940   │ http://file.vintageadbrowser.com/0clbm89x7h7efw.jpg │ http://www.vintageadbrowser.com/xmas-ads-1940s/14 │

The first thing I would do with a dataset like this is try to understand some basic distributions, like how often does a brand appear in the list?

julia> counts = groupcount(df.Brand)
279-element Dictionaries.HashDictionary{String,Int64}
              "Lejon" │ 1
 "Ford Motor Company" │ 1
             "Texaco" │ 1
             "Hoover" │ 1
           "Philip's" │ 1
               "PG&E" │ 1
            "Nunnaly" │ 1
         "Wilcox-Gay" │ 1
     "Alcoa Aluminum" │ 1
             "DuMont" │ 1
   "General Electric" │ 7
  "American Greeting" │ 1
           "Barbasol" │ 1
    "Hardware Mutual" │ 1
                    ⋮ │ ⋮

julia> mean(counts)

julia> findmax(counts)
(42, "Kodak")

I guess Kodak figured out that people like to take photographs at Christmas time? :slight_smile: Note that with a Base Dict, findmax works fine but the mean requires mean(values(counts)), whereas the new dictionary supports mean(counts) directly. A minor detail to be sure, but every bit of convenience helps.
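
For comparison, here is a small sketch of the Base Dict situation described above. The three entries are a toy subset of the counts shown earlier, and the variable names are just for illustration.

```julia
using Statistics

# Toy subset of the brand counts, stored in a plain Base Dict.
counts = Dict("Kodak" => 42, "General Electric" => 7, "Texaco" => 1)

# Iterating a Dict yields key => value pairs, so take the values first.
avg = mean(values(counts))

# `findmax` on a Dict returns (largest value, its key).
top = findmax(counts)

println(avg)
println(top)
```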

The newer functionality comes in the form of group(groups, values), where groups is a collection of grouping keys with the same size as values. This seems particularly useful with tabular data.

julia> group(df.Brand, df.Decade)
279-element Dictionaries.HashDictionary{String,Array{Int64,1}}
              "Lejon" │ [1940]
 "Ford Motor Company" │ [1950]
             "Texaco" │ [1960]
             "Hoover" │ [1960]
           "Philip's" │ [1940]
               "PG&E" │ [1960]
            "Nunnaly" │ [1920]
         "Wilcox-Gay" │ [1940]
     "Alcoa Aluminum" │ [1960]
             "DuMont" │ [1940]
   "General Electric" │ [1940, 1940, 1940, 1940, 1940, 1940, 1960]
  "American Greeting" │ [1940]
           "Barbasol" │ [1940]
    "Hardware Mutual" │ [1950]
                    ⋮ │ ⋮
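
To make the semantics of group(groups, values) concrete, here is a rough Base-only sketch. Note that group_sketch is a hypothetical helper written for this illustration. It is not how SplitApplyCombine implements group, it returns a Base Dict rather than a dictionary from Dictionaries.jl, and the two small vectors are made-up stand-ins for the Brand and Decade columns.

```julia
# Hypothetical helper illustrating the semantics of `group(groups, values)`;
# it is not the SplitApplyCombine.jl implementation and returns a Base Dict.
function group_sketch(groups, values)
    length(groups) == length(values) ||
        throw(DimensionMismatch("groups and values must have the same length"))
    K, V = eltype(groups), eltype(values)
    out = Dict{K,Vector{V}}()
    # Pair each key with the value at the same position and collect per key.
    for (g, v) in zip(groups, values)
        push!(get!(() -> V[], out, g), v)
    end
    return out
end

# Made-up miniature of the Brand and Decade columns.
brands  = ["Kodak", "7up", "Kodak", "A&P"]
decades = [1900, 1940, 1910, 1950]

by_brand = group_sketch(brands, decades)
println(by_brand["Kodak"])
```

The point is simply that the i-th entry of the first argument decides which group the i-th entry of the second argument lands in.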

We also have new functions groupunique and grouponly. For example, groupunique collects the distinct values found in each group:

julia> groupunique(df.Decade, df.Brand)
10-element Dictionaries.HashDictionary{Int64,Dictionaries.HashIndices{String}}
 1980 │ {"Alexander O’Neal", "Absolut Vodka", "Nynex", "Seagram's", "Cutty Sark", "Baileys", "A…
 1990 │ {"Barbie", "Absolut Vodka", "Bucks", "BlueBlocker", "Hewlett Packard", "Jack Daniel’s",…
 1930 │ {"Camel", "Four Roses", "The Etude Music Magazine", "Elgin Watch", "Underwood", "Hamilt…
 1970 │ {"Max Factor", "Chivas Regal", "Jerry Silverman", "Northwest Christmas Tree Association…
 1900 │ {"Kodak", "Wanamaker", "H. O’Neill & Co.", "Other", "Gates Potteries", "Citizens Nation…
 1920 │ {"Johnston’s", "Willys-Overland Six", "Camel", "Atwater", "A.H. Grebe & Company’s Radio…
 1960 │ {"Tiffany", "Max Factor", "AMF", "Guerlain", "Revlon", "Texaco", "Four Roses", "Hoover"…
 1940 │ {"Lejon", "Sportsman", "Sunbeam", "Philip's", "Arrow", "Jantzen", "Pennsylvania Railroa…
 1910 │ {"Kodak", "Other", "Blue Bird", "Larkin Factory", "FDT Florist"}
 1950 │ {"Ford Motor Company", "Air Express", "Jell-o", "Arrow", "New York Central Railroad", "…

julia> length.(ans)
10-element Dictionaries.HashDictionary{Int64,Int64}
 1980 │ 15
 1990 │ 9
 1930 │ 26
 1970 │ 19
 1900 │ 6
 1920 │ 21
 1960 │ 35
 1940 │ 137
 1910 │ 5
 1950 │ 91
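
As a rough picture of what the groupunique call followed by length. computes, here is a Base-only sketch. As before, groupunique_sketch is a hypothetical helper and not the package's implementation, it uses a Base Dict of Sets rather than the Dictionaries.jl types shown in the output, and the data is made up.

```julia
# Hypothetical helper: distinct values per group, as a Base Dict of Sets.
function groupunique_sketch(groups, values)
    K, V = eltype(groups), eltype(values)
    out = Dict{K,Set{V}}()
    for (g, v) in zip(groups, values)
        # A Set silently ignores values it already contains.
        push!(get!(() -> Set{V}(), out, g), v)
    end
    return out
end

# Made-up miniature of the Decade and Brand columns.
decades = [1900, 1910, 1900, 1940, 1940]
brands  = ["Kodak", "Kodak", "Kodak", "7up", "Lejon"]

unique_brands = groupunique_sketch(decades, brands)

# Number of distinct brands per decade, built by hand for a Base Dict.
n_brands = Dict(k => length(v) for (k, v) in unique_brands)
println(n_brands)
```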

Anyway, that’s it for now. I feel the future holds some interesting work around tables which behave like dictionaries, contain primary keys or are partitioned, and grouping functions that return flattened containers in a similar vein to SQL and DataFrames.groupby.

Happy holidays!