Merry Christmas everybody,
Today I have released version 1.0.0 of SplitApplyCombine.jl, with the headline feature being that grouping functions have moved to returning a dictionary from Dictionaries.jl.
On a personal note, this is a big deal for me. Let me explain. While I used the previous group function from SplitApplyCombine quite frequently to analyse and explore data, once I had the groups it was difficult to perform further analysis.
julia> using SplitApplyCombine, Statistics
julia> group(iseven, 1:10)
Dict{Bool,Array{Int64,1}} with 2 entries:
false => [1, 3, 5, 7, 9]
true => [2, 4, 6, 8, 10]
julia> mean.(group(iseven, 1:10))
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
[1] broadcastable(::Dict{Bool,Array{Int64,1}}) at ./broadcast.jl:618
[2] broadcasted(::Function, ::Dict{Bool,Array{Int64,1}}) at ./broadcast.jl:1166
[3] top-level scope at none:0
While one can construct a new Dict with what I want, it was inconvenient enough that I added a bunch of convenience functions like groupsum and groupcount (well, there are performance reasons for this, too). In fact, groupcount may be one of my most used functions when exploring data:
julia> groupcount(iseven, 1:10)
Dict{Bool,Int64} with 2 entries:
false => 5
true => 5
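For reference, the "construct a new Dict" workaround mentioned above might look like the following minimal sketch. It uses only Base and the Statistics standard library; the variable names are illustrative, and the input is a hand-written Dict shaped like the old group output.

```julia
using Statistics

# A plain Dict shaped like the result of the old `group(iseven, 1:10)`
groups = Dict(false => [1, 3, 5, 7, 9], true => [2, 4, 6, 8, 10])

# Broadcasting over a Dict is reserved, so build the result with a generator of pairs
group_means = Dict(k => mean(v) for (k, v) in groups)

group_means[false]  # 5.0
group_means[true]   # 6.0
```

It works, but it is enough ceremony that I rarely bothered at the REPL.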
In any case, the difficulties were enough to spawn my interest in the dictionary interface and ultimately led to Dictionaries.jl, which was a bunch of work. But at the end of the day it was motivated by my desire to find e.g. the mean of groups, so without further ado, this is what you get with SplitApplyCombine 1.0.0:
julia> group(iseven, 1:10)
2-element Dictionaries.HashDictionary{Bool,Array{Int64,1}}
false │ [1, 3, 5, 7, 9]
true │ [2, 4, 6, 8, 10]
julia> mean.(group(iseven, 1:10))
2-element Dictionaries.HashDictionary{Bool,Float64}
false │ 5.0
true │ 6.0
Hurray! Merry Christmas, me!
I also addressed another usability concern: while you sometimes determine the grouping keys via a function, as in group(by::Function, a), at other times you may already know the groups. A common case is grouping data by a column of a dataframe, so let’s look at that.
In the spirit of the day, I found a CSV file on the internet called Christmas.csv. It’s about Christmas-themed advertisements, and here is a sample:
shell> head -5 Christmas.csv
Name,Brand,Decade,Image.src,Find more here
7up (1948),7up,1940,http://file.vintageadbrowser.com/0clbm89x7h7efw.jpg,http://www.vintageadbrowser.com/xmas-ads-1940s/14
7-Up Soda Bottle Santa Hand Christmas (1964),7up,1960,http://file.vintageadbrowser.com/lzfxsdpk7tz7kp.jpg,http://www.vintageadbrowser.com/xmas-ads-1960s
"A.H. Grebe & Company’s Radio – It is our sincere hope that the gifts you make this Christmas may bring to the little worlds into which they go, something of the joy and happiness t (1927)",A.H. Grebe & Company’s Radio,1920,http://file.vintageadbrowser.com/rczr6m2331lbn6.jpg,http://www.vintageadbrowser.com/xmas-ads-1920s/2
Christmas Tree Art A&p Coffee (1958),A&P,1950,http://file.vintageadbrowser.com/hf83k5lw0icn9h.jpg,http://www.vintageadbrowser.com/xmas-ads-1950s/9
We can read it into the REPL. It’s a bit wide to see all the columns of all the rows, but here is the first row.
julia> using SplitApplyCombine, CSV, Statistics
julia> df = CSV.read("Christmas.csv"); df[1, :]
DataFrameRow
│ Row │ Name       │ Brand  │ Decade │ Image.src                                           │ Find more here                                    │
│     │ String     │ String │ Int64  │ String                                              │ String                                            │
├─────┼────────────┼────────┼────────┼─────────────────────────────────────────────────────┼───────────────────────────────────────────────────┤
│ 1   │ 7up (1948) │ 7up    │ 1940   │ http://file.vintageadbrowser.com/0clbm89x7h7efw.jpg │ http://www.vintageadbrowser.com/xmas-ads-1940s/14 │
The first thing I would do with a dataset like this is try to understand some basic distributions, like how often does a brand appear in the list?
julia> counts = groupcount(df.Brand)
279-element Dictionaries.HashDictionary{String,Int64}
"Lejon" │ 1
"Ford Motor Company" │ 1
"Texaco" │ 1
"Hoover" │ 1
"Philip's" │ 1
"PG&E" │ 1
"Nunnaly" │ 1
"Wilcox-Gay" │ 1
"Alcoa Aluminum" │ 1
"DuMont" │ 1
"General Electric" │ 7
"American Greeting" │ 1
"Barbasol" │ 1
"Hardware Mutual" │ 1
⋮ │ ⋮
julia> mean(counts)
2.010752688172043
julia> findmax(counts)
(42, "Kodak")
I guess Kodak figured out that people like to take photographs at Christmas time? Note that while findmax works fine on a Dict, the mean would require mean(values(counts)). A minor detail to be sure, but every bit of convenience helps.
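For comparison, here is a small sketch of the same two queries against a plain Base Dict. The brand counts are made up for illustration and are not taken from Christmas.csv.

```julia
using Statistics

# Hypothetical counts, stored in a Base Dict rather than a Dictionaries.jl dictionary
counts = Dict("Kodak" => 3, "Texaco" => 1, "Hoover" => 2)

# findmax pairs each value with its key, so it works directly on the Dict
findmax(counts)        # (3, "Kodak")

# Iterating a Dict yields key => value pairs, so the values must be pulled out first
mean(values(counts))   # 2.0
```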
The newer functionality comes in the form of group(groups, values), where the first argument is a collection of group keys of the same size as the second, which seems particularly useful with tabular data.
julia> group(df.Brand, df.Decade)
279-element Dictionaries.HashDictionary{String,Array{Int64,1}}
"Lejon" │ [1940]
"Ford Motor Company" │ [1950]
"Texaco" │ [1960]
"Hoover" │ [1960]
"Philip's" │ [1940]
"PG&E" │ [1960]
"Nunnaly" │ [1920]
"Wilcox-Gay" │ [1940]
"Alcoa Aluminum" │ [1960]
"DuMont" │ [1940]
"General Electric" │ [1940, 1940, 1940, 1940, 1940, 1940, 1960]
"American Greeting" │ [1940]
"Barbasol" │ [1940]
"Hardware Mutual" │ [1950]
⋮ │ ⋮
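If you don't have the CSV file handy, the same group(groups, values) call can be tried on two small vectors. This is a minimal sketch assuming SplitApplyCombine 1.0 or later is installed; the brands and decades below are made-up stand-ins for the two dataframe columns.

```julia
using SplitApplyCombine

# Hypothetical stand-ins for df.Brand and df.Decade (same length, matched by position)
brands  = ["Kodak", "Texaco", "Kodak", "Hoover"]
decades = [1900, 1960, 1910, 1960]

# Each value is collected under the key found at the same position in `brands`
by_brand = group(brands, decades)

by_brand["Kodak"]   # [1900, 1910]
by_brand["Texaco"]  # [1960]
```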
We also have new functions groupunique and grouponly, which may be somewhat useful in some situations. For example:
julia> groupunique(df.Decade, df.Brand)
10-element Dictionaries.HashDictionary{Int64,Dictionaries.HashIndices{String}}
1980 │ {"Alexander O’Neal", "Absolut Vodka", "Nynex", "Seagram's", "Cutty Sark", "Baileys", "A…
1990 │ {"Barbie", "Absolut Vodka", "Bucks", "BlueBlocker", "Hewlett Packard", "Jack Daniel’s",…
1930 │ {"Camel", "Four Roses", "The Etude Music Magazine", "Elgin Watch", "Underwood", "Hamilt…
1970 │ {"Max Factor", "Chivas Regal", "Jerry Silverman", "Northwest Christmas Tree Association…
1900 │ {"Kodak", "Wanamaker", "H. O’Neill & Co.", "Other", "Gates Potteries", "Citizens Nation…
1920 │ {"Johnston’s", "Willys-Overland Six", "Camel", "Atwater", "A.H. Grebe & Company’s Radio…
1960 │ {"Tiffany", "Max Factor", "AMF", "Guerlain", "Revlon", "Texaco", "Four Roses", "Hoover"…
1940 │ {"Lejon", "Sportsman", "Sunbeam", "Philip's", "Arrow", "Jantzen", "Pennsylvania Railroa…
1910 │ {"Kodak", "Other", "Blue Bird", "Larkin Factory", "FDT Florist"}
1950 │ {"Ford Motor Company", "Air Express", "Jell-o", "Arrow", "New York Central Railroad", "…
julia> length.(ans)
10-element Dictionaries.HashDictionary{Int64,Int64}
1980 │ 15
1990 │ 9
1930 │ 26
1970 │ 19
1900 │ 6
1920 │ 21
1960 │ 35
1940 │ 137
1910 │ 5
1950 │ 91
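The other new function, grouponly, is aimed at the case where every key is expected to identify exactly one element. Here is a rough sketch, assuming it accepts the same (by, iter) call form as group and that SplitApplyCombine 1.0 or later is installed; the brand list is made up for illustration.

```julia
using SplitApplyCombine

# Hypothetical data in which every first letter occurs exactly once
brands = ["Kodak", "Texaco", "Hoover"]

# Assumed call form, mirroring group(by, iter): key each brand by its first character.
# Each key maps to a single element rather than to a vector of elements.
by_initial = grouponly(first, brands)

by_initial['K']  # "Kodak"
```

As I understand it, a key that matches more than one element is treated as an error, which makes grouponly a reasonable fit for building lookup tables from unique identifiers.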
Anyway, that’s it for now. I feel the future holds some interesting work around tables which behave like dictionaries, contain primary keys or are partitioned, and grouping functions that return flattened containers in a similar vein to SQL and DataFrames.groupby.
Happy holidays!
Andy