Hello,
Thanks! This is the type of operation I was looking for.
It took me some time to understand what the two nested functions do, I will explain it here so to facilitate the future readers.
If someone knows how to do the same operations with DataFramesMeta it d be great.
Alessio
1. groupby + innerjoin
func(category0, category1) = category1 - category0
select(
innerjoin(groupby(df1, :group, sort=true)..., on = :category, makeunique=true),
:category, r"measure" => func => [:difference]
)
The groupby
function creates a list of dataframes, each for a value of :group
.
innerjoin
stack the newly created dataframes using :category
as unique id. The makeunique
argument is pivotal, and it works like this:
makeunique
: if false
(the default), an error will be raised if duplicate names are found in columns not joined on; if true
, duplicate names will be suffixed with _i
(i
starting at 1 for the first duplicate).
The resulting dataframe has columns:
[:category, :group, :group_1, :measure, :measure_1]
The select
function uses a regular expression to parse the columns that start with βmeasureβ.
The order of the stacked data frames is used to pass the columns in order. In this case, it is easy because it is just two :group
, but how to have more control on it?
2. unstack
select(
unstack(df1, :group, :measure, renamecols=x->Symbol(:_, x)),
:category, [:_0, :_1] => func => [:difference]
)
The problem of arranging columns based on their name is easier solved with unstack
.
unstack
creates new columns based on the values in :group
in the initial dataframe. I suggest reading the documentation because the positional arguments are not trivial.
The resulting dataframe has columns:
[:category, :_0, :_1]
Where _0
and _1
are derived by the values in :group
. This is convenient because the following select
function can leverage the named columns:
select(unstacked_df, :category, [:_0, :_1] => func => [:difference]