using DataFrames

df = DataFrame(
    a = rand(1:8, 1000),
    b = rand(1:8, 1000),
    c = rand(1:8, 1000),
)
combine(groupby(df, :a), nrow => :meh) # this works

# Returns a one-element vector holding the group's columns other than :a
function _meh(sdf)
    [sdf[!, Not(:a)]]
end
combine(groupby(df, :a), _meh) # this works
combine(groupby(df, :a), _meh => :meh) # this DOESN'T
The x => y syntax (with only two arguments), as documented, is reserved for the case where x specifies source columns and y specifies a transformation.
You are trying to make x a transformation and y a target column name. Allowing this would lead to ambiguity.
Our position is that when you read x => y, it must be clear what the expression means without evaluating x and y (this way it is possible to decide statically what the code will do, which I think is an important thing to ensure).
I forgot to add that nrow is an exception to this rule: counting rows is a very common use case, so we decided to allow an exception for it.
As you know, the docstring for combine has pages of text (as we cover so many scenarios). This is special rule 5 in the list below (quoted from the docstring):
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target
column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name
and function name by default (see examples below).
3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be single
name (as a Symbol or a string).
6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small
or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
Incidentally, nrow even has a separate code path in the implementation to ensure it is fast.
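As a rough illustration of the rules quoted above, here is a minimal sketch of three documented ways to get a column named :meh out of a grouped combine. The small data frame and the "sum of :b plus sum of :c" reduction are made up for the example; they are not from the original question.

```julia
using DataFrames

# Hypothetical toy data, small enough to check by hand
df = DataFrame(a = [1, 1, 2], b = [10, 20, 30], c = [1, 2, 4])
gdf = groupby(df, :a; sort = true)

# Rule 3: cols => function => target_cols
by_cols = combine(gdf, [:b, :c] => ((b, c) -> sum(b) + sum(c)) => :meh)

# Rule 7: a bare function receiving a SubDataFrame; returning a
# NamedTuple is one way to choose the output column name
by_sdf = combine(gdf, sdf -> (meh = sum(sdf.b) + sum(sdf.c),))

# Rule 5: the nrow => target_cols special case
counts = combine(gdf, nrow => :meh)
```

The first form is usually preferable when only a few columns are needed, since the docstring warns that the SubDataFrame form can be slow when there are many groups.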
Hi @bkamins, is there a way to get the following to work without having to convert the vector of user IDs into a string?
I want to group by :id and then list the number of users and the vector of user IDs for each :id. However, the col => function => new_col syntax isn't working for me unless I convert the vector of user IDs into a string. I can always then parse the column of strings using eval(Meta.parse.()), but I was wondering if there is a cleaner solution.
help?> ∘
"∘" can be typed by \circ<tab>
search: ∘
f ∘ g
Compose functions: i.e. (f ∘ g)(args...) means f(g(args...)). The ∘ symbol
can be entered in the Julia REPL (and most editors, appropriately
configured) by typing \circ<tab>.
You can do whatever you like. Ref just serves as protection from broadcasting, as in Julia Base.
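To make the Ref remark concrete, here is a minimal sketch of the grouping described in the question. The column names :id and :user_id and the sample values are assumptions, since the original data frame is not shown. Wrapping the result in Ref makes combine store each group's vector in a single row instead of spreading it over several rows; collect is composed in only to turn the group's view into a plain Vector.

```julia
using DataFrames

# Hypothetical data: the original poster's columns are not shown
df = DataFrame(id = [1, 1, 2], user_id = [10, 11, 12])

out = combine(
    groupby(df, :id; sort = true),
    nrow => :n_users,                          # number of users per :id
    :user_id => (Ref ∘ collect) => :user_ids,  # one vector of IDs per :id
)
```

Here out has one row per :id, and the :user_ids column holds vectors, so no round trip through strings is needed.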