using DataFrames

df = DataFrame(
    a = rand(1:8, 1000),
    b = rand(1:8, 1000),
    c = rand(1:8, 1000),
)
combine(groupby(df, :a), nrow => :meh) # this works
combine(groupby(df, :a), _meh) # this works
combine(groupby(df, :a), _meh => :meh) # this DOESN'T work
The x => y syntax (with only two arguments), as it is documented, is reserved for the case where x specifies source columns and y specifies transformations.
You are trying to make x a transformation and y a target column name. Allowing this would lead to ambiguity.
Our position is that when you read x => y, it must be clear what the expression means without evaluating x and y. That way it is possible to decide statically what the code will do, which I think is an important thing to ensure.
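To make this concrete, here is a minimal sketch of two ways to get a named result column without the ambiguous form. The helper `count_rows` is a hypothetical stand-in for `_meh` (whose definition is not shown above), and the small table is made up for illustration.

```julia
using DataFrames

# Small deterministic table so the group sizes are easy to check by eye.
df = DataFrame(a = [1, 1, 2, 2, 2], b = 1:5)
gdf = groupby(df, :a)

# Hypothetical stand-in for `_meh`: takes a whole group (SubDataFrame) and counts its rows.
count_rows(sdf) = nrow(sdf)

# Option 1: name a source column, so the documented cols => function => target form applies.
res1 = combine(gdf, :a => length => :meh)

# Option 2: keep the whole-group function, but return a NamedTuple to choose the column name.
res2 = combine(gdf, sdf -> (meh = count_rows(sdf),))
```

Both calls should give a `:meh` column with the group sizes. In both cases the left-hand side of `=>` (when present) is a column selector, so the meaning can be read off without evaluating anything.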
As you know, the docstring for combine has pages of text (as we cover so many scenarios). The nrow => :meh form works because of special rule 5 in the list below (quoted from the docstring):
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:
1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target
column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name
and function name by default (see examples below).
3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be single
name (as a Symbol or a string).
6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small
or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
Incidentally, nrow even has a separate code path in the implementation to ensure it is fast.
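As a rough illustration of rules 2 to 6 from the quoted list, the sketch below applies each form to a small made-up table. The column names and values are assumptions chosen only so the results are easy to verify.

```julia
using DataFrames

# Made-up data: two groups in :a, values to aggregate in :b.
df = DataFrame(a = [1, 1, 2, 2, 2], b = [10, 20, 30, 40, 50])
gdf = groupby(df, :a)

# Rule 2: cols => function; the target name is generated automatically (here "b_sum").
r2 = combine(gdf, :b => sum)

# Rule 3: cols => function => target_cols; the target name is given explicitly.
r3 = combine(gdf, :b => sum => :total)

# Rule 4: col => target_cols; a plain rename, so every row of each group is kept.
r4 = combine(gdf, :b => :b2)

# Rule 5: nrow alone (column named :nrow) or nrow => target_cols.
r5a = combine(gdf, nrow)
r5b = combine(gdf, nrow => :meh)

# Rule 6: a vector of transformations written with the Pair syntax.
r6 = combine(gdf, [:b => sum => :total, :b => maximum => :largest])
```

Note that in every Pair form the left-hand side is either a column selector or `nrow`, never an arbitrary function.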
Hi @bkamins, is there a way to get the following to work without having to convert the vector of user IDs into a string?
I want to group by :id and then list the number of users and the vector of user IDs for each :id. However, the col => function => new_col syntax isn’t working for me unless I convert the vector of user IDs into a string. I could then parse the column of strings using eval(Meta.parse.()), but I was wondering if there is a cleaner solution.
"∘" can be typed by \circ<tab>
f ∘ g
Compose functions: i.e. (f ∘ g)(args...) means f(g(args...)). The ∘ symbol
can be entered in the Julia REPL (and most editors, appropriately
configured) by typing \circ<tab>.
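A tiny Base-only sketch of what the help text describes; the function choices here are arbitrary examples.

```julia
# (f ∘ g)(x) is the same as f(g(x)): here abs runs first, then sqrt.
root_of_abs = sqrt ∘ abs
y = root_of_abs(-4)

# Composition chains: unique runs first, then length.
n_distinct = (length ∘ unique)([1, 1, 2, 3, 3])
```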
You can do whatever you like. Ref just serves as protection from broadcasting, as in Julia Base.
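For the grouping question above, a minimal sketch of the difference `Ref` makes. The table and the column names `:id` and `:user` are assumptions, since the original data was not posted.

```julia
using DataFrames

# Hypothetical data: several users per id, with one repeated user in the second group.
df = DataFrame(id = [1, 1, 2, 2, 2], user = ["u1", "u2", "u3", "u4", "u3"])
gdf = groupby(df, :id)

# Without Ref, a vector returned by the function is spread over several rows per group.
expanded = combine(gdf, :user => unique => :users)

# With Ref, the whole vector is treated as a single value, giving one row per group.
collapsed = combine(gdf,
    nrow => :n_rows,
    :user => (length ∘ unique) => :n_users,
    :user => (Ref ∘ unique) => :users)
```

Here `collapsed.users` is a column whose cells are vectors, so no conversion to strings is needed. Any function can go before `Ref` in the composition (for example `copy` instead of `unique`) if duplicates should be kept.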