Why doesn't this work with DataFrames.jl `combine`?

This works

using DataFrames

df = DataFrame(
    a = rand(1:8, 1000),
    b = rand(1:8, 1000),
    c = rand(1:8, 1000),
)

combine(groupby(df, :a), nrow => :meh) # this works

function _meh(sdf)
  [sdf[!, Not(:a)]]
end

combine(groupby(df, :a), _meh) # this works

combine(groupby(df, :a), _meh => :meh) # this DOESN'T

Why is this? Is this a bug?

The x => y syntax (with only two arguments), as it is documented, is reserved for the case where x specifies source columns and y specifies transformations.

You are trying to make x the transformation and y the target column name. Allowing this would lead to ambiguity.

When you read x => y, our position is that it must be clear what the expression means without evaluating x and y. That way it is possible to decide statically what the code will do, which I think is an important thing to ensure.

Do

function _meh(sdf)
    (meh = [sdf[!, Not(:a)]],)
end

combine(groupby(df, :a), _meh) # this works

instead.
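
To illustrate the suggestion above: when a function is passed on its own and returns a NamedTuple, the keys become the output column names. A minimal sketch on a small made-up table (the data and the helper name `summarize` are just for illustration, not from the original post):

```julia
using DataFrames

# Small illustrative table; values are made up for this sketch.
df = DataFrame(a = [1, 1, 2], b = [10, 20, 30])

# Returning a NamedTuple lets the function itself name its output columns.
summarize(sdf) = (n = nrow(sdf), total = sum(sdf.b))

out = combine(groupby(df, :a), summarize)
# `out` has columns :a, :n and :total, with one row per group
```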

I forgot to add that nrow is an exception to this rule: it is such a common use case that we decided to allow an exception for it.

Ah, I thought the exception was the rule.

As you know, the docstring for combine has pages of text (as we cover so many scenarios). This is special rule 5 in the list below (quoted from the docstring):

  All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

    1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)

    2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target
       column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name
       and function name by default (see examples below).

    3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.

    4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.

    5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be single
       name (as a Symbol or a string).

    6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5

    7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small
       or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
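
A few of the forms from the quoted list can be sketched like this. The table and the target column names (:total, :b_copy, :n) are made up for illustration:

```julia
using DataFrames

# Small illustrative table; values are made up for this sketch.
df = DataFrame(a = [1, 1, 2], b = [10, 20, 30])
gdf = groupby(df, :a)

# Form 2: cols => function, target name is generated automatically (:b_sum).
auto_named = combine(gdf, :b => sum)

# Form 3: cols => function => target_cols, target name given explicitly.
explicit = combine(gdf, :b => sum => :total)

# Form 4: col => target_cols, a plain rename (all rows are kept).
renamed = combine(gdf, :b => :b_copy)

# Form 5: nrow => target_cols, the special case discussed in this thread.
counted = combine(gdf, nrow => :n)
```

In every form above the left-hand side of the first `=>` is either a column selector or `nrow`, which is why `_meh => :meh` does not fit any of them.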

Incidentally, nrow even has a separate code path in the implementation to ensure it is fast.

Hi @bkamins, is there a way to get the following to work without having to convert the vector of user IDs into a string?

I want to group by :id and then list the number of users and the vector of user IDs for each :id. However, the col => function => new_col syntax isn’t working for me unless I convert the vector of user IDs into a string. I can always then parse the column of strings using eval(Meta.parse.()), but I was wondering if there is a cleaner solution.

Thanks!

julia> using DataFrames

julia> df = DataFrame(id = [1,1,1,2,2], user = [101,102,103,104,105])
5×2 DataFrame
 Row │ id     user  
     │ Int64  Int64 
─────┼──────────────
   1 │     1    101
   2 │     1    102
   3 │     1    103
   4 │     2    104
   5 │     2    105

julia> gdf = groupby(df, :id);

julia> df2 = combine(gdf, nrow => :number_of_users, :user => (x -> collect(x)) => :user_list)
5×3 DataFrame
 Row │ id     number_of_users  user_list 
     │ Int64  Int64            Int64     
─────┼───────────────────────────────────
   1 │     1                3        101
   2 │     1                3        102
   3 │     1                3        103
   4 │     2                2        104
   5 │     2                2        105

julia> df2 = combine(gdf, nrow => :number_of_users, :user => (x -> string(collect(x))) => :user_list) 
2×3 DataFrame
 Row │ id     number_of_users  user_list       
     │ Int64  Int64            String
─────┼─────────────────────────────────────────
   1 │     1                3  [101, 102, 103]
   2 │     2                2  [104, 105]

You want string.(collect(x)). Remember the broadcasting.

Arrays can be converted to string, which is the behavior you are seeing here.
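
For reference, the difference between calling string on the whole array and broadcasting it over the elements looks like this in plain Julia (the variable names are just for illustration):

```julia
ids = [101, 102, 103]

# string applied to the whole array gives one String for the array.
whole = string(ids)    # "[101, 102, 103]"

# Broadcasting with the dot applies string to each element separately.
each = string.(ids)    # ["101", "102", "103"]
```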

This is a non-copying approach:

julia> df2 = combine(gdf, nrow => :number_of_users, :user => Ref => :user_list)
2×3 DataFrame
 Row │ id     number_of_users  user_list
     │ Int64  Int64            SubArray…
─────┼─────────────────────────────────────────
   1 │     1                3  [101, 102, 103]
   2 │     2                2  [104, 105]

and this is with un-aliasing:

julia> df2 = combine(gdf, nrow => :number_of_users, :user => Ref∘copy => :user_list)
2×3 DataFrame
 Row │ id     number_of_users  user_list
     │ Int64  Int64            Array…
─────┼─────────────────────────────────────────
   1 │     1                3  [101, 102, 103]
   2 │     2                2  [104, 105]

@pdeffebach - I think @hdavid16 wants to produce a vector of vectors of original data.

Awesome! Thank you @bkamins. What is the symbol between Ref and copy in the second example?

Also, is it possible to remove duplicates (unique) and skip missing values (skipmissing) using this method?

It is function composition:

help?> ∘
"∘" can be typed by \circ<tab>

search: ∘

  f ∘ g

  Compose functions: i.e. (f ∘ g)(args...) means f(g(args...)). The ∘ symbol
  can be entered in the Julia REPL (and most editors, appropriately
  configured) by typing \circ<tab>.
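
A quick sketch of what composition does in plain Julia, including the Ref ∘ copy combination used earlier in the thread (the variable names here are made up for illustration):

```julia
# (sqrt ∘ abs)(x) is the same as sqrt(abs(x)).
abs_then_sqrt = sqrt ∘ abs
root = abs_then_sqrt(-16.0)    # 4.0

# Ref ∘ copy first copies its argument, then wraps the copy in a Ref.
wrap_copy = Ref ∘ copy
v = [1, 2, 3]
r = wrap_copy(v)

same_values = r[] == v      # true: the wrapped vector has equal contents
is_alias    = r[] === v     # false: it is a copy, not the original vector
```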

You can do whatever you like. Ref just serves as protection from broadcasting, as in Julia Base.
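
As a small illustration of the same idea in Base (names made up for this sketch), wrapping a collection in Ref makes broadcasting treat it as a single value instead of iterating over it:

```julia
allowed = [2, 3]

# Without Ref, `in.` would try to pair up the two vectors element by element.
# With Ref, each element on the left is tested against the whole `allowed` vector.
hits = in.([1, 2, 3], Ref(allowed))    # [false, true, true]
```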

Thank you!

Yes. Just make an anonymous function which performs those operations as well, something like x -> Ref(collect(skipmissing(unique(x)))).
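
A minimal sketch of that suggestion, assuming a variant of the table from the question with a duplicate and a missing value added for illustration (the helper name `clean` and the column name :number_of_rows are also made up):

```julia
using DataFrames

# Variant of the table from the question; the duplicate 101 and the
# missing entry are added here only to exercise unique and skipmissing.
df = DataFrame(id = [1, 1, 1, 2, 2], user = [101, 101, missing, 104, 105])
gdf = groupby(df, :id)

# Drop duplicates and missings, then wrap in Ref so that the resulting
# vector is stored as a single cell per group instead of being expanded.
clean(x) = Ref(collect(skipmissing(unique(x))))

out = combine(gdf, nrow => :number_of_rows, :user => clean => :user_list)
# `out` has one row per :id, and :user_list holds one vector per group
```

Note that nrow still counts every row in the group, including the duplicate and the missing one.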