dropmissing(!) is undefined for GroupedDataFrames

Hi,

The following code throws an error:


using DataFrames

df = DataFrame()
df.ticker = vcat(["AAPL" for i in 1:10], ["TSLA" for i in 1:10])
df.prices = vcat(rand(19), missing)

gdf = groupby(df, :ticker)

dropmissing!(gdf)  # errors: dropmissing! has no method for GroupedDataFrame

The only solution I have at the moment is:

df = combine(gdf, valuecols(gdf))

dropmissing!(df)

gdf = groupby(df, :ticker)

Which seems awkward and inefficient. I’ve also tried broadcasting dropmissing over the grouped data frame, but that seems to be deliberately disallowed.

Does anyone have a better workaround, or know why dropmissing isn’t defined for GDFs?

Thanks!

Hi,

Why would you want to do dropmissing on gdf rather than on df before grouping it?

What I mean is that in your original code I would have thought you would do:

df.ticker = vcat(["AAPL" for i in 1:10], ["TSLA" for i in 1:10])
df.prices = vcat(rand(19), missing)
dropmissing!(df)
gdf = groupby(df, :ticker)

Because in reality I have an intermediate function that requires groups, that is guaranteed to return missing values:

function calculate_log_return(X)
    N = size(X, 1)
    out = Vector{Union{Missing, Float64}}(undef, N)
    out[1] = missing
    for i in 2:N 
        @inbounds out[i] = log(X[i]) - log(X[i - 1])
    end
    out
end

transform!(gdf, :prices => calculate_log_return => :log_return)

But the result of your transform! function is a data frame (not a grouped data frame), so you can drop missings from it.

So while I am still unsure what exactly you need, note that you can write e.g. combine(gdf, dropmissing, ungroup=false) and you should get a GroupedDataFrame with the missings dropped. Is this what you wanted?
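A minimal sketch of that suggestion, assuming DataFrames.jl 1.x. The toy data just mirrors the example from the first post, and the name clean is only illustrative:

```julia
using DataFrames

df = DataFrame(ticker = repeat(["AAPL", "TSLA"], inner = 10),
               prices = vcat(rand(19), missing))
gdf = groupby(df, :ticker)

# Apply dropmissing to each group and keep the result grouped
clean = combine(gdf, dropmissing, ungroup = false)

clean isa GroupedDataFrame   # true
nrow(parent(clean))          # 19, the row holding the missing price is gone
```

Note that this builds a new GroupedDataFrame with a new parent; the original df and gdf are left untouched.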


Ah, there’s a subtlety here that you’ve just made me realise:

transform! does actually return a value, and this value is an ungrouped data frame, but it also leaves its argument (appropriately transformed) grouped. So we get:

gdf = groupby(df, :ticker)

new_df = transform!(gdf, :prices => calculate_log_return => :log_return)

where gdf is still a grouped data frame, but new_df is not.

So yes, combine(gdf, dropmissing, ungroup=false) is exactly what I need, thanks. However, I think it should be considered a bug that an in-place function ‘returns’ values of two different types!

Could you please explain what you think is a bug? transform! by default returns a data frame. If you passed ungroup=false to transform!, it would return a GroupedDataFrame. It is your choice what value you want returned.

The fact that transform! is in-place means that it updates the passed argument (and it does it).

An in-place function does not have to return its argument (transform! is consistent with transform in what it returns).
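A small sketch of the two return types, assuming DataFrames.jl 1.x; the data and the column name cum_prices are made up for illustration:

```julia
using DataFrames

df = DataFrame(ticker = repeat(["AAPL", "TSLA"], inner = 3), prices = rand(6))
gdf = groupby(df, :ticker)

# Default (ungroup = true): the return value is a DataFrame
r1 = transform!(gdf, :prices => cumsum => :cum_prices)

# Opt-in (ungroup = false): the return value is a GroupedDataFrame
r2 = transform!(gdf, :prices => cumsum => :cum_prices, ungroup = false)

r1 isa DataFrame            # true
r2 isa GroupedDataFrame     # true
"cum_prices" in names(df)   # true: the parent of gdf is updated either way
```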

As an example from Base Julia of an in-place function that does not return its argument, consider:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> pop!(x)
3

julia> x
2-element Vector{Int64}:
 1
 2

Firstly, I didn’t expect transform! to return anything (and I don’t think it should). I get that it’s convenient (e.g. pop!) for in-place functions to return values as well, and that there might be cases when the type of the return value is different to the type of the mutated argument. However, I feel like this should be for very special cases that are made explicit. Beyond my personal gripes with having an in-place function return a value, my reasons for calling this behaviour with transform! and dropmissing! a bug are:

  1. If I transform! a gdf I expect to get a gdf left in place. If I also wanted to store the result (like new_df = transform!(...)) then I would expect that to be of the same type, a gdf. I think the fact that the return value of transform! is different to the in-place value is a bug because it’s unexpected behaviour. If this is supposed to be a convenience trick, to let you carry on with an ungrouped df once you’ve finished transforming, then I think that should be made clear in the docs. It would be quite strange if D = mul!(A, B, C) left A as a Matrix{Float64} but returned D as a Vector{Vector{Float64}}.

  2. Either way, dropmissing! still isn’t defined for gdfs, which is at least a shame, and it would be great to see. There are lots of cases (see above) where a split-apply operation will produce missing values and the author of the code would like to keep a gdf with missing values removed. It looks like it’s as simple as dropmissing!(gdf::GroupedDataFrame) = combine!(gdf, dropmissing, ungroup=false).

One thing to note, “unexpected behavior” is different than “undocumented behavior”. Bugs are when the behavior violates the contract described in the documentation. So this technically doesn’t qualify as a bug.

One other note, having transform! return a value really helps with chaining.

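using Chain, DataFrames, LinearAlgebra  # normalize assumed to be LinearAlgebra.normalize
@chain df begin
    transform!(:x => normalize => :y)  # @chain passes df as the first argument
    select(:y)
end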

I think you meant counterintuitive behavior, and I kind of agree.

@pdeffebach I’ve not used Chain before but that example looks a little odd - surely you only need transform (not in-place) if you want to immediately pass that on to select? Unless you also want the mutated DataFrame to be available after you’ve inspected :y? I suppose there should be a performance gain from working in place, and then keeping :y is free. I do love the dplyr syntax though! I will check it out.

@Henrique_Becker I’m not convinced that there’s much of a difference here between counterintuitive and unexpected, but sure! I think I would just point back to the mul! example I made earlier. And I don’t mind if it’s not officially a ‘bug’, it’s still worth pointing out and IMHO still worth changing.

There is a major performance gain. I think there is a huge benefit in being able to write slow code, with transform and then with one simple change make all of it fast, by adding a !.


Let me add some comments:

  1. As I have already explained, transform and transform! can both return either a data frame or a grouped data frame; the question is which of them is returned by default. This issue was discussed a lot before we made the decision, and most users preferred a data frame by default (I admit that it was not an easy decision, since different people have different expectations about default behavior). As @pdeffebach noted, in the end you just need to learn what is the default and what is the opt-in.
  2. The choice between transform! and transform when chaining is mostly a performance issue, as you note. However, exactly for this reason we do not provide combine! (which you used in your example); having combine! would most likely not give the user any performance benefit.
  3. Now, why does dropmissing! not work on a GroupedDataFrame? The reason is that dropmissing! potentially drops rows from a GroupedDataFrame (it could even drop whole groups if all rows of some group were removed). This means that doing it in-place would require re-grouping the GroupedDataFrame, so there would not be much benefit from this in general. However, adding dropmissing, and potentially also dropmissing!, for GroupedDataFrame objects could be considered. Could you please open an issue about it if you really feel they should be added, so that we can discuss it? (The point is that, exactly as you say, it would be just a one-liner, combine(gdf, dropmissing, ungroup=false), for the non-in-place case, so the question is whether it is worth adding. Maybe it is; let us discuss.)
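To illustrate point 3, here is a sketch of what the non-in-place one-liner could look like, and how a whole group can disappear. The helper name dropmissing_grouped is hypothetical and is not part of DataFrames.jl:

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl
dropmissing_grouped(gdf::GroupedDataFrame) =
    combine(gdf, dropmissing, ungroup = false)

df = DataFrame(g = ["a", "a", "b"], x = [1, 2, missing])
gdf = groupby(df, :g)
length(gdf)               # 2 groups before

out = dropmissing_grouped(gdf)
unique(parent(out).g)     # ["a"]: every row of group "b" was dropped
```

Because the set of groups itself can change, the result has to be grouped afresh rather than patched in place.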

Thanks for this! To summarise:

  1. I think it’s great that transform offers the choice over the return type, I just think that it shouldn’t be inconsistent with its in-place value, or this should be made very clear.

  2. Out of interest, why doesn’t combine benefit from being in place?

  3. I see your point about the regrouping issues for dropmissing!. I will open an issue later this week to suggest a dropmissing implementation - it’s small but I do think it’s worth having, and would put DataFrames in line with dplyr:

library(tidyverse)
data <- as_tibble(airquality)
gdf <- group_by(data, Month)
gdf <- drop_na(gdf)

In DataFrames.jl with Chain this is

julia> using DataFrames, Chain;

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 2], b = [4, missing, 5, 8, missing, 9]);

julia> @chain df begin
           groupby(:b)
           combine(dropmissing)
       end
4×2 DataFrame
 Row │ b       a     
     │ Int64?  Int64 
─────┼───────────────
   1 │      4      1
   2 │      5      1
   3 │      8      2
   4 │      9      2

The distinction between grouped data frames and data frames is very clear in Julia, and that’s a really nice feature imo. A GroupedDataFrame is a collection of AbstractDataFrames and just has the methods needed for that.


Because combine changes the number of rows in general. This means that an implementation of combine! would essentially have the following steps:

  1. run combine and store the result on the side
  2. replace the contents of the source data frame with the result of step 1.

E.g. transform! benefits from being in place as it does not touch the columns that are already present in the data frame (while transform would copy them by default).
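A small sketch of that difference, assuming DataFrames.jl 1.x with default keyword arguments; the names a_before and df2 are only illustrative:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])
a_before = df.a    # reference to the stored column vector, no copy

# Non-mutating: existing columns are copied into the new data frame by default
df2 = transform(df, :a => (x -> 2 .* x) => :b)
df2.a === a_before    # false

# In-place: the existing column vector is reused, only :b is added
transform!(df, :a => (x -> 2 .* x) => :b)
df.a === a_before     # true
```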

Regarding what @pdeffebach said, I think it is important to highlight that a GroupedDataFrame is not just “a data frame with a set grouping variable”, but rather “a collection of data frames”. The point is that you can rearrange a grouped data frame or subset it. For example:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

julia> gdf[[3, 1]]
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3
⋮
Last Group (1 row): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1

In other words, it is better to think of a GroupedDataFrame as something like a vector of data frames. It is not actually a vector, because it has some extra features that vectors do not have, e.g. you can index it with group keys:

julia> df = DataFrame(a='a':'c')
3×1 DataFrame
 Row │ a
     │ Char
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 'a'
 Row │ a
     │ Char
─────┼──────
   1 │ a
⋮
Last Group (1 row): a = 'c'
 Row │ a
     │ Char
─────┼──────
   1 │ c

julia> gdf[('b',)]
1×1 SubDataFrame
 Row │ a
     │ Char
─────┼──────
   1 │ b

This is the reason why “by default” we have not added dropmissing for GroupedDataFrame (as it is a collection of data frames), but as I said, we can discuss adding it as a convenience method.
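Seen as a collection, a GroupedDataFrame can simply be iterated, which gives another way to drop missings group by group. This is only a sketch, and the result is a plain vector of data frames rather than a GroupedDataFrame:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [4, missing, 8, 9])
gdf = groupby(df, :a)

# Iterating a GroupedDataFrame yields one SubDataFrame per group
cleaned = [dropmissing(sdf) for sdf in gdf]

nrow.(cleaned)    # [1, 2]
```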


Also if we add dropmissing(::GroupedDataFrame), we should review the API to check whether similar methods should be added for consistency. allowmissing comes to mind, but there are also more controversial cases like unique, deleteat!, insertcols! and so on. It’s hard to decide where to stop, which can be a reason not to start adding such methods in the first place. :slight_smile: