Summarizing grouped DataFrame where a group is entirely missing

Dear community,
when I want to summarize a grouped DataFrame e.g. with median and a group contains only missing, I get an error, because skipmissing returns an empty array and median(Float64[]) throws an error.

The code below works, but is there a better way than defining an extra function (savefun)?

using DataFrameMacros, Statistics, Chain

d = @chain begin
  DataFrame(x=rand(12))
  @transform :gr = @bycol repeat('A':'D'; inner=3)
  @transform :x_miss=:gr == 'A' ? missing : :x  ## make one group missing completely
end

function calc(df, vbl, gr, fun)
  savefun(x) = try fun(x) catch y missing end
  outvar = string(vbl)*"_"*string(fun)
  @chain df  begin
    @groupby {gr}
    @combine {outvar} = (savefun ∘ skipmissing)({string(vbl)}) 
  end
end

calc(d, :x_miss, :gr, median)

Thanks!

If you skipmissing before the groupby, the missing group will not be created. Is that a good answer?

Probably not so good for my use case, because I would not like to drop entire rows, but I might not fully understand, what you meant.

Yes, this would drop lines, as you mentioned. Since the try/catch feels bad to me, perhaps (can be adapted to macro notation):

combine(groupby(d,:gr), :x_miss => 
  (x-> all(ismissing,x) ? missing : median(x)) =>
  :x_miss_median)

A little verbose, but not too bad.
Result:

4Γ—2 DataFrame
 Row β”‚ gr    x_miss_median  
     β”‚ Char  Float64?       
─────┼──────────────────────
   1 β”‚ A     missing        
   2 β”‚ B           0.33751
   3 β”‚ C           0.771823
   4 β”‚ D           0.726377

Or (same result):

combine(groupby(d,:gr), :x_miss => 
  ((x-> isempty(x) ? missing : median(x))∘ skipmissing) =>
  :x_miss_median)

ADDITION:
Since median of vector containing missing is missing, why not go for the simple:

julia> combine(groupby(d,:gr), :x_miss => median => :x_miss_median)
4Γ—2 DataFrame
 Row β”‚ gr    x_miss_median  
     β”‚ Char  Float64?       
─────┼──────────────────────
   1 β”‚ A     missing        
   2 β”‚ B           0.33751
   3 β”‚ C           0.771823
   4 β”‚ D           0.726377

Keeping the non-missing median of a missing infested group is dangerous by itself (from data science perspective).

BTW this works for me only without { } on outvar.

Not sure if this is just what you’re looking for, but you could use dropmissing with the option view=true.
Example: suppose the following dataframe

df = DataFrame(a = [1,1,2,2,missing], b = [1,2,3,4,5])

julia> df
5Γ—2 DataFrame
 Row β”‚ a        b     
     β”‚ Int64?   Int64 
─────┼────────────────
   1 β”‚       1      1
   2 β”‚       1      2
   3 β”‚       2      3
   4 β”‚       2      4
   5 β”‚ missing      5

Then you can create a view without missing values and calculate the median

temp = dropmissing(df, view=true)
transform!(groupby(temp,:a), :b => median)

with output

julia> df
5Γ—3 DataFrame
 Row β”‚ a        b      b_median  
     β”‚ Int64?   Int64  Float64?  
─────┼───────────────────────────
   1 β”‚       1      1        1.5
   2 β”‚       1      2        1.5
   3 β”‚       2      3        3.5
   4 β”‚       2      4        3.5
   5 β”‚ missing      5  missing   
1 Like

That’s cool. I think you get exactly what @mreichMPI-BGC meant if you setup with df = DataFrame(a = [1,1,2,2,3], b = [1,2,3,4,missing])

Thanks for all the discussion. At the end IMO only this solution works as intended:

Or in the original example:

d = @chain begin
      DataFrame(x=rand(12))
      @transform :gr = @bycol repeat('A':'D'; inner=3)
      @transform :x_miss=:gr == 'A' ? missing : :x
    end

function calc(df, vbl, gr, fun)
  safefun(x) = all(ismissing.(x)) ? missing : fun(x)
  outvar = string(vbl)*"_"*string(fun)
  @chain df  begin
    @groupby {gr}
    @combine begin
    {outvar} = (safefun ∘ skipmissing)({string(vbl)}) ## Variant 1 with explicit function
    {outvar*"_2"} = ((x -> all(ismissing.(x)) ? missing : fun(x)) ∘ skipmissing)({string(vbl)}) ## Variant 2 with anonymous function
    end
  end
end

The other solutions don’t work, because I don’t want missing if only few entries are missing or because I want to combine, not `transform``.

Nice. I think you missed the predicate argument for all(), which would make it:

safefun(x) = all(ismissing, x) ? missing : fun(x)

which is nicer (and like my orig suggestion).
P.S. beyond nicer, it doesn’t materialize a vector of Bools and short-circuits faster.

You could also define

safe(fun) = x -> (all(ismissing, x) ? missing : fun(x))

which you can then apply to any function to create a safe version of it. So you could call safe(median)(vec) or safe(mean)(vec).

As a side note, you don’t need to do {string(vbl)}, you don’t gain anything from vbl being turned into a string if it’s a symbol. And I would probably find it confusing if other objects like Ints were being turned into strings such that vbl = 1 would not return the first column but a column named "1".

1 Like

Great! I was thinking of that, coming from Tidyverse: Capture side effects. β€” safely β€’ purrr .

Would it be possible to propagate arbitrary keyword arguments to fun?

safe(fun, kwargs...) = x -> (all(ismissing, x) ? missing : fun(x, kwargs...))

In R one could use ... often for this. Sorry, hve no code editor open.

Regarding the string β€œthing” - thanks, need to check why I thought I needed it. Maybe a leftover, from when I wanted to manipulate the name before.

Btw I believe making such convenience functions available in a high-level package, would probably increase the user base of Julia a lot. Basically transferring R purrr to Juila…

Yes, but you need a ; to signal that you want keyword arguments, not trailing positional:

safe(fun) = (x; kwargs...) -> (all(ismissing, x) ? missing : fun(x; kwargs...))

I have not used purrr but my experience with Julia has been that after learning the primitives, the need for convenience functions is reduced because they are usually not that hard to build on the fly. Of course, if one keeps redefining the same helpers, a package would be better. The space of possible helper functions is just very large, and if people don’t know that one specific function exists in a package, they will redefine it anyway.

3 Likes

I have opened Consider allowing default in quantile and median Β· Issue #132 Β· JuliaStats/Statistics.jl Β· GitHub to discuss how these cases should be handled in Statistics.jl.

I really only have the user perspective but IMO for many people the high-level, well-structured, consistent and intuitive functions in the tidyverse really made a difference. Of course (and that’s nice) many exist in Julia already natively, but still…

How does purrr solve this problem? The functionality of purrr is pretty much fully captured by map and broadcast in Julia. It doesn’t help handling missings iirc.

Yes, that what I meant with

Just in purrr exist a some β€œhigher level” convenience functions, such a safely, possibly and quietly or map_df etc. Cf. Capture side effects. β€” safely β€’ purrr.

That brings me to the question whether something like

function safe(fun) 
   (x; kwargs...) -> try fun(x; kwargs...) catch y missing end
end

Is something β€œgood”. The advantage is that it would also catch other issues than β€œjust” the β€œall-missing” problem.

What are disadvantages? Performance? Type-stable this should be, if fun is Type-stable?
From a puristic developing point of view probably not good style, but for some β€œbig” data science tasks at least convenient.

Try catch comes with a performance penalty as far as I know, also you’d catch any sort of error with this, even plain bugs like UndefVarErrors. So usually it’s not a good idea.

1 Like

@mreichMPI-BGC - see discussion in Consider allowing default in quantile and median Β· Issue #132 Β· JuliaStats/Statistics.jl Β· GitHub (and maybe comment there what you think from user’s perspective). That discussion is exactly about how to design such things correctly (by correctly I mean to e.g. avoid exceptions when indeed they should be avoided, but at the same time not cover exceptions that you want to be raised unconditionally like OutOfMemory exception, and at the same time ensure that operations would be still fast)

1 Like