Dear community,
when I summarize a grouped DataFrame, e.g. with median, and a group contains only missing values, I get an error: skipmissing then yields no values, and median of an empty collection (e.g. median(Float64[])) throws.
The code below works, but is there a better way than defining an extra function (savefun)?
using DataFrames, DataFrameMacros, Statistics, Chain
d = @chain begin
    DataFrame(x=rand(12))
    @transform :gr = @bycol repeat('A':'D'; inner=3)
    @transform :x_miss = :gr == 'A' ? missing : :x  # make one group entirely missing
end
function calc(df, vbl, gr, fun)
    savefun(x) = try fun(x) catch y missing end
    outvar = string(vbl)*"_"*string(fun)
    @chain df begin
        @groupby {gr}
        @combine {outvar} = (savefun ∘ skipmissing)({string(vbl)})
    end
end
calc(d, :x_miss, :gr, median)
One option is to define a small higher-order wrapper:
safe(fun) = x -> (all(ismissing, x) ? missing : fun(x))
You can then apply it to any function to create a safe version of it, e.g. safe(median)(vec) or safe(mean)(vec).
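To make the wrapper concrete, here is a minimal sketch of how it could be combined with skipmissing. The names all_missing, partly_missing and safe_median are made up for illustration; only Statistics from the standard library is assumed.

```julia
using Statistics

# Wrapper from the post above: return missing when every element is missing
# (vacuously true for an empty collection), otherwise call fun.
safe(fun) = x -> (all(ismissing, x) ? missing : fun(x))

# Illustrative data (hypothetical names)
all_missing = [missing, missing, missing]
partly_missing = [1.0, missing, 3.0]

# skipmissing drops the missings first; an all-missing group becomes an
# empty iterator, which the wrapper maps to missing instead of calling median.
safe_median = safe(median) ∘ skipmissing

safe_median(all_missing)     # missing
safe_median(partly_missing)  # 2.0
```

Note that the check happens before fun is called, so no exception is ever raised or caught for the all-missing case.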
As a side note, you don’t need {string(vbl)}: you gain nothing from turning vbl into a string if it’s already a symbol. I would also find it confusing if other objects like Ints were turned into strings, so that vbl = 1 would return not the first column but a column named "1".
Btw I believe making such convenience functions available in a high-level package would probably increase the user base of Julia a lot. Basically transferring R purrr to Julia…
I have not used purrr but my experience with Julia has been that after learning the primitives, the need for convenience functions is reduced because they are usually not that hard to build on the fly. Of course, if one keeps redefining the same helpers, a package would be better. The space of possible helper functions is just very large, and if people don’t know that one specific function exists in a package, they will redefine it anyway.
I really only have the user perspective but IMO for many people the high-level, well-structured, consistent and intuitive functions in the tidyverse really made a difference. Of course (and that’s nice) many exist in Julia already natively, but still…
How does purrr solve this problem? The functionality of purrr is pretty much fully captured by map and broadcast in Julia. It doesn’t help handling missings iirc.
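As a rough illustration of that point, the sketch below applies median to each element of a collection with map and with broadcasting. The variables groups and with_missing are invented example data; neither approach changes how missing values are handled.

```julia
using Statistics

# Illustrative data (hypothetical names)
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# map applies a function to each element of a collection
medians_map = map(median, groups)   # [2.0, 5.0]

# Broadcasting with dot syntax gives the same result here
medians_dot = median.(groups)       # [2.0, 5.0]

# Neither adds special treatment of missing values: each result is whatever
# the applied function returns, and median propagates missing.
with_missing = [[1.0, missing, 3.0], [4.0, 5.0, 6.0]]
medians_missing = map(median, with_missing)   # first entry is missing
```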
That brings me to the question of whether something like
function safe(fun)
    (x; kwargs...) -> try fun(x; kwargs...) catch y missing end
end
is a “good” idea. The advantage is that it would also catch problems other than “just” the “all-missing” case.
What are the disadvantages? Performance? It should be type-stable if fun is type-stable, right?
From a purist development point of view this is probably not good style, but for some “big” data science tasks it is at least convenient.
As far as I know, try/catch comes with a performance penalty. You would also catch any sort of error with this, even plain bugs like UndefVarErrors, so it’s usually not a good idea.
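If a try/catch wrapper is used anyway, one way to limit the damage is to catch only the error type you expect and rethrow everything else. The sketch below is an assumption-laden illustration: safe_narrow and buggy are hypothetical names, and it relies on median throwing an ArgumentError for an empty array. ArgumentError can be thrown for other reasons too, so this is still fairly coarse.

```julia
using Statistics

# Hypothetical helper: treat only ArgumentError as "no data" and
# rethrow every other exception unchanged.
function safe_narrow(fun)
    return function (x; kwargs...)
        try
            fun(x; kwargs...)
        catch err
            err isa ArgumentError || rethrow()
            missing
        end
    end
end

empty_result = safe_narrow(median)(Float64[])   # missing

# A deliberate bug: undefined_helper does not exist anywhere.
buggy(x) = undefined_helper(x)

# The bug is not swallowed; the UndefVarError still propagates.
caught = try
    safe_narrow(buggy)([1.0])
    nothing
catch err
    err
end
caught isa UndefVarError   # true
```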
@mreichMPI-BGC - see the discussion in “Consider allowing default in quantile and median” (Issue #132, JuliaStats/Statistics.jl), and maybe comment there with what you think from a user’s perspective. That discussion is exactly about how to design such things correctly: avoiding exceptions where they should be avoided, without hiding exceptions that should be raised unconditionally (like OutOfMemoryError), while keeping the operations fast.
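To give a feel for what a default-based design could look like in user code, here is a minimal sketch. median_or and its default keyword are hypothetical, not part of Statistics.jl and not necessarily what the issue proposes. It checks for emptiness explicitly, so no exception is raised or caught.

```julia
using Statistics

# Hypothetical helper, NOT part of Statistics.jl: return `default`
# for an empty collection instead of throwing.
function median_or(itr; default=missing)
    v = collect(itr)                        # fresh vector, safe to sort in place
    return isempty(v) ? default : median!(v)
end

# Illustrative data (hypothetical names)
only_missing = Union{Missing,Float64}[missing, missing]
some_missing = [1.0, missing, 3.0]

median_or(skipmissing(only_missing))    # missing
median_or(skipmissing(some_missing))    # 2.0
median_or(Float64[]; default=NaN)       # NaN
```

Because nothing is caught, unrelated errors (including bugs in the calling code) surface as usual.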