Dear community,
when I want to summarize a grouped DataFrame, e.g. with median, and a group contains only missing values, I get an error, because skipmissing then yields no elements and median(Float64[]) throws an error.
The code below works, but is there a better way than defining an extra function (savefun)?
using DataFrameMacros, Statistics, Chain
d = @chain begin
DataFrame(x=rand(12))
@transform :gr = @bycol repeat('A':'D'; inner=3)
@transform :x_miss=:gr == 'A' ? missing : :x ## make one group missing completely
end
function calc(df, vbl, gr, fun)
savefun(x) = try fun(x) catch y missing end
outvar = string(vbl)*"_"*string(fun)
@chain df begin
@groupby {gr}
@combine {outvar} = (savefun ∘ skipmissing)({string(vbl)})
end
end
calc(d, :x_miss, :gr, median)
safe(fun) = x -> (all(ismissing, x) ? missing : fun(x))
which you can then apply to any function to create a safe version of it. So you could call safe(median)(vec) or safe(mean)(vec).
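A minimal sketch of how this behaves, using only the Statistics standard library. The variable names (all_missing, some_missing, r1, r2, r3) are just for illustration. Note that an empty skipmissing iterator also counts as "all missing", because all(...) over no elements is true:

```julia
using Statistics

# Wrapper from the post above: return missing when every element is missing
# (or when there are no elements at all), otherwise call fun.
safe(fun) = x -> (all(ismissing, x) ? missing : fun(x))

# Illustrative data (not from the original thread)
all_missing  = Union{Missing, Float64}[missing, missing, missing]
some_missing = Union{Missing, Float64}[1.0, missing, 3.0]

# skipmissing leaves nothing to iterate, so all(...) is vacuously true
r1 = safe(median)(skipmissing(all_missing))    # missing

# At least one real value: median is called on the non-missing values
r2 = safe(median)(skipmissing(some_missing))   # 2.0

# Also works without skipmissing when everything is missing
r3 = safe(mean)(all_missing)                   # missing
```

Inside @combine this would be used in the same place as savefun, i.e. (safe(fun) ∘ skipmissing)(...).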
As a side note, you don't need to do {string(vbl)}; you don't gain anything from vbl being turned into a string if it's a symbol. And I would probably find it confusing if other objects like Ints were being turned into strings such that vbl = 1 would not return the first column but a column named "1".
In R one could often use ... for this. Sorry, I have no code editor open.
Regarding the string "thing": thanks, I need to check why I thought I needed it. Maybe a leftover from when I wanted to manipulate the name before.
Btw I believe making such convenience functions available in a high-level package would probably increase the user base of Julia a lot. Basically transferring R purrr to Julia…
I have not used purrr but my experience with Julia has been that after learning the primitives, the need for convenience functions is reduced because they are usually not that hard to build on the fly. Of course, if one keeps redefining the same helpers, a package would be better. The space of possible helper functions is just very large, and if people don't know that one specific function exists in a package, they will redefine it anyway.
I really only have the user perspective but IMO for many people the high-level, well-structured, consistent and intuitive functions in the tidyverse really made a difference. Of course (and that's nice) many exist in Julia already natively, but still…
How does purrr solve this problem? The functionality of purrr is pretty much fully captured by map and broadcast in Julia. It doesn't help with handling missings, iirc.
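For readers coming from R, a small sketch of what map and broadcast look like in Base Julia. The data and variable names are made up for illustration. It also shows that broadcasting propagates missing instead of skipping it:

```julia
xs = [1, 4, 9]

# map applies a function to each element
roots_map = map(sqrt, xs)

# broadcasting with the dot syntax does the same elementwise
roots_bc = sqrt.(xs)

# map over two collections in parallel
sums = map(+, xs, [10, 20, 30])

# missing is propagated elementwise, not skipped
with_missing = sqrt.([4.0, missing])
```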
That brings me to the question whether something like
function safe(fun)
(x; kwargs...) -> try fun(x; kwargs...) catch y missing end
end
is a "good" idea. The advantage is that it would also catch issues other than "just" the "all-missing" problem.
What are the disadvantages? Performance? It should be type-stable if fun is type-stable, right?
From a purist development point of view it is probably not good style, but for some "big" data science tasks it is at least convenient.
Try/catch comes with a performance penalty as far as I know; also, you'd catch any sort of error with this, even plain bugs like UndefVarErrors. So usually it's not a good idea.
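If one still wants a try/catch wrapper, one possible middle ground (just a sketch, not an established pattern from any package) is to turn only the expected error type into missing and rethrow everything else. The names safe_narrow, buggy and undefined_helper are hypothetical. The sketch assumes the error to silence is the ArgumentError that median throws for empty input; other functions may throw ArgumentError for unrelated reasons, so this is still fairly coarse:

```julia
using Statistics

# Hypothetical helper: only an ArgumentError becomes missing,
# any other exception (e.g. UndefVarError from a real bug) is rethrown.
function safe_narrow(fun)
    return function (x; kwargs...)
        try
            fun(x; kwargs...)
        catch err
            err isa ArgumentError ? missing : rethrow()
        end
    end
end

# Hypothetical buggy function: undefined_helper does not exist anywhere
buggy(x) = undefined_helper(x)

empty_result  = safe_narrow(median)(Float64[])         # median of empty input throws ArgumentError
normal_result = safe_narrow(median)([1.0, 2.0, 3.0])   # 2.0

# The bug is not swallowed: the UndefVarError still reaches the caller
bug_propagates = try
    safe_narrow(buggy)([1.0])
    false
catch err
    err isa UndefVarError
end
```

This still pays the try/catch cost, so the explicit all(ismissing, x) check remains the cheaper option when the all-missing case is the only one of interest.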
@mreichMPI-BGC - see the discussion in Consider allowing default in quantile and median · Issue #132 · JuliaStats/Statistics.jl · GitHub (and maybe comment there what you think from a user's perspective). That discussion is exactly about how to design such things correctly. By correctly I mean, e.g., avoiding exceptions when they should indeed be avoided, while not covering up exceptions that you want to be raised unconditionally, like an OutOfMemoryError, and at the same time ensuring that operations are still fast.