And I’m saying that this is as dangerous as missings vanishing in any other scalar operation. If you prefer, the example could’ve been about a comparison <5.5 instead of a units conversion, and the mean would describe the proportion of the team shorter than 5 1/2 feet. The mean needs [true, missing, false, missing] to divide by the 2 people that showed up; [true, false, false, false] looks the exact same as if all 4 were present. If I really need the latter, I can replace missing with false myself (destroying information), but I can’t possibly do the inverse (creating information). If you really want to skip missings, propagate those missings in scalar operations so there’s something to skip in folding operations.
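A minimal sketch of the argument above, assuming a four-person team where two people did not show up. The `heights` values are made up for illustration and are not data from this thread:

```julia
using Statistics

# Hypothetical heights in feet; two of the four team members did not show up.
heights = [5.2, missing, 5.9, missing]

# The scalar comparison propagates missing, so the absences stay visible.
shorter = heights .< 5.5

# Folding with skipmissing divides by the 2 people who showed up.
prop_present = mean(skipmissing(shorter))

# Replacing missing with false first destroys that information:
# the mean now divides by all 4.
prop_all = mean(coalesce.(shorter, false))

println(prop_present)  # 0.5
println(prop_all)      # 0.25
```

The two results differ only because the comparison kept the missings around long enough for the fold to decide what to do with them.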
there has to be somewhere to draw the line; not every function can propagate missing, but you raise good arguments as to why some of them should
with that being said, I think my ideal world would be:
- binary operators propagate,
- statistical moments and quantiles (and cor and cov) skip,
- everything else MethodError,
so mean(heights .> 5.5) would work just fine
We should just use some unicode. This seems appropriate:
julia> mean(🙈([1.0, missing, 2.0]))
1.5
(sorry)
I do think this is the most likely solution to actually happen, because there’s a tradeoff here between correctness/explicitness and conciseness. A macro gives us most of the conciseness (just apply the macro wherever you need) without the problem of making it easier for users to write incorrect code.
Personally I’m not sure we even want a macro for this, because it encourages absentmindedly dropping missings instead of careful modeling, but I think it’s a reasonable compromise most people can live with (even if it’s not their first choice).
By the way, I’d like to point out MICE.jl could make this a lot easier by giving a good default method for imputation, especially if you combined it with MonteCarloMeasurements.jl (so you can use any imputed values like normal numbers).
I would encourage you, as other commenters sharing similar sentiments have been previously encouraged in this thread, to try not to assume that a desire to make skipmissing easier to use is born only out of sloppiness, recklessness, ignorance, or laziness
I have carefully considered my data, and I very intentionally want to skip the missings. and I need to do it frequently. and the verbosity that requires makes me painfully more present-minded than I would otherwise want to have to be
I’m not assuming that every case is like that; dropping missings can be a perfectly reasonable choice when you think your data is missing completely at random. But there’s definitely going to be some users who would drop missings accidentally, if it were done automatically (I’ve done this myself!).
This is why I think we should go with the compromise option of making it easy, but still explicit; explicit is better than implicit.
in many cases where missing values arise, they do not represent actual “missing observations.” sometimes they are just placeholders in a table where no observation is possible (e.g. in some diagonal joins)
I think that falls under missing completely at random (if the populations for each table are identical). (And if they’re not, you’d need to be explicit about what you’re modeling/estimating.)
If removing the name skipmissing entirely was unpopular because it inherently removes the description, we could alias it so what we write is shorter but it prints the same, like const sm = skipmissing or import Base: skipmissing as sm. Probably still not as short as one would like.
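A small sketch of the alias idea, using the `const` form (the sample vector `x` is made up for illustration). The `import Base: skipmissing as sm` form needs Julia 1.6 or later and cannot be combined with the `const` definition in the same module:

```julia
using Statistics

# A constant alias: `sm` is the very same function object as `skipmissing`.
const sm = skipmissing

x = [1.0, missing, 2.0]

println(mean(sm(x)))         # 1.5
println(sm === skipmissing)  # true
println(sm)                  # skipmissing
```

Because the alias is just another name for the same function, anything that prints it still shows the full name.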
I’m sure there is a use-case for MICE.jl, but I wouldn’t use it in analysis.
- It’s far simpler to simply say “here are the results of the population for which I have data” and then assess reasons for missing data and plausible assumptions.
- What is the limiting distribution of sigma(x, N) as N approaches infinity? We would like this to converge to a normal distribution and use standard statistical tests. If we impose sigma(f(x), N) where f is some multiple-imputation method, what is the limiting distribution then?
More of an FYI, in Query.jl I’m using DataValues.jl to represent missing values. The original reason was a technical one unrelated to the semantics of how missings propagate, but they also happen to have slightly different semantics. In particular, they (mostly) propagate missings, but not through predicates. So, one is way less likely to end up in 3VL world. At least for me that has worked pretty well.
A pattern I use a lot and that I think works quite well in Query.jl is to drop missing values early on in a query:
data |>
@dropna(:colA, :colC) |>
@filter(_.colB==3) |>
...
This will drop rows that have a missing in either colA or colC. Because these queries are all functional and non-mutating, one can do this pretty easily on a per-query basis, and at least for me in practice the syntax is concise enough to not be too painful.
This is a good idea, and emulates Stata’s gen y if x == 1 syntax quite closely. DataFramesMeta.jl has explored providing a convenient syntax for it but I have not gotten around to a fully-formed PR.
The idea behind SkipMissingByDefault is pure convenience for those who want to opt into that behavior. Under the hood it uses skipmissing. The package is meant to cover the superficial convenience case for programmer ergonomics. It is not meant as a deep engineering solution. Rather it is a consumer of such solutions. Overall, it provides a one-line solution to change the default behavior at the end-user level.
This is a superficial request that the original poster acknowledges is one of preference. Instead of answering a superficial request with a complex engineering solution, let’s meet it with a superficial solution. In no way am I using “superficial” here with a disparaging connotation.
Moreover, my example points out that the Julia Language is flexible enough to accommodate this request for better or for worse. My example should also inspire a question whether such a package is necessary. The user can change the default behavior of Main.mean.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.3 (2023-08-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Statistics
julia> mean(x) = Statistics.mean(skipmissing(x))
mean (generic function with 1 method)
julia> @which mean
Main
julia> mean([1.5 2.0 missing])
1.75
We can override a function name with our own function (as opposed to overloading a method) by making sure to state the new definition in the current module before using it.
If we use it before redefining it, we get an error.
# julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.3 (2023-08-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Statistics
julia> mean([1.5 2.0 missing])
missing
julia> mean(x) = Statistics.mean(skipmissing(x))
ERROR: error in method definition: function Statistics.mean must be explicitly imported to be extended
Stacktrace:
[1] top-level scope
@ none:0
[2] top-level scope
@ REPL[3]:1
Note that we do not want to import Statistics.mean as the error suggests since this will change the behavior for all packages. Rather we just want to change the behavior of mean in the current module.
It would probably make sense to contain methods using this altered definition in a module to limit its scope.
module MyProject
    module SkipMissingByDefault
        using Statistics: Statistics # only import Statistics and not mean
        mean(x) = Statistics.mean(skipmissing(x))
        mean_plus_one(x) = mean(x) + 1
    end
    using .SkipMissingByDefault: mean_plus_one
end
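As a rough check that this scoping trick leaves Statistics.mean alone, here is a self-contained sketch. The module name `ScopedMeanDemo` and the sample data are hypothetical and only for illustration:

```julia
module ScopedMeanDemo  # hypothetical module name, for illustration only
    using Statistics: Statistics  # bring in the module name only, not `mean`
    # A new function local to this module, not a method of Statistics.mean.
    mean(x) = Statistics.mean(skipmissing(x))
    mean_plus_one(x) = mean(x) + 1
end

using Statistics

data = [1.5, 2.0, missing]

# The module-local definition skips the missing value.
println(ScopedMeanDemo.mean_plus_one(data))  # 2.75

# The original function is untouched and still propagates missing.
println(Statistics.mean(data))  # missing

# The two are distinct function objects.
println(ScopedMeanDemo.mean !== Statistics.mean)  # true
```

Code outside the module, including other packages, keeps seeing the propagating behavior.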
This is more evidence for the point that missing should be split into different types, not both Missing. Values with different interpretations and desired behavior probably shouldn’t have the same type.
FWIW from a non-regular user of DataFrames, I strongly prefer the full skipmissing. To me, seeing mskip in code would be no more clear than sm. I could guess that something is getting skipped, but since I don’t do a lot of data science I would not guess off-hand that it would skip missing values.
I wouldn’t make that grain of salt too big when thinking about the poll results.
And while I’m here, again as a casual user only, I want the default behavior to be the most correct statistically. If I have a column of data with 100 values and 10 missings, then I want all my calculations to take place on the valid 90 data points only and do not want the package/function to make assumptions about values. Using the canonical example here: if I do x < 100, then I either want a missing back or an error. I definitely do not want a boolean back.
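For reference, a short sketch of what current Julia does with that comparison, assuming `x` is a single missing value:

```julia
x = missing

# Comparison operators propagate: the result is missing, not a Bool.
result = x < 100
println(ismissing(result))  # true

# Using that result as a condition is an error rather than a silent false.
threw = try
    if x < 100
        println("small")
    end
    false
catch err
    err isa TypeError
end
println(threw)  # true
```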
missing is hard to work with (Union{}…), so I believe a fair number of us (who do stats) still rely on NaN. Anyhow, dropping NaNs/missings is almost trivial in basic stats. It only becomes a challenge when you need to drop whole cases (y_i,x_i) because there’s a missing somewhere in x_i.
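One possible sketch of dropping whole cases in plain Base Julia, assuming a response vector `y` and a regressor matrix `X` with one row per case. Both names and all values are made up for illustration:

```julia
# Hypothetical response vector and regressor matrix, one row per case.
y = [1.0, 2.0, missing, 4.0]
X = [1.0 2.0; missing 3.0; 5.0 6.0; 7.0 8.0]

# A case is complete only if y_i and every entry of x_i are present.
complete = [!ismissing(y[i]) && !any(ismissing, view(X, i, :)) for i in eachindex(y)]

# Keep the complete cases and narrow the element type to Float64.
y_complete = Float64.(y[complete])
X_complete = Float64.(X[complete, :])

println(complete)    # Bool[1, 0, 0, 1]
println(y_complete)  # [1.0, 4.0]
```

The second case is dropped because of a missing in x_i even though y_i is present, which is the part that a plain skipmissing on each column would get wrong.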
this does not preclude mean giving you a float back
like I suggested above, I think it would make a lot of sense for operators e.g. < to continue to propagate as they do currently
As someone that rarely deals with missing values I found this topic interesting. Out of curiosity I thought I would do some basic stats (by hand) on this thread and categorize the contributors as wanting current behaviour, wanting skipmissing by default, or Listen and Learn without expressing a strong preference.
Total number of contributors (before mine): 26
Number wanting skipmissing by default: 2
Number wanting current behaviour: 12
Number listening and contributing: 12
Number of contributors with >= 10 posts: 2
Number of contributors with >= 5 but < 10 posts: 2
Conclusion: The development team did a good job listening to what users want, but there is always room for improvement, with some very thoughtful suggestions on what that may look like.
I think this actually highlights some of the biases that lead to the current difficult-to-use behavior. The average data analyst who is most annoyed by the verbosity of dealing with missing data probably is not on Discourse.
Until now I was missing…