Why are missing values not ignored by default?

No, it is not condescending and dismissive: data science, in my eyes, is a hugely important development.

Nonetheless, far from everything done, even in the best possible way by well-educated and trained professionals, is science.

In any case, I don’t think I can add anything to this discussion.


skipmissing is a footgun, and it’s always annoying to me when missing enthusiasts claim that everyone else should just litter their code with skipmissing and guard conditionals to compensate for the leaky way missing was implemented.

skipmissing can cause changes to the algorithm which can cause catastrophic loss of precision unexpectedly. For example:

julia> using Statistics  # mean lives in the Statistics stdlib

julia> let v = rand(Float32, 100_000_000)
           mean(v), mean(skipmissing(v))
       end
(0.49996603f0, 0.16777216f0)

My understanding is that the missing semantics were designed with the help of people who work extensively in data science, e.g. the DataFrames.jl authors. The design philosophy is described in this blog post: First-Class Statistical Missing Values Support in Julia 0.7. A key paragraph is:

In addition to being generic and efficient, the main design goal of the new missing framework is to ensure safety, in the sense that missing values should never be silently ignored nor replaced with non-missing values. Missing values are a delicate issue in statistical work, and a frequent source of bugs or invalid results. Ignoring missing values amounts to performing data imputation, which should never happen silently without an explicit request. This is unfortunately the case in some major statistical languages: for example, in SAS and Stata, x < 100 will silently return true or false even if x is missing[3]. This behavior is known to have caused incorrect results in published scientific work[4]. Sentinel approaches also suffer from bugs in corner cases: for example, in R, NA + NaN returns NA but NaN + NA returns NaN due to floating point computation rules.
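The three-valued behavior the post describes is easy to check at the REPL; a minimal illustration:

```julia
# Julia propagates missing through comparisons instead of silently
# guessing true or false the way SAS or Stata would:
missing < 100             # missing, not a Bool
missing == missing        # also missing
ismissing(missing < 100)  # true: the only definite answer available
```

This is why `missing` values surface loudly in downstream code instead of quietly disappearing into a Boolean.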


This is 100% my view and is one of the reasons I do data science in Julia. The “usual” semantics in R or Python are just broken.


This is quite surprising and should probably be filed as a bug?


I am a university teacher, did research as a PhD student using Julia for data analysis, and I am completely, 100% in favor of the current approach. Skipping missing by default is a nightmare that I do not want to deal with.


It’s not a bug, it’s the intended behaviour of mean hitting a generic fallback. There are so many different wrapper types out there in the world that specializing all of them to account for skipmissing is an unending game of whack-a-mole.

That said, I’m sure that if you submit a PR adding a specialized method for mean(::SkipMissing) it’d be appreciated, and I’d have to hunt down a new example to whine about.


fair enough, I retract that

still

unfortunately […] in SAS and Stata, x < 100 will silently return true or false even if x is missing

right, it should be false, because it is false

there are many ways to be wrong. the onus is on the user to handle missing values in their data correctly, and ideally not at the expense of convenience for everyone else

I understand that this is a safety / convenience tradeoff. it’s just that this tradeoff has never, ever, been worth it to me. and it annoys me so very much that I find it pretty surprising that this tradeoff is worth it for anyone at all.

of course, I keep receiving testimonials in these threads from those for whom it is worth it (e.g. @Henrique_Becker), but I cannot help but wonder if there is significant selection bias going on? that is, those continuing to use Julia for data analysis are only those willing to put up with (or actually prefer) the missing behavior


The basic issue is that sum for arrays can use pairwise summation, which has very slow (logarithmic) error accumulation but relies on random access.

For a generic iterator (like what is returned by skipmissing), in contrast, sum has to loop over the contents in sequence, which accumulates errors more quickly, and the difference is quite noticeable in low precision. Similarly, mean falls back to an in-sequence loop for generic iterators.

In general, for reduce functions the order/associativity of the reduction is implementation-defined and can vary for different container types, which affects floating-point roundoff errors.

Of course, we could implement a special-case pairwise reduction for SkipMissing{<:AbstractArray} iterators, and this might be a good idea because of how often this iterator type is probably used in statistics.
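The precision effect is easy to reproduce without skipmissing at all; a small sketch contrasting pairwise summation (what an Array gets) with a plain left-to-right fold (what a generic iterator falls back to), using a smaller array than the example earlier in the thread:

```julia
# Summing 2^25 ones in Float32.
# sum(v) on an Array uses pairwise summation and stays exact here;
# foldl(+, v) mimics the sequential loop used for generic iterators:
# its running total saturates at 2^24 = 16777216, because adding 1.0f0
# to a Float32 that large no longer changes it (only 24 significand bits).
v = fill(1.0f0, 2^25)

pairwise   = sum(v)        # 3.3554432f7, the exact answer (2^25)
sequential = foldl(+, v)   # 1.6777216f7, stuck at 2^24
```

The same saturation is what produced the `0.16777216f0` mean in the 100-million-element example above: 16777216 / 100_000_000.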


We’ve talked about this before. The options we’ve come up with are

  • Macros to skip missings. For example, @miss Statistics.mean(::Vector) can insert skipmissing.
  • Different statistics modules for different missing behaviors. Copy the interface of Statistics and StatsBase into MissingStats with functions that skip missings. Libraries will have to choose which one they depend on, which is not great for the end user. There’s an initial hump to getting the packages started, but otherwise they can be implemented piece by piece as people need each new function.
  • Different array types for different missing behaviors. For example, Statistics.mean(::MissingVector) can automatically skip missings. This is slightly asymmetric in that Array gets special syntax and most packages return Array, but hopefully not so bad. Likewise this can be implemented incrementally.
  • Dynamically scoped configuration. https://github.com/JuliaLang/julia/pull/50958 would allow mean to be implemented with a configuration setting for the missingness behavior that could be set by the user. I’m not sure if this is safe in general, since libraries that call statistical functions won’t be prepared for alternative configurations that the user might set — but we haven’t really worked through that thought experiment yet.
  • Add a new kind of <: AbstractMissing value that’s skipped automatically (from below).
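For illustration, the second option can be prototyped in a few lines; the module below is hypothetical, not an existing package:

```julia
# A toy version of the "MissingStats" idea: a module that mirrors the
# Statistics interface but skips missings by default.
module MissingStats

import Statistics

# Same name as Statistics.mean, but missing values are skipped.
mean(x) = Statistics.mean(skipmissing(x))

end # module

MissingStats.mean([1.0, missing, 3.0])  # 2.0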
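For illustration, the second option can be prototyped in a few lines; the module below is hypothetical, not an existing package:

```julia
# A toy version of the "MissingStats" idea: a module that mirrors the
# Statistics interface but skips missings by default.
module MissingStats

import Statistics

# Same name as Statistics.mean, but missing values are skipped.
mean(x) = Statistics.mean(skipmissing(x))

end # module

MissingStats.mean([1.0, missing, 3.0])  # 2.0
```

Libraries would opt into one set of semantics simply by choosing which module to call.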

Yeah, the 2nd and 3rd approaches are quite easy to implement in a separate package. Assuming there is indeed a sizeable fraction of people who want this behavior more easily accessible, it’s even a bit strange that there’s still no such package.

If you are able to work with a table structure for data science, you might be better off using SQL within Julia. You could register your DataFrame in DuckDB (I think that’s zero-copy) and then query the in-memory dataset using SQL. Aggregation functions in SQL ignore NULL (missing) values by default. I think your concern would apply equally in other stats languages like R (na.rm is FALSE by default), so SQL might be a better fit.
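A rough sketch of that route, assuming the DuckDB.jl and DataFrames.jl packages are available (the table name `df` here is arbitrary):

```julia
using DataFrames, DuckDB

df = DataFrame(x = [1.0, missing, 3.0])

# Register the DataFrame as a view DuckDB can scan, then query with SQL.
# SQL aggregates such as avg ignore NULL (missing) rows by default.
con = DBInterface.connect(DuckDB.DB)
DuckDB.register_data_frame(con, df, "df")

result = DataFrame(DBInterface.execute(con, "SELECT avg(x) AS m FROM df"))
# result.m[1] is 2.0: the missing row is simply skipped by avg
```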


Actually, we already have special mapreduce methods for SkipMissing, which are fast while still using pairwise summation (Add optimized mapreduce implementation for SkipMissing by nalimilan · Pull Request #27743 · JuliaLang/julia · GitHub). But since mean no longer uses sum, it doesn’t benefit from them. We should probably add a special method for it.


Don’t get me wrong, I understand the usefulness of not skipping missing values. My point is that this should be an option, not the default behavior. Overall, this is quite opinionated, and this post was actually part of the thread “What I don’t like about Julia”.

In some cases, you want to make sure you are not introducing mistakes. You could even do that at the final stage of the data analysis. However, even when missing is pointing out a mistake, right now I skip missings without thinking twice.

In general, given the distinction in Julia between NaN, nothing, and missing, I think missing should have been treated as a special data type for data analysis. Or add a data type (let’s call it undisclosed) that has this behavior. Then you could replace missing with undisclosed in your dataset and choose whichever behavior you prefer.
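A toy version of that `undisclosed` idea, with all names hypothetical:

```julia
using Statistics

# A sentinel type that reductions are allowed to drop silently,
# unlike missing. Purely illustrative, not an existing Julia type.
struct Undisclosed end
const undisclosed = Undisclosed()

# Filter the sentinel out before reducing:
skip_undisclosed(itr) = Iterators.filter(x -> !(x isa Undisclosed), itr)
mean_skipping(itr) = mean(skip_undisclosed(itr))

mean_skipping([1.0, undisclosed, 3.0])  # 2.0
```

With both types available, a dataset could mix `missing` (propagates loudly) and `undisclosed` (skipped quietly), making the choice explicit per value.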

Also, note that it’s not just skipmissing. It also affects other aspects, like this example I was giving to convert strings into numbers.

using Missings  # provides passmissing

x = ["0", "1", missing]
tryparse.(Int, x)               # errors: no tryparse method for missing
passmissing(tryparse).(Int, x)  # [0, 1, missing]

or when you want to filter a dataframe

x = [0, 1, missing]
x .== 0            # [true, false, missing]
x[x .== 0]         # errors: missing is not a valid index
isequal.(x, 0)     # [true, false, false]

I even had to create my own functions for comparisons with >=, etc.
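Such a helper can be a one-liner; the name `ge_nonmissing` is made up here for illustration:

```julia
# Treat missing as "comparison fails" instead of propagating it --
# precisely the behavior Base avoids on purpose:
ge_nonmissing(x, y) = !ismissing(x) && x >= y

ge_nonmissing.([0, 1, missing], 1)  # [false, true, false]
```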

I really think that if you want to ignore missings, you should just create a new dataset explicitly eliminating them from the parent dataset. This is exactly what I do when I want to do a bunch of really basic analysis where you’d normally “skipmissing”.

This can be done explicitly even while constructing the DataFrame from disk… so you don’t need to read in the whole data and then make a copy.

I don’t really understand “I just always want to skip missings” and then also… keeping them in your dataset. what’s the point?


when you work with a dataframe, missing can be in different rows and columns. So you can’t eliminate missing or you’d lose data from certain columns.
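DataFrames.jl’s `dropmissing` handles exactly this case: given a column selector, it only drops rows where that column is missing, so other columns keep their data. A sketch, assuming the package:

```julia
using DataFrames

df = DataFrame(a = [1, missing, 3], b = [4, 5, missing])

dropmissing(df)      # 1 row: only rows complete in every column survive
dropmissing(df, :a)  # 2 rows: drops only the row where :a is missing,
                     # keeping the missing in :b
```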


The number of cases where I don’t care about missing but am concerned about that issue is extremely small. I’m either concerned enough that I do imputation using a Bayesian model, or I have a ton of data and don’t care about this bias because other biases are more important. Obviously everyone’s mileage may vary.


Maybe the InMemoryDatasets package would be more appropriate. It’s fairly well documented in how it handles missing values, and it’s one of the fastest data-frame implementations across the three main languages for data science.

What do you mean it’s false? A missing often represents a value that wasn’t recorded for some reason, for example someone didn’t show up for the appointment to measure their blood pressure. Making x < 100 return false in this case would be wrong.

Imagine all the computations that would silently give wrong results if such a behavior was implemented… And users would not even realize that something subtle is going on, they would just get a number (often meaningless) with no warning.
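To make the point concrete: if missing comparisons silently returned false, the two “complementary” counts below would both drop the unmeasured patient, and nothing would flag the discrepancy:

```julia
bp = [95, missing, 120]  # blood pressures; one appointment was missed

count(skipmissing(bp .< 100))    # 1
count(skipmissing(bp .>= 100))   # 1
# 1 + 1 != length(bp): the missing measurement must be handled explicitly,
# rather than silently counted as "not less than 100".
```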


That’s also what I do most of the time: I load some data, clean them up (impute or filter out missings, take subsets, split them, etc.) and only then do the analysis. Yes, it doesn’t work for all cases, but it still reduces the need to use skipmissing & co. Plus, I prefer to make decisions about missing values in a single place.
