Why are missing values not ignored by default?

Because missing, as a value, is not equal to any other value in any meaningful sense.

I understand the idea behind three-valued logic, but I wish this had been opt-in via something like ThreeValuedLogic.jl instead of baked into Base, in functions as basic as ==
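For readers less familiar with what "baked into Base" means here, a short sketch of the current behavior, using only Base functions. The variable names are just for illustration.

```julia
# Current Base behavior: == with `missing` propagates instead of answering.
r1 = (1 == missing)              # missing, neither true nor false
r2 = (missing == missing)        # also missing

# Predicates that always return a Bool, even for missing:
e1 = isequal(missing, missing)   # true
e2 = isequal(1, missing)         # false
e3 = (missing === missing)       # true
```

So two-valued comparison is available, but under different names (isequal, ===) rather than under ==.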

Imagine all the computations that would … give … results

I imagine this frequently :upside_down_face:

1 Like

Would you agree that x < y implies y > x?

If missing < y is false, then y < missing should be true, but that contradicts even your expectations, no?

2 Likes

You can have both missing < y be false and missing > y be false. It’s not a number, it’s a missing value and doesn’t need to have the same semantics as the reals.

1 Like

Seems to me from the discussion that the best would be to have a package that deals with another type, such as unsafe_missing, that implemented all the functions desired, so that someone can just replace all the missings in their data with that one and move forward with their preferred choice.

4 Likes

I see. But if missing doesn’t need to have the same semantics as real numbers, why mix that semantics with the semantics of the reals?

Let’s say we skip missings by default (which the original discussion is about). Then we have x + missing = x (i.e., missing acts as “zero” here). What about x * missing? Does missing still act as zero here? That would not align with the “skip missing” expectations, right? So maybe it should act as one, i.e., x * missing = x. But then we lose, e.g., the distributive property a * (b + c) = a * b + a * c, because a * (b + missing) = a * b while a * b + a * missing = a * b + a.
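To make the distributivity argument concrete, here is a small sketch. The functions skipadd and skipmul are hypothetical, written only to model the "skip" semantics described above; they are not Base behavior.

```julia
# Hypothetical "skip" semantics (NOT Base): a missing operand is dropped,
# so it acts like 0 under addition and like 1 under multiplication.
skipadd(x, y) = ismissing(x) ? y : ismissing(y) ? x : x + y
skipmul(x, y) = ismissing(x) ? y : ismissing(y) ? x : x * y

a, b = 3, 4
lhs = skipmul(a, skipadd(b, missing))               # a * b      = 12
rhs = skipadd(skipmul(a, b), skipmul(a, missing))   # a * b + a  = 15

# Base propagation keeps both sides consistent: both are missing.
base_lhs = a * (b + missing)
base_rhs = a * b + a * missing
```

Under the skip semantics the two sides disagree (12 vs 15), while with propagation both sides are missing.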

So yeah, imho the current behavior is only reasonable. The user should decide the semantics of missing.

10 Likes

It would make it easier to do something like that if we added a new supertype AbstractMissing >: Missing to the Base package. (This can be done in 1.x — adding a new supertype is not considered a breaking change IIRC.)

That way, a lot of packages like DataFrames.jl could more easily support alternative missing types, e.g. an ImputedMissing type with some kind of semantics of automatic imputation, without having to specifically depend on new packages.

17 Likes
  1. The only Pareto improvement at the moment is to figure out better nicknames for common operations. skipmissing is simply too long for people to want to type. I think everyone would be happier if we had

    • mskip for skipmissing
    • mpass for passmissing
    • A new name for proposed spreadmissing functionality in a PR here which is being worked on and will provide more abstract array behavior for skipping missing values.
    • Have behavior along the lines of MissingsAsFalse.jl merged into Missings.jl, providing a macro to coerce Boolean operations with missing to false.

    With shorter names and more macros, Julia would make it significantly easier to work with missing values. Those coming from Python or Stata might think Julia is tough on the user, but R is not so different from Julia at the end of the day. In R you have to write mean(x, na.rm = T) to skip missing values. No one claims R has slow uptake for data analysis because of na.rm = T. I really think if skipmissing(...) wasn’t so long and annoying to type, we would see far fewer complaints.

  2. Making a new missing value is not a good avenue. If we make IgnoreMissing a type in a new package, we would have to have overloads for a lot of statistical functions in Base or Statistics, like mean, cor, sum, etc. It’s one thing to add missing support for Base functions, but adding it for new packages would involve convincing every author of a new statistical package to support the IgnoreMissing type. Given the responses in this thread, many authors do not believe missings should ever be skipped. It would involve having the same (tired) arguments about missing semantics over and over again for every package. In practice, IgnoreMissing would simply error or require its own skipmissing function, making it indistinguishable from the existing problems.

  3. Many of the solutions proposed in this thread are not realistic solutions to the challenges people working with messy data face.

    • Dropping missing at the dataset level: This is not feasible when there are many columns that contain missing values. Simply calling dropmissing(df) would result in no observations.
    • Assuming missing stems from data errors, or that there is some value to impute missing to: For example, my dataset may contain a :wage variable and an :employed variable. If someone is not employed, their :wage is missing. There is no imputation I can make to make :wage make sense. I just have to skip it when calculating the mean. And this is also the statistical right thing to do.
    • The conflation of missing and errors in the code. Many replies in this thread treat missing propagation as good because it prevents errors, and chastise those of us who work with missing values as being lazy. If the only thing that prevents you from finding errors in your data is a missing showing up, then you have much bigger problems. We spend lots of time assessing data integrity even before beginning analysis. Julia’s job should be to assist in analysis, trusting the researcher to know what’s right, rather than paternalistically putting guardrails up. It also doesn’t help that you can get missing with only one missing value out of 1 million, so propagation of missing can be extremely punishing compared to any actual data issues.
  4. I am very sympathetic to Julia developers and those who write packages and how missing propagation can make it more difficult to maintain large code-bases. Someone doing data analysis might prefer missing < y to be false (the semantics being, < is only true if we are sure the value is less than y, which is perfectly acceptable), but someone maintaining, say, the VS Code extension will not want that behavior.

    I don’t have a great answer to this. On the one hand, the vast majority of users probably want y < missing to be false and mean([1, missing]) to return 1. On the other hand, growing Julia’s user-base also involves well-functioning packages, in particular those written by major corporations who purchase consulting services that fund lots of Julia development.

    Nonetheless I think there could be a fair trade-off that package developers are the ones who have to use functions like errormissing (which does not exist at the moment). Package developers are, after all, the ones who are more specialized and skilled at handling edge cases. This is an unpopular position among both Base devs and package devs.

    Beyond Boolean comparisons, it’s hard to draw the line between “user” analysis code and “developer” code. Should missings propagate for every operation on Char, even when data analysis rarely involves working with Chars?
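As an illustration of the :wage / :employed case from point 3, here is a minimal sketch with made-up numbers (the data is invented purely for the example). It shows what the current explicit spelling looks like when missing is structural rather than an error.

```julia
using Statistics

# Made-up data: wage is missing exactly when the person is not employed.
employed = [true, false, true, true, false]
wage     = [25.0, missing, 30.0, 20.0, missing]

plain_mean = mean(wage)                # missing: one missing poisons the result
mean_wage  = mean(skipmissing(wage))   # 25.0, averaged over the employed only
n_used     = count(!ismissing, wage)   # 3 observations contribute
```

Nothing is imputed here; the unemployed rows are simply not part of the statistic, which is the behavior argued for above.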

Given these trade-offs, I want to re-iterate point 1. We simply need better syntax for handling missing values. The names passmissing and skipmissing should be shortened, and we should write more specialized methods for common functions to ensure summations are correct.
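Nothing stops a user from trying the shorter names locally today. A rough sketch: mskip and mpass are the hypothetical nicknames from point 1, and mpass here is a simplified stand-in written for this example, not the actual passmissing implementation from Missings.jl.

```julia
# Hypothetical short alias for Base's skipmissing.
const mskip = skipmissing

# Simplified stand-in for passmissing: return missing if any positional
# argument is missing, otherwise call f as usual.
mpass(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

x = [1, missing, 3]
total = sum(mskip(x))        # 4

up = mpass(uppercase)
shouted = up("a")            # "A"
passed  = up(missing)        # missing, instead of a MethodError
```

Whether such names belong in Missings.jl is of course the open question; this only shows the idea costs two lines in user code.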

DataFramesMeta.jl provides some features to make this easier, such as dropping missing values in @subset, adding the @passmissing macro-flag, and hopefully more in the future.

8 Likes

I’ve experimented with custom missing values (but with a different goal) in TypedMissings.jl, but it’s currently blocked by some hard issues in Julia:

3 Likes

From the blog article by Kamiński, which I’ve already cited here

Such opinions are always debatable, so recently I decided to run a small poll on Julia Discourse about the skipmissing function. The question was if we want to shorten the skipmissing name into something that is more convenient to use in interactive work. To my surprise, a vast majority of voters preferred a verbose and explicit operation name.

3 Likes

sm is a pretty small name compared to mskip. So I would take that poll with a grain of salt.

1 Like

This is horrifying to me, true/false is not missing data to me and I want to distinguish it from missing data.

That would imply the missing data is ==y in which case it’s not really missing; this would be dangerous if I decided to do .!((data_vector .< y_cutoff) .|| (data_vector .> y_cutoff)), probably in separate steps, instead of data_vector .== y_cutoff. barucden was implying these silent implications are dangerous and likely why operations on missing return missing.
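A sketch of that failure mode, with made-up data. The "both comparisons are false" semantics is emulated with coalesce; it is a hypothetical alternative, not how Base behaves.

```julia
data_vector = [1, missing, 3, 2]
y_cutoff = 2

# Hypothetical semantics (NOT Base): missing < y and missing > y are both false.
lt = coalesce.(data_vector .< y_cutoff, false)
gt = coalesce.(data_vector .> y_cutoff, false)

# "Neither less nor greater" now silently claims the missing entry equals y_cutoff.
looks_equal = .!(lt .| gt)          # [false, true, false, true]

# Base: the direct comparison keeps the missing entry visibly unknown.
eq = data_vector .== y_cutoff       # [false, missing, false, true]
```

The second element is the problem: under the hypothetical semantics it is reported as equal to the cutoff, while the direct comparison reports it as missing.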

5 Likes

I don’t want to get into a discussion about the exact semantics of missing in this thread. Obviously reaching agreement on these semantics is difficult, if not impossible. I only emphasize that missing is not a number and comparisons don’t need to imply completeness or transitivity.

I think the whole point here was that these decisions (missing three-valued logic, whether to skip or not, etc) were intended to be made by the developer, which is consistent with the approach taken in a large number of other cases. So, totally fine if you want to ignore missing values by default, that is your decision to make. Your complaint then is about how easy (or not) that is to accomplish.

2 Likes

This and your previous comments are indeed discussing the semantics of missing, so I’m not sure how you’d prefer avoiding the topic. If you mean you want to avoid the theory, sure, let’s talk practicality. I definitely think it’s impractical and buggy if operations on missing values return anything but missing by default. Having a missing value is important for skipping them, too. Say I have [false, true, false] and want to divide the number of trues by the number of comparisons of real data. How could I possibly know which false wasn’t computed from missing values?
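A small sketch of that counting problem, with invented data. With propagation the denominator can still be recovered; after coercing to false it cannot.

```julia
x = [1, 5, missing]

# Base: the comparison keeps track of which result came from missing data.
cmp = x .> 3                            # [false, true, missing]
n_true = count(skipmissing(cmp))        # 1
n_real = count(!ismissing, cmp)         # 2 comparisons on real data
share  = n_true / n_real                # 0.5

# If the comparison returned false for missing, that information is gone.
coerced = coalesce.(cmp, false)         # [false, true, false]
coerced_share = count(coerced) / length(coerced)   # 1/3
```

From coerced alone there is no way to tell that the last false came from a missing value.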

3 Likes

It’s the data analyst’s job to ensure data integrity. The question “which observations contribute to this statistic” is something Julia, the language, can’t answer. The analyst should absolutely conduct additional robustness checks about how missing values are handled and what’s the appropriate way to deal with them.

The question is whether imposing skipmissing(...) or propagation on Boolean operations is the right way to go about that. It’s costly for users to write skipmissing every time they wish to calculate the mean. I’m simply making an argument that the cost isn’t always worth the benefits.

1 Like

Couldn’t someone just write a package called SkipMissingByDefault that implements the following?

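module SkipMissingByDefault
import Statistics  # needed so the qualified call Statistics.mean resolves
# Drop missing values, then delegate to the standard mean.
mean(x::AbstractArray; kwargs...) = Statistics.mean(skipmissing(x); kwargs...)
end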

Then the user can do the following.

import SkipMissingByDefault as sm  # `using X as y` is not valid syntax; `import X as y` is
sm.mean(...)

Or they could just import it to use the modified SkipMissingByDefault.mean rather than Statistics.mean.

using SkipMissingByDefault: mean
mean(...)

I’m pretty sure Statistics.mean will not change its default behavior. That would be both breaking and likely lead to correctness mistakes. The most I could see would be a new keyword argument for convenience.

2 Likes

That argument is not parallel to whether missings should propagate something other than missing, in fact it undermines it. Say I want the average height in a polo team but only two people showed up [5.2, missing, 5.6, missing]. Well I had a yardstick, but everyone else is more comfortable with metric, so I convert feet to centimeters…and the missings get replaced with dummy values [158.496, 0.0, 170.688, 0.0] (it could return false, but it just gets auto-converted to Float64). SkipMissingByDefault.mean can’t do any different than Statistics.mean anymore. But wait, maybe we can skip all zero(T), after all nobody is 0cm tall. But what about cases where zeros are valid data, or what if a type doesn’t have a zero? It’d be nice if we had a dedicated singleton type to skip in folding operations, whether manually or automatically, and they didn’t vanish in the preceding elementwise scalar operations.
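For comparison, a sketch of what the polo example looks like with the current semantics, using the heights from the post and the standard 30.48 cm per foot. The missing entries survive the unit conversion and are only skipped in the final reduction.

```julia
using Statistics

heights_ft = [5.2, missing, 5.6, missing]

# Elementwise conversion: missing stays missing, no dummy zeros appear.
heights_cm = heights_ft .* 30.48

n_missing = count(ismissing, heights_cm)        # still 2
mean_cm   = mean(skipmissing(heights_cm))       # about 164.59
```

Because the singleton is still there after the elementwise step, the reduction can skip it explicitly.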

1 Like

I would consider that approach worse than what we currently have in Missings.jl

SkipMissingByDefault would have to write new versions of all Base and Statistics functions, and it wouldn’t be able to handle new third-party packages.

By contrast, skipmissing and the future spreadmissings features will allow users to interact with any function that allows iterators or AbstractVectors, even those not written yet.

But yes, we do need to write safer methods for ::SkipMissing in order to protect against floating point summation differences.

Our current approach is very flexible and leverages Julia’s multiple dispatch and functions-as-objects approach nicely.
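A sketch of what that flexibility means in practice. range_width below is a hypothetical stand-in for some third-party function that knows nothing about missing; it only assumes its argument can be iterated.

```julia
# Hypothetical "third-party" function, written with no knowledge of missing.
range_width(itr) = maximum(itr) - minimum(itr)

x = [2.0, missing, 7.5, 4.0]

w_skip  = range_width(skipmissing(x))   # 5.5
w_plain = range_width(x)                # missing, via ordinary propagation
```

No missing-specific method had to be written for range_width; the caller decides by wrapping the input.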

I’m not arguing that missing / x should evaluate to false, which you seem to be implying by this post. I’m discussing a scenario where x < y should imply “We are absolutely sure that x is less than y”. This obviously has some costs, I admit. Also, see my above post where I advocate for macro-based approaches to make this easier.
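That "absolutely sure" reading can already be spelled out by hand. A minimal sketch, where surely_less is a hypothetical helper name and not part of Base or Missings.jl:

```julia
# Hypothetical helper: true only when the comparison is known to hold.
surely_less(x, y) = coalesce(x < y, false)

s1 = surely_less(1, 2)          # true
s2 = surely_less(missing, 2)    # false: we are not sure
s3 = surely_less(3, missing)    # false: we are not sure

x = [1, missing, 5]
kept = filter(v -> surely_less(v, 3), x)   # only the 1 is kept
```

A macro-based approach would essentially apply this rewrite to every comparison in an expression, so the user opts in per expression rather than globally.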

Mostly I think this is just a more generic problem, of poor support for . The problem is Julia uses the iterator interface for everything vector-related (functional primitives like map, filter, reduce, etc. all live in Iterators.X), even though most data manipulation in Julia works on vectors, rather than lazy one-at-a-time iterators.

Transducers.jl and OnlineStats.jl fix this, but most people don’t use them because they’re not in the stdlib/StatsBase, respectively.