Why are missing values not ignored by default?

CameronBieganek · November 28, 2023, 8:02pm

Still, I don’t think skip(mean)(x) is any more convenient than mean(skipmissing(x)), and when you add a keyword argument to skip it gets pretty ugly:

skip(cor; use=:pairwise_complete_obs)(X)

(R has 4 skip-missing options for cor, in addition to the default of just propagating NA.)

pdeffebach · November 28, 2023, 8:08pm

skip() is shorter than na.rm=T, so I think we at least get to R parity with this change. My assumption is that skip(cor)(x, y) would always use complete observations of x and y jointly. I think a lot of the benefit stems from multi-argument functions, where skipmissings requires splatting and collecting.
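
To make the "complete observations of x and y jointly" idea concrete, here is a minimal sketch in plain Julia. The name `skipjoint` is hypothetical (it is not the proposed `skip` and not part of any package); it only illustrates dropping every position where any argument is missing before calling the wrapped function.

```julia
using Statistics

# Hypothetical helper (not part of any package): wrap `f` so that it only sees
# the positions where none of its vector arguments is missing.
function skipjoint(f)
    return function (args::AbstractVector...)
        keep = trues(length(first(args)))
        for a in args
            keep .&= .!ismissing.(a)
        end
        # `identity.` narrows the element type once the missings are gone
        cleaned = map(a -> identity.(a[keep]), args)
        return f(cleaned...)
    end
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 11.0]

r = skipjoint(cor)(x, y)   # uses positions 1, 4 and 5 only
m = skipjoint(mean)(x)     # same value as mean(skipmissing(x))
```

With a single argument this reduces to the usual `mean(skipmissing(x))`; the difference only shows up for multi-argument functions such as `cor`.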

jar1 · November 28, 2023, 8:13pm

It’s not really related to generic programming. These are different functions in different modules – that the functions happen to have the same name is fine.

CameronBieganek · November 28, 2023, 8:16pm

I understand that it works, but it is highly misleading to someone reading the code when they see mean(x). That’s why I said it goes against the spirit of generic programming in Julia.

dlakelan · November 28, 2023, 8:20pm

Agreed, I’d probably prefer something like:


let mean = sm.mean, sum=sm.sum
   ...tens or hundreds of lines of code goes here...
end

Of course, you could also do:

@sm begin
... tens or hundreds of lines of code goes here
end

and let the sm macro set things up for you

nalimilan · November 28, 2023, 8:23pm

I also think that macros are the best solution. @pdeffebach Don’t you think that better macros in DataFramesMeta would greatly alleviate the pain of working with missing values? Ideally one could add at the top of a block @passmissing, @skipmissing or a new macro which would combine both behaviors, and everything would work.

CameronBieganek · November 28, 2023, 8:24pm

What about skip(cor)(X), where X is a matrix like this?

3×3 Matrix{Union{Missing, Int64}}:
 1         2          missing
 3          missing  4
  missing  5         6

Then you have to decide if you want to use complete observations or pairwise-complete observations.
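
A small sketch of why the choice matters for this particular matrix, using only Base Julia. The helper `pairwise_rows` is a hypothetical name introduced for illustration.

```julia
X = [1 2 missing; 3 missing 4; missing 5 6]

# Complete observations: rows with no missing entry at all.
complete_rows = [i for i in axes(X, 1) if !any(ismissing, view(X, i, :))]

# Pairwise-complete observations: for columns (j, k), the rows where both
# entries are present. `pairwise_rows` is a hypothetical helper.
pairwise_rows(X, j, k) =
    [i for i in axes(X, 1) if !ismissing(X[i, j]) && !ismissing(X[i, k])]

rows_12 = pairwise_rows(X, 1, 2)
rows_13 = pairwise_rows(X, 1, 3)
rows_23 = pairwise_rows(X, 2, 3)
```

Here no row is complete, so the complete-observations rule leaves nothing to correlate, while the pairwise rule keeps exactly one row per pair of columns. The two conventions can therefore disagree sharply on the same input.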

pdeffebach · November 28, 2023, 8:27pm

I agree in the sense that @passmissing, or just @pass and @skipmissing, or just @skip would be useful. But these would be thin wrappers around passmissing etc. rather than a general find-and-replace macro people are suggesting.

This is very easy to do in DataFramesMeta because DataFrames transformations have a single function as their entry point.

I don’t think that a @sm begin ... end macro is feasible. Clearly we don’t want passmissing to wrap every single function in a block, or to use skipmissing for everything. Outside of a few functions like mean and cor, it is hard for a macro to determine which functions should be wrapped.
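
A rough illustration of why a single blanket wrapper does not fit every function. The `pass` helper below is a hypothetical Base-only stand-in for the pass-through idea (return missing if any argument is missing); it is a sketch, not the Missings.jl implementation.

```julia
# Hypothetical stand-in for the "pass missing through" behavior.
pass(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

labels = ["ann", missing, "bob"]

# Element-wise functions want pass-through semantics...
upper = pass(uppercase).(labels)

# ...while reductions want the missing entries skipped.
total = sum(skipmissing([1, missing, 3]))

# Wrapping a reduction in `pass` does not help: the argument is a vector,
# not `missing`, so `sum` still runs and still propagates missing.
still_missing = pass(sum)([1, missing, 3])
```

Which of the two behaviors is wanted depends on the function, which is the information a block-level macro would have to guess.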

dlakelan · November 28, 2023, 8:30pm

have it be an explicit argument to the macro which is optional…

@sm [:mean,:cor] begin
...

end

## or, get the default behavior which would be documented

@sm begin

...

end

The biggest remaining issue is how to handle complete observations, etc.
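
For what it is worth, a macro with an explicit function list could be sketched roughly as below. This is only an assumption about how such an `@sm` might work: the macro and the `rewrite` helper are hypothetical, the sketch wraps every positional argument of the listed calls in `skipmissing`, and it ignores keyword arguments entirely.

```julia
using Statistics

# Leave symbols, literals and line-number nodes untouched.
rewrite(ex, fnames) = ex

# Recursively wrap the arguments of the selected calls in skipmissing.
function rewrite(ex::Expr, fnames)
    args = Any[rewrite(a, fnames) for a in ex.args]
    if ex.head === :call && args[1] in fnames
        wrapped = [:(skipmissing($a)) for a in args[2:end]]
        return Expr(:call, args[1], wrapped...)
    end
    return Expr(ex.head, args...)
end

# Hypothetical macro: `@sm [:mean, :sum] begin ... end`
macro sm(names, block)
    # `[:mean, :sum]` arrives as Expr(:vect, QuoteNode(:mean), QuoteNode(:sum))
    fnames = Symbol[q.value for q in names.args]
    return esc(rewrite(block, fnames))
end

x = [1, missing, 3]
result = @sm [:mean, :sum] begin
    mean(x) + sum(x)
end
```

This works for `mean` and `sum` because both accept the iterator returned by `skipmissing`. It would not work as-is for `cor`, which needs jointly complete observations across its arguments, so the issue raised above remains.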

nalimilan · November 28, 2023, 8:32pm

Yet it seems that spreadmissings would cover a lot of use cases, right? We could have @spreadmissings which would wrap all functions applied to data frame columns in spreadmissings. Of course there are exceptions like quantile(col, [0.1, 0.9]) but these are relatively rare.

pdeffebach · November 28, 2023, 8:35pm

Yes. I need to finish the PR. Hopefully that will make things a lot easier.

adienes · November 28, 2023, 8:38pm

I think it would really be a shame to converge on a “solution” that does not include making quantile nice. The main motivating functions for me are moments, quantiles, and cov/cor.

nalimilan · November 28, 2023, 8:41pm

Quantiles are super special as it’s one of the rare statistical functions which takes a vector of data and a vector of quantile positions. But we could work around this using various solutions, including passing a tuple instead of a vector of positions so it’s not too much of a problem.

pdeffebach · November 28, 2023, 8:44pm

Very easy. With the current implementation of spreadmissings, we would just need Ref or some other wrapper to protect the [.1, .9].

jmboehm · November 28, 2023, 9:03pm

For what it’s worth (probably not much), I also feel the pain shared by several others of having to use skipmissings to the point of avoiding Julia for data work largely because of that. I also have the impression that the place to address this is at the level of DataFramesMeta or whatever other package people (may) use to actually manipulate their data (Query, etc), as long as this is technically feasible.

mkitti · November 28, 2023, 9:07pm

I agree, which is why I would personally tend to stick with the following.

using SM: SM

# ...

SM.mean(x)

Earlier I showed how one might use a module to contain the modified definition close to the call site, and I would still highly recommend that. For emphasis, I will do it again in this context.

module MySkipMissing
    using Missings.SkipMissing: mean
    function mean_squared(x)
        mean(x)^2
    end
end

At the end of the day though, I would still be willing to accommodate the behavior even if I do not agree with it. For quick experiments, I could see why someone might want to replace the aggregation methods quickly.

In the larger scope, what I want to demonstrate here is how flexible Julia is and the range of preferences that Julia can accommodate in a modular fashion.
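
A runnable sketch of the module pattern described above. The module `Missings.SkipMissing` in the snippet is a proposal, not an existing API, so this version defines its own stand-in: the modules `SM` and `Analysis` and the function `mean_squared` are illustrative names only.

```julia
# Stand-in for the proposed module: a `mean` that skips missing values.
# `import Statistics` brings in only the module name, so `SM.mean` is a new
# function and does not add methods to `Statistics.mean`.
module SM
    import Statistics
    mean(x) = Statistics.mean(skipmissing(x))
end

# The skipping behavior is opted into explicitly, close to the call site.
module Analysis
    using ..SM: mean
    mean_squared(x) = mean(x)^2
end

value = Analysis.mean_squared([1, missing, 2])
```

Code outside `Analysis` is unaffected: `Statistics.mean` keeps propagating missing there, which is the modularity point being made.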

skypuppy · November 28, 2023, 9:15pm

This will be an unpopular solution, but as someone who has designed and worked with many databases over the years, I propose incorporating a “standard” -missing- value while, at the same time, allowing each designer/user to establish their own substitution on a per-program or per-function basis. Anyone not using the “standard” is responsible for how it is handled in their code. Further, also allow a Julia-wide version of -missing- as a parameter that can be defined once in their environment and applies from then on. That person is responsible for propagating that value in any code or function where it is used. Their own “environment variable”, as it were.

Joseph_Bradley · November 28, 2023, 9:33pm

I strongly disagree with the idea that you always know whether there are missings in your data.

Furthermore, it’s good data science to carefully document how you deal with missing data. It’s so annoying reading papers that clearly have some strategy to handle missing data but don’t make it explicit - you are then stuck guessing said strategy whenever you want to reproduce something.

CameronBieganek · November 28, 2023, 9:46pm

This,

module MySkipMissing
    using Missings.SkipMissing: mean
    function mean_squared(x)
        mean(x)^2
    end
end

MySkipMissing.mean_squared([1, missing, 2])

seems needlessly verbose and complex compared to this:

using Missings

function mean_squared(x)
    smean(x)^2
end

mean_squared([1, missing, 2])

But ultimately we’re just bikeshedding now.

alfaromartino · November 28, 2023, 9:46pm

I don’t think that adding macros or shortening function names would solve the problem. In fact, I have a file that I load that implements some of the suggestions.

From all the options, the only one that makes sense to me is adding a new type. The reasons:

My concern is not about missing. Sometimes, I need to be careful about the treatment of missing. Rather, my concern is that missing isn’t ideal for some types of work, in which case it hinders the writing and reading of the code.
Ideally, sometimes you want to say: “these types of missings in particular are uninformative, so I don’t want to keep indicating this when I write x.>y, take the mean, cor, or whatever.”

This also leads to clean code, because you tell the reader “the missing values in this scenario should be ignored, and I’m certain that you can do it for all the code”.

Other options are troublesome, in particular for new users. When you start to add smean or short names like that, then someone who reads the code ends up thinking “what does smean do?”. I consider the option of directly shadowing mean even more problematic, because you don’t know what type of mean is used unless you check whether a corresponding package was loaded.

In this sense, I would’ve preferred ignoring missing to be the default, while still having both behaviors as options. This would mean isequal exchanging roles with ==, or having keepmissing instead of skipmissing.

Upon reflection after all the discussion, I think the best solution is adding a new type. This extends the possibilities, without restricting the ones we have.
