Why are missing values not ignored by default?

Still, I don’t think skip(mean)(x) is any more convenient than mean(skipmissing(x)), and when you add a keyword argument to skip it gets pretty ugly:

skip(cor; use=:pairwise_complete_obs)(X)

(R has 4 skip-missing options for cor, in addition to the default of just propagating NA.)

skip() is shorter than na.rm=T, so I think we at least get to R parity with this change. My assumption is that skip(cor)(x, y) would always use complete observations of x and y jointly. I think a lot of the benefit stems from multi-argument functions, where skipmissings requires splatting and collecting.
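To make the "complete observations jointly" assumption concrete, here is a minimal sketch of such a wrapper. The name `skipping` is a hypothetical stand-in for the proposed skip (chosen so it does not clash with Base.skip, which works on IO streams); it is not an existing API, and it assumes all arguments are vectors of equal length.

```julia
using Statistics

# `skipping` is a hypothetical stand-in for the proposed `skip(f)`.
# It keeps only the positions where no argument is missing, then calls `f`.
# Assumption: every argument is a vector and all have the same length.
function skipping(f)
    return function (args::AbstractVector...; kwargs...)
        # true at index i only if every argument is non-missing at i
        keep = [all(!ismissing, vals) for vals in zip(args...)]
        # drop the incomplete positions and narrow the element type
        complete = map(a -> collect(skipmissing(a[keep])), args)
        return f(complete...; kwargs...)
    end
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 11.0]

m = skipping(mean)(x)      # same value as mean(skipmissing(x))
r = skipping(cor)(x, y)    # uses positions 1, 4 and 5 only
```

With a single argument this reduces to mean(skipmissing(x)); the difference only shows up once there are two or more data arguments.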

It’s not really related to generic programming. These are different functions in different modules – that the functions happen to have the same name is fine.


I understand that it works, but it is highly misleading to someone reading the code when they see mean(x). That’s why I said it goes against the spirit of generic programming in Julia.

Agreed. I’d probably prefer something like:


let mean = sm.mean, sum = sm.sum
    ... tens or hundreds of lines of code go here ...
end

Of course, you could also do:

@sm begin
... tens or hundreds of lines of code goes here
end

and let the @sm macro set things up for you.
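For reference, the let version can be made runnable as below. This is only a sketch: the module `sm` and its contents are hypothetical, written here just so the rebinding has something to point at.

```julia
using Statistics

# Hypothetical module `sm`: same function names as Statistics/Base,
# but each one skips missing values before reducing.
module sm
    using Statistics: Statistics
    mean(x) = Statistics.mean(skipmissing(x))
    sum(x) = Base.sum(skipmissing(x))
end

x = [1, missing, 3]

# Outside the block, `mean` is still Statistics.mean and propagates missing.
outside = mean(x)

# Inside the block, `mean` and `sum` are local names bound to the sm versions.
inside = let mean = sm.mean, sum = sm.sum
    (mean(x), sum(x))
end
```

The rebinding is visible at the top of the block, which addresses part of the readability concern, although a reader who lands in the middle of a long block still sees a plain mean(x).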


I also think that macros are the best solution. @pdeffebach Don’t you think that better macros in DataFramesMeta would greatly alleviate the pain of working with missing values? Ideally one could add at the top of a block @passmissing, @skipmissing or a new macro which would combine both behaviors, and everything would work.


What about skip(cor)(X), where X is a matrix like this?

3×3 Matrix{Union{Missing, Int64}}:
 1         2          missing
 3          missing  4
  missing  5         6

Then you have to decide if you want to use complete observations or pairwise-complete observations.
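To illustrate the two choices, here is a rough sketch of both deletion strategies. The function names `cor_complete` and `cor_pairwise` are made up for this post, and the data is a different, made-up matrix, because the 3×3 example above has no complete rows at all.

```julia
using Statistics

# Complete observations: keep only rows that have no missing entry.
function cor_complete(X::AbstractMatrix)
    rows = [all(!ismissing, r) for r in eachrow(X)]
    return cor(Float64.(X[rows, :]))
end

# Pairwise-complete observations: for each pair of columns,
# keep the rows that are non-missing in both of those columns.
function cor_pairwise(X::AbstractMatrix)
    p = size(X, 2)
    C = Matrix{Float64}(undef, p, p)
    for i in 1:p, j in 1:p
        rows = .!ismissing.(X[:, i]) .& .!ismissing.(X[:, j])
        C[i, j] = cor(Float64.(X[rows, i]), Float64.(X[rows, j]))
    end
    return C
end

# Made-up data: rows 1, 4 and 5 are complete.
X = [1.0 2.0     1.0
     2.0 4.0     missing
     3.0 missing 2.0
     4.0 8.0     5.0
     5.0 9.0     4.0]

Cc = cor_complete(X)   # every entry is based on rows 1, 4, 5
Cp = cor_pairwise(X)   # entry (1, 2) is based on rows 1, 2, 4, 5
```

The two matrices generally differ, which is why a bare skip(cor)(X) would have to pick and document one of them.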

I agree in the sense that @passmissing (or just @pass) and @skipmissing (or just @skip) would be useful. But these would be thin wrappers around passmissing etc., rather than the general find-and-replace macro people are suggesting.

This is very easy to do in DataFramesMeta because DataFrames transformations have a single function as their entry point.

I don’t think that a @sm begin ... end macro is feasible. Clearly we don’t want passmissing to wrap every single function in a block, or to use skipmissing for everything. Outside of a few functions like mean and cor, it does not seem easy for a macro to determine which functions it should rewrite.

Have it be an optional, explicit argument to the macro:

@sm [:mean, :cor] begin
    ...
end

## or get the default behavior, which would be documented

@sm begin
    ...
end

The biggest remaining issue is the question of complete versus pairwise-complete observations.
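A rough sketch of what the explicit-list form could look like, to show it is mechanically doable. Everything here is an assumption: `@sm` and the helper `wrap_calls` are hypothetical, and the rewrite simply wraps the first positional argument of each listed function in skipmissing, so it does nothing sensible for multi-argument functions like cor.

```julia
using Statistics

# Recursively rewrite `f(x, rest...)` into `f(skipmissing(x), rest...)`
# for every `f` in `names`. Symbols, literals and line nodes pass through.
wrap_calls(ex, names) = ex
function wrap_calls(ex::Expr, names)
    args = Any[wrap_calls(a, names) for a in ex.args]
    if ex.head === :call && length(args) >= 2 && args[1] in names
        # In `f(x; kw...)` the keyword block is stored before the positionals.
        haskw = args[2] isa Expr && args[2].head === :parameters
        pos = haskw ? 3 : 2
        if pos <= length(args)
            args[pos] = :(skipmissing($(args[pos])))
        end
    end
    return Expr(ex.head, args...)
end

# Hypothetical `@sm`: the first argument must be a literal vector of
# quoted symbols, e.g. `[:mean, :sum]`.
macro sm(names, block)
    syms = Symbol[q.value for q in names.args]
    return esc(wrap_calls(block, syms))
end

x = [1, missing, 3]

result = @sm [:mean, :sum] begin
    mean(x) + sum(x)
end
```

Note that the rewrite is purely syntactic: it matches the bare names mean and sum wherever they are called in the block, whatever they happen to be bound to.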

Yet it seems that spreadmissings would cover a lot of use cases, right? We could have @spreadmissings which would wrap all functions applied to data frame columns in spreadmissings. Of course there are exceptions like quantile(col, [0.1, 0.9]) but these are relatively rare.

Yes. I need to finish the PR. Hopefully that will make things a lot easier.


I think it would really be a shame to converge on a “solution” that does not include making quantile nice. The main motivating functions for me are moments, quantiles, and cov/cor.

Quantiles are super special: quantile is one of the rare statistical functions that takes both a vector of data and a vector of quantile positions. But we could work around this in various ways, including passing a tuple instead of a vector of positions, so it’s not too much of a problem.

Very easy. With the current implementation of spreadmissings, we would just need a Ref or some other wrapper to protect the [.1, .9].
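To show the idea of protecting an argument, here is a toy sketch. It is not the spreadmissings implementation from the PR; `spreading` and `unwrap` are hypothetical names, and the rule "every plain vector is data, everything else is passed through" is an assumption made for the example.

```julia
using Statistics

# A Ref-wrapped argument is unwrapped and passed through untouched.
unwrap(a::Base.RefValue) = a[]
unwrap(a) = a

# Hypothetical sketch: vectors are treated as data and filtered jointly,
# anything else (including Ref-protected vectors) is left alone.
# Assumption: all data vectors have the same length and 1-based indices.
function spreading(f)
    return function (args...)
        vecs = [a for a in args if a isa AbstractVector]
        n = length(first(vecs))
        keep = [all(v -> !ismissing(v[i]), vecs) for i in 1:n]
        newargs = map(args) do a
            a isa AbstractVector ? collect(skipmissing(a[keep])) : unwrap(a)
        end
        return f(newargs...)
    end
end

x = [1.0, missing, 3.0, 4.0, 5.0]

# The Ref keeps the positions from being treated as a data column.
q = spreading(quantile)(x, Ref([0.0, 1.0]))
```

Without the Ref, the two-element positions vector would be treated as a second data column and the length mismatch would break the call.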

For what it’s worth (probably not much), I also feel the pain shared by several others of having to use skipmissings to the point of avoiding Julia for data work largely because of that. I also have the impression that the place to address this is at the level of DataFramesMeta or whatever other package people (may) use to actually manipulate their data (Query, etc), as long as this is technically feasible.


I agree, which is why I would personally tend to stick with the following.

using SM: SM

# ...

SM.mean(x)

Earlier I showed how one might use a module to contain the modified definition close to the call site, and I would still highly recommend that. For emphasis, I will do it again in this context.

module MySkipMissing
    using Missings.SkipMissing: mean
    function mean_squared(x)
        mean(x)^2
    end
end
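A self-contained variant of the same pattern, for anyone who wants to try it. The module `SM` below is a hypothetical stand-in for the skip-missing module imported above, defined inline so the snippet runs on its own.

```julia
# Hypothetical provider module: `SM.mean` is a separate function from
# `Statistics.mean` that skips missing values.
module SM
    using Statistics: Statistics
    mean(x) = Statistics.mean(skipmissing(x))
end

# Only code inside this module sees the skipping `mean`.
module MySkipMissing
    using ..SM: mean
    mean_squared(x) = mean(x)^2
end

using Statistics

a = MySkipMissing.mean_squared([1, missing, 2])  # uses SM.mean
b = mean([1, missing, 2])                        # Statistics.mean, propagates missing
```

The import line at the top of MySkipMissing is the single place that says which mean is in use for everything in that module.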

At the end of the day though, I would still be willing to accommodate the behavior even if I do not agree with it. For quick experiments, I could see why someone might want to replace the aggregation methods quickly.

In the larger scope, what I want to demonstrate here is how flexible Julia is and the range of preferences that Julia can accommodate in a modular fashion.


This will be an unpopular solution, but as someone who has designed and worked with many databases over the years, I propose incorporating a “standard” -missing- value while at the same time allowing each designer/user to establish their own substitution on a per-program or per-function basis. Anyone not using the “standard” is responsible for how it is handled in their code. Further, also allow a Julia-wide version of -missing-: a parameter that users can define once in their environment and that stays set from then on. That person is responsible for propagating that value in any code or function where it is used. Their own “environment variable”, as it were.


I strongly disagree with the idea that you always know whether there are missings in your data.

Furthermore, it’s good data science to carefully document how you deal with missing data. It’s so annoying reading papers that clearly have some strategy to handle missing data but don’t make it explicit: you are then stuck guessing said strategy whenever you want to reproduce something.


This,

module MySkipMissing
    using Missings.SkipMissing: mean
    function mean_squared(x)
        mean(x)^2
    end
end

MySkipMissing.mean_squared([1, missing, 2])

seems needlessly verbose and complex compared to this:

using Missings

function mean_squared(x)
    smean(x)^2
end

mean_squared([1, missing, 2])

But ultimately we’re just bikeshedding now. :slight_smile:


I don’t think that adding macros or shortening function names would solve the problem. In fact, I have a file that I load that implements some of the suggestions.

Of all the options, the only one that makes sense to me is adding a new type. The reasons:

  1. My concern is not about missing as such. Sometimes I do need to be careful about how missing is treated. Rather, my concern is that missing isn’t ideal for some types of work, where it hinders both writing and reading the code.
    Ideally, you sometimes want to say: “these missings in particular are uninformative, so I don’t want to keep indicating this when I write x .> y, take the mean, cor, or whatever.”

This also leads to clean code, because you tell the reader “the missings in this scenario should be ignored, and I’m certain that this can be done throughout the code”.

  2. Other options are troublesome, in particular for new users. When you start to add smean or short names like that, someone who reads the code ends up wondering “what does smean do?”. I consider directly shadowing mean even more problematic, because you don’t know which mean is being used unless you check whether a corresponding package was loaded.

It is in this sense that I would have preferred ignoring missing to be the default, while still having both behaviors as options. This would mean isequal exchanging roles with ==, or having keepmissing instead of skipmissing.

Upon reflection after all the discussion, I think the best solution is adding a new type. This extends the possibilities, without restricting the ones we have.
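As one possible reading of "a new type", here is a minimal sketch using a wrapper around the data rather than a new kind of missing value. The names `IgnoreMissing` and `exceeds` are hypothetical, and treating a missing entry as false in comparisons is just one convention among several.

```julia
using Statistics

# Hypothetical wrapper, not an existing type: wrapping a vector declares once
# that its missing entries are uninformative and should be ignored.
struct IgnoreMissing{V<:AbstractVector}
    data::V
end

# Reductions skip missing entries without any annotation at the call site.
Statistics.mean(v::IgnoreMissing) = mean(skipmissing(v.data))
Base.sum(v::IgnoreMissing) = sum(skipmissing(v.data))

# Elementwise comparison that treats a missing entry as `false`
# instead of propagating it (an assumed convention for this sketch).
exceeds(v::IgnoreMissing, t) = [!ismissing(x) && x > t for x in v.data]

x = IgnoreMissing([1, missing, 3])

mx = mean(x)
sx = sum(x)
gx = exceeds(x, 2)
```

The decision to ignore missings is then recorded where the data is created, and a plain Vector with missing keeps the current propagating behavior.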
