Why are missing values not ignored by default?

Ah good point. Then we need something like this:

struct MissingSkipper{T}
    f::T
end

(s::MissingSkipper)(args...; kwargs...) =
    (s.f)(skipmissings(args...)...; kwargs...)

Maybe that was exactly @danielwe’s point and I misunderstood, sorry.

This will have the behavior you want by default, but we can still overload functions on the SkipMissings type if necessary. And if we want we can still overload (s::MissingSkipper{typeof(quantile)})() for example to have quantile? add skipmissing only to the first argument.

Yes, see here


I’ve only ever used skipmissings to exclude an iteration when any of the iterables has a missing at that position. Is there a way for it to consider only some of them? For example, with x = [1, 2, missing, 4]; y = [1, 2, 3, missing]; z = [1, missing, 3, 4], considering only x and y would result in [1, 2], [1, 2], [1, missing].

No, there is no way to do that at the moment. I’m not sure how I would implement that, but please file an issue with motivation at Missings.jl.
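In the meantime, one manual workaround is to build a mask from only the arrays that should decide what gets skipped, and then index every array with that mask. This is just a sketch using Base; the helper name `skip_jointly_on` is made up for illustration and is not part of Missings.jl, and it assumes all arrays have the same length.

```julia
# Hypothetical helper (not part of Missings.jl): drop position i from every
# array in `all_arrays` whenever any array in `deciding` is missing at i.
function skip_jointly_on(deciding::Tuple, all_arrays::Tuple)
    keep = [all(a -> !ismissing(a[i]), deciding) for i in eachindex(first(deciding))]
    return map(a -> a[keep], all_arrays)
end

x = [1, 2, missing, 4]
y = [1, 2, 3, missing]
z = [1, missing, 3, 4]

# Only x and y decide which positions survive; z is filtered along with them.
xs, ys, zs = skip_jointly_on((x, y), (x, y, z))
# xs and ys hold 1, 2; zs holds 1, missing
```

Unlike skipmissings this returns copies rather than lazy iterators, so it is fine for small data but not a replacement for a proper implementation.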

Absolutely. But I don’t think it’s a good idea to have each function define its own way of handling SkipMissing. Let me clarify what I meant by the cor?(x?, y?) syntax by modifying your example from earlier:

struct MissingSkipper{T}
    f::T
end

var"'ˢ"(x::AbstractArray) = skipmissing(x)
var"'ˢ"(f::Base.Callable) = MissingSkipper(f)

function (s::MissingSkipper)(args...; kwargs...)
    skipargs = skipmissings((a.x for a in args if a isa SkipMissing)...)
    allargs = let skipiter=iterate(skipargs)
        ntuple(length(args)) do i
            a = args[i]
            if a isa SkipMissing
                # Use a fresh name here: assigning to `s` would rebind the MissingSkipper itself
                skipped, state = skipiter
                skipiter = iterate(skipargs, state)
                return skipped
            end
            return a
        end
    end
    return s.f(allargs...; kwargs...)
end

The point is that, yes, you attach SkipMissing as a flag on the arguments, but its regular semantics remain as-is and you don’t try to add clever overloads to particular functions. Instead, the MissingSkipper function wrapper is now used as the one canonical way to modify any function to do joint instead of separate skipping on any arguments that are of type SkipMissing, giving you shorthand syntax like f?(x?, y, z?).

I don’t think this is a good idea, however. Better to apply to all arguments, like broadcasting, and use manual escape hatches for arguments that shouldn’t participate, like g(_x, _z) = f(_x, y, _z); g?(x, z), or maybe special-casing Ref such that you can do f?(x, Ref(y), z).

I see, I misunderstood what you meant, my bad.

I just want to point out that realistically we would need two different symbols for this. We can only pick one generic definition of ?, either

?(itr) = skipmissing(itr)

or

?(f) = (args...; kwargs...) -> f(skipmissings(args...)...; kwargs...)

We can’t have both.

I haven’t thought through the details yet, but special casing Ref might actually be a decent approach. Or we could use some other wrapper with a short name, like Ign (for “ignore”). Or maybe even a symbol, like this, ↑(y).

I think ignoring arguments for missing-skipping is less common than including arguments, so it makes sense to only use the extra verbosity when ignoring arguments. That way many calls would only need one extra symbol:

mean?(x)
sum?(x)
cor?(x, y)

Here’s a nice little syntactic goodie:

struct NoSkip{T}
    x::T
end

const ¬ = NoSkip

So if you have the function foo,

foo(x, y, z) = z * (mean(x) + mean(y))

then you can write this:

a, b, c = # ...
foo?(a, b, ¬c)

The ¬ symbol lets the ? know not to include that argument in the missing skipping.
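If you want to play with the idea without any new syntax, here is a rough sketch of what the wrapper could do. `jointskip(f)` is a hypothetical stand-in for `f?`, and `NoSkip(c)` is written out instead of `¬c`. It assumes the participating arguments are equal-length vectors and that at least one argument participates; it also indexes into copies rather than using lazy iterators.

```julia
using Statistics

struct NoSkip{T}
    x::T
end

# Hypothetical stand-in for the proposed `f?` syntax; not part of Base or Missings.jl.
function jointskip(f)
    return function (args...; kwargs...)
        # Arguments wrapped in NoSkip do not take part in the skipping.
        participating = [a for a in args if !(a isa NoSkip)]
        # Keep position i only if no participating argument is missing there.
        keep = [all(a -> !ismissing(a[i]), participating)
                for i in eachindex(first(participating))]
        # Unwrap NoSkip arguments, filter the rest with the shared mask.
        newargs = map(a -> a isa NoSkip ? a.x : a[keep], args)
        return f(newargs...; kwargs...)
    end
end

foo(x, y, z) = z * (mean(x) + mean(y))

a = [1, missing, 3]
b = [2, 4, missing]
c = 10

result = jointskip(foo)(a, b, NoSkip(c))  # only position 1 survives: 10 * (1 + 2)
```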

What about ¿ for NoMissings? (which is AltGr+? on my keyboard)

Unfortunately, the discussion around cor?(x, y) is moot if Base devs won’t allow a cor method that accepts iterators for x and y. We could define ? in terms of a version of Missings.skipmissings that returns arrays instead of iterators, but that feels clunky and wasteful to me.

Or I guess we could add yet another symbol to signal that the input to a function argument needs to be collected. It’s kinda goofy, but not too bad. Here’s the setup:

struct NeedsArray{T}
    x::T
end

const ⋆ = NeedsArray

And here’s what it would look like in action:

cor?(⋆x, ⋆y)

The problem is that if x and y have missing values in different places, then skipmissing on each one produces unaligned sequences. You really want a cor that takes an iterator of tuples, plus a filter that skips a tuple if either of its values is missing.

  cor(x::AbstractVector)


  Return the number one.

already exists, for some reason.


Yes, that’s what Missings.skipmissings does, as has been discussed a few times already. And my pseudo implementations above for ? use skipmissings (not skipmissing).
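To make the difference concrete, here is a small sketch using only Base and Statistics. The joint version is hand-rolled with zip and a filter, as a stand-in for what skipmissings gives you; the data are made up for illustration.

```julia
using Statistics

x = [1, 2, missing, 4, 5]
y = [2, missing, 6, 8, 10]   # y == 2x wherever both are observed

# Separate skipping: both results have length 4, but the pairs no longer line up.
xs_sep = collect(skipmissing(x))    # 1, 2, 4, 5
ys_sep = collect(skipmissing(y))    # 2, 6, 8, 10
cor_separate = cor(xs_sep, ys_sep)  # below 1, since pairs like (2, 6) never occurred

# Joint skipping: keep an observation only if neither value is missing.
obs = [(a, b) for (a, b) in zip(x, y) if !ismissing(a) && !ismissing(b)]
cor_joint = cor(first.(obs), last.(obs))  # uses the pairs (1, 2), (4, 8), (5, 10)
```

With joint skipping the surviving pairs lie exactly on a line, so the correlation comes out as 1 (up to floating point), while the separately skipped version silently gives a different answer.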


But, happily, there is no cor(::Any) method. :slight_smile:

EDIT: Although I guess that would preclude using cor([(1, 2), (3, 4), (5, 6)]). :sob:

Aha, dangerous to have two functionalities only separated by pluralization.


That’s really not gonna fly. We can’t have cor((x,y)) = cor(collect.(skipmissings(x,y))...) and cor([x,y]) = 1.0. It would be a huge trap.

I edited my reply as you were responding. However, to clarify, I’m not advocating for

cor((x, y)) = cor(collect.(skipmissings(x, y))...)

I was just suggesting a cor method that takes an iterator of observations (with no missings), e.g.

itr = ((1, 4), (3, 2), (5, 8), (7, 6))
cor(itr)

But sadly that wouldn’t work so well, since as you pointed out we would get a different result if we wrote it like this:

itr = [(1, 4), (3, 2), (5, 8), (7, 6)]
cor(itr)

The skipmissings part that you are showing would not be part of the definition of cor—it would be opted into by something, e.g. cor(magic(x, y)), or some other syntax.

EDIT: I guess I haven’t advocated for cor(itr) here; that discussion occurred on GitHub. It was @dlakelan who mentioned cor taking an iterator of tuples.

I’d like to express some perplexity about the special notation discussed for skipmissing. I’m a data scientist, I have to write and read code where missings should be skipped a lot of times*.

One thing that is good about the current verbose solution is that it is extremely readable. mean(skipmissing(x)) doesn’t leave any doubt as to its meaning. And, in my experience at least, the time taken to write “skipmissing” is negligible compared to the time it takes to understand a shorter but opaque solution with postfix operators and ambiguous symbols (there is no strong reason why “?” should skip missings and not, for example, allow a function to fail safely without erroring, or anything else).

If the advantage is some seconds spared while writing code, and the drawback is a loss of readability, I wouldn’t support the change.

* Again, in my experience, exactly what to do with missings is rarely the same. Using things like subset and writing the behavior explicitly is a great way to document the code.

I respectfully disagree that you know when there are missing values. E.g. after any data transformation/merge/pivot, missing values can creep in. IMO the comprehensive management of NA (missing, Not Available) in the stat language R has long been one of its amazing features.
The standard there is generally to propagate missing/NAs by default, and optionally ignore them. Thanks.

> sum(c(1, NA, 2))
[1] NA
> sum(c(1, NA, 2), na.rm = TRUE)
[1] 3
> mean(c(1, NA, 2))
[1] NA
> mean(c(1, NA, 2), na.rm = TRUE)
[1] 1.5
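For comparison, current Julia behaves the same way by default: missing propagates, and you opt in to ignoring it, just with skipmissing instead of a keyword. A quick sketch with Base and Statistics:

```julia
using Statistics

v = [1, missing, 2]

total_default = sum(v)                # missing (propagates, like R's default)
total_skipped = sum(skipmissing(v))   # 3
mean_default  = mean(v)               # missing
mean_skipped  = mean(skipmissing(v))  # 1.5
```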