Why are missing values not ignored by default?

We can solve the quantile?(x, 0.1) issue by making things even more controversial—all we have to do is add underscore currying:

quantile(_, 0.1)?

Or you could attach ? to the arguments you want to participate in skipmissings:

cor(x?, y?)
quantile(x?, 0.1)

Or why not combine currying and skipmissings in one fell swoop?

quantile(?, 0.1)

enabling

combine(df, [:x] => quantile(?, 0.1))
combine(df, [:x, :y] => cor(?, ?))
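
For reference, here is roughly what I'd expect those two forms to mean in today's Julia. This is only a sketch: q10 and skipjoint are made-up names, and I'm assuming cor(?, ?) should drop an index when either input is missing there, so the two vectors stay aligned (Missings.jl's skipmissings does something in this spirit).

```julia
using Statistics

# quantile(?, 0.1) read as a curried, missing-skipping function
q10 = v -> quantile(skipmissing(v), 0.1)

# Hypothetical helper for cor(?, ?): drop every index where any input is
# missing, then call f on the remaining (still aligned) vectors.
function skipjoint(f, vs::AbstractVector...)
    keep = [!any(ismissing, row) for row in zip(vs...)]
    f((collect(skipmissing(v[keep])) for v in vs)...)
end

x = [1.0, 2.0, missing, 4.0]
y = [2.0, missing, 6.0, 8.0]

q10(x)                # 10% quantile of [1.0, 2.0, 4.0]
skipjoint(cor, x, y)  # correlation of [1.0, 4.0] and [2.0, 8.0]
```

Note that the single-argument case only needs skipmissing, while the two-argument case needs the joint filter. That difference is part of why one symbol for both is tricky.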

That would be nice. But given the difficulty of getting underscore currying merged, it seems unlikely we could ever get cor(?, ?) merged…

Also, if you want to directly apply mean(?) to a vector, rather than passing it to combine, it would look like this:

mean(?)(x)

which looks slightly goofy to me. :thinking:

It might be better to keep currying and skip-missing as orthogonal features.

Yeah, I think the curry-and-skip would only make sense as an extra convenience if you already have both underscore currying like mean(_) and per-argument skipmissing without currying like mean(x?), so typing mean(?)(x) would technically work but make about as much sense as mean(_)(x). And on that note, a more intuitive syntax for combined curry-and-skip might be mean(_?).

Probably not much hope of having this in the base language, but if the parser can accept it there could be a macro @? in Missings.jl?

I think that smean, ssum, etc. are not useful. You can always define those shortcuts yourself, but that’s more of a patch and non-Julian, as someone indicated above.

But the idea of ? sounds appealing. I think the solution should handle missing, and hence include operators. One application of this is comparison operators in Boolean indexing. Eg:


df[df.col1 .< 0 .&& df.col2 .< 0.5, :] # not possible with missing

df[isless.(df.col1, 0) .&& isless.(df.col2, 0.5), :] # what you need now

Functions like isless and isequal are inconvenient, not only because they deviate from how you normally write logical comparisons, but also because they don’t cover all the possible comparisons, so you have to define your own implementations for cases like “less than or equal” (Julia has isgreater, which is unexported and isn’t documented, not sure why).

So it’d be really nice to have a ? operator that works like this:


df[df.col1 .<? 0 .&& df.col2 .<? 0.5, :]

#equivalent to

df[isless.(df.col1,0) .&& isless.(df.col2, 0.5), :]

The behavior of isless is such that:


x = [1, missing, 2]

y = [0, missing, missing]

isless.(x,y)

3-element BitVector:
  0
  0
  1

so, I think it should replicate a behavior like this.

I like the symbol ? because it’s short and already used for missings in terms like Float64?.

With comparison operators, it gets complicated again. You assume that one always wants to coalesce missing to false because the result will be used in a boolean indexing operation where you’re not interested in missings. But what if my selector is more easily expressed as a negation, or a union instead of an intersection? For example, if I want the complement of your selection (but still skipping the missings), I might try df[.!((df.col1 .<? 0) .& (df.col2 .<? 0.5)), :] or df[(df.col1 .>=? 0) .| (df.col2 .>=? 0.5), :], but neither is correct.

The general solution is to wrap the entire expression in coalesce, like df[coalesce.((df.col1 .> 0) .| (df.col2 .> 0.5), false), :]. The parentheses matter, because .| binds tighter than .> does. I think this is what you’d want to find a shorthand syntax for, rather than having modified comparison operators.
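
To make the difference concrete, here is a small sketch on plain vectors instead of a data frame (col1 and col2 are made-up stand-ins for df.col1 and df.col2):

```julia
col1 = [-1.0, missing, 1.0, -2.0]
col2 = [0.2, 0.1, missing, 0.9]

# Let missing propagate through the whole predicate, coalesce once at the end.
pred = (col1 .< 0) .& (col2 .< 0.5)   # [true, missing, false, false]
sel  = coalesce.(pred, false)         # the selection
comp = coalesce.(.!pred, false)       # its complement, unknown rows dropped

# Coalescing per comparison (which is effectively what isless does here)
# goes wrong under negation: row 2 has an unknown predicate but is selected.
bad = .!(isless.(col1, 0) .& isless.(col2, 0.5))
```

Here sel picks only row 1 and comp picks rows 3 and 4, while bad also picks row 2. Row 3 ends up in comp even though col2 is missing there, because false & missing is false under three-valued logic: the predicate is known to fail whatever the missing value is.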

I think the best way to handle filtering a dataframe is to use subset (from DataFrames) or @rsubset (from DataFramesMeta), since they drop rows where the predicate returns missing. Using subset, this would be

subset(
    df,
    :col1 => ByRow(<(0)),
    :col2 => ByRow(<(0.5));
    skipmissing=true
)

The syntax with @rsubset is pretty nice:

@rsubset(df, :col1 < 0, :col2 < 0.5)

It’d have the same problem as in Julia Base now, right? If I understood well:

import Base: isgreater

.!(isless.(x,3) .&& isless.(x,3) .&& (!isequal).(x,3) .&& (!isequal).(y,3))
3-element BitVector:
 0
 1
 0

(isgreater.(x,3) .&& isgreater.(x,3))
3-element BitVector:
 0
 0
 0

Not entirely sure about your example code, but the point is that it’s incorrect to coalesce to false per comparison, because that false may be flipped to a true by a subsequent negation or union, so you end up selecting a row that contains missing after all.

julia> x, y = [0.0], [missing];

julia> isless.(x, 1.0) .| isless.(y, 1.0)
1-element BitVector:
 1

You need to propagate missing through your entire boolean indexing expression, and then coalesce missing to false. But as @CameronBieganek showed, DataFrames(Meta) already provides syntax for this that’s nicer than direct indexing.


I’m a big fan of ? in identifiers myself. I think f? makes the most sense for predicates, but I could see ?? being a bit like BangBang.jl’s !! (just as ! relates to !!, you could say that ?? is used to indicate filtering with the missing predicate) but used for the “missing skip” functions we’re talking about here.


Operators won’t be covered and only a bunch of arbitrary functions will be included. So, in my opinion, maybe the best solution is to leave it as is. Otherwise, we run the risk of making code in DataFrames confusing, without actually addressing the issue.

Nonetheless, it could be good to at least suggest in the DataFrames documentation some convention to denote functions that skip missing values, such as ❗ at the end of a function name (tab completion \:exclamation:). This would be similar to !, in the sense of saying “be careful that you’ll mutate one of the arguments”.

This at least unifies notation, and so you can easily recognize what the person writing the code does when you see (the symbol below should appear in red)

transform!(df, :x => mean❗ => :x_mean)
transform!(df, :x => share❗ => :x_mean)

Hmm, allowing ? in identifiers seems like a waste of good syntax, in my opinion. :slight_smile:


Not a data scientist here!

I find your example peculiar: why is the first mean NaN? And how can it be fixed with skipmissing? (None of the values are really missing, having been generated by rand…)

Edit: OK, now this I get.


julia> sum(x)
Inf

julia> sum(x) / length(x)
NaN

This I don’t get:

julia> sum(skipmissing(x))
Inf

julia> length(skipmissing(x))
ERROR: MethodError: no method matching length(::Base.SkipMissing{Vector{Float16}})
Closest candidates are:

How does it follow that the mean is zero?

I thought there was a typo but turns out this is one of the cases where isless isn’t quite <. Using NaNs for consistency:

julia> isless.([1, NaN, 2], [0, NaN, NaN]) |> println
Bool[0, 0, 1]

julia> (<).([1, NaN, 2], [0, NaN, NaN]) |> println
Bool[0, 0, 0]

julia> isless(2, NaN), isless(NaN, 2) # for sorting
(true, false)

julia> <(2, NaN), <(NaN, 2) # IEEE-754
(false, false)

That’s the thing though, isless already doesn’t do that for NaN or missing. The propagated missing alone cannot determine whether it should be replaced with true or false to let the operation mimic isless; you need the order of the inputs.
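
The same comparison with missing instead of NaN, as a quick sketch (v is just a made-up vector). The point is that isless exists to give sort a total order, so it has to put missing somewhere:

```julia
v = [2.0, missing, NaN, 1.0]

sorted = sort(v)        # sort uses isless by default

1.0 < missing           # missing: the comparison is unknown
isless(1.0, missing)    # true: missing is ordered after everything else
isless(missing, 1.0)    # false
isless(NaN, missing)    # true: even NaN sorts before missing
```

So the true in isless(x, missing) depends on which side the missing is on, which is exactly the information you no longer have once < has returned missing.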

So unfortunately for both sides of this particular subtopic: 1) there isn’t 1 way to impute across all operations, so it’s a poor fit for 1 general ? operator, and 2) you can’t always handle missing after propagation. Implementing these 2 behaviors with higher-order functions:

julia> function missfalse(op, a, b)
         val = op(a, b)
         ismissing(val) ? false : val
       end
missfalse (generic function with 1 method)

julia> function missbigger(op, a, b)
         # isless-like, can't do after op propagates
         if ismissing(a) return false end
         if ismissing(b) return true end
         op(a, b)
       end
missbigger (generic function with 1 method)

julia> missfalse.(<, [1, missing, 2], [0, missing, missing]) |> println
Bool[0, 0, 0]

julia> missfalse.(+, [1, missing, 2], [0, missing, missing]) |> println
Integer[1, false, false]

julia> missbigger.(<, [1, missing, 2], [0, missing, missing]) |> println
Bool[0, 0, 1]

julia> missbigger.(+, [1, missing, 2], [0, missing, missing]) |> println
Integer[1, false, true]

Though since isless and < are different by design, I wouldn’t recommend modifying < to resemble isless.

Don’t use isless. isless is not the proper tool for handling missing data. isless was introduced into the discussion above as an attempt to handle missing propagation, but isless was never meant to be a tool for handling missing propagation. The general approach is

  1. Let missing propagate as far as possible.
  2. When you can’t let it propagate any farther, use one of the following tools:
    • coalesce
    • ismissing
    • skipmissing
    • DataFrames.subset
    • DataFramesMeta.@subset
    • etc (but not isless!)
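
A minimal sketch of those two steps using only Base (the vectors and variable names are made up for illustration):

```julia
x = [1.0, missing, 3.0]
y = [10.0, 20.0, missing]

# Step 1: let missing propagate through the computation.
z = x .+ y                      # [11.0, missing, missing]

# Step 2: decide what to do with it only where you have to.
total  = sum(skipmissing(z))    # ignore missings in a reduction
filled = coalesce.(z, 0.0)      # substitute a default value
nmiss  = count(ismissing, z)    # or ask about them explicitly
```

Which tool fits depends on the question being asked, which is the reason for postponing the decision to the end.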

One could say the same thing about ! :upside_down_face:. I’m not sure if it would actually conflict with the ternary syntax (it feels like it probably would, but on the chance it doesn’t this seems worth mentioning), but I wonder if it would be possible to have a ? unary operator like ! if it was “released to the identifiers”. It would be cool to define ?(::Type{T}) = Union{Missing, T} and ?(f::Function) = (args...) -> f(map(skipmissing, args)...).

Then one could do things like:

function myrepeat(thing::String, repeats::?Int)
    [thing for _ in 1:coalesce(repeats, 1)]
end

?mean([1, missing, 2])

and also enable predicate? naming at the same time.
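
Since ? can't be defined as an operator today, here is the same idea sketched with ordinary functions. The names maybe and skipping are made up and stand in for the two proposed ? methods:

```julia
using Statistics

# Hypothetical stand-in for ?(::Type{T})
maybe(::Type{T}) where {T} = Union{Missing, T}

# Hypothetical stand-in for ?(f::Function)
skipping(f) = (args...) -> f(map(skipmissing, args)...)

function myrepeat(thing::String, repeats::maybe(Int))
    [thing for _ in 1:coalesce(repeats, 1)]
end

myrepeat("ab", missing)          # repeats falls back to 1
myrepeat("ab", 3)
skipping(mean)([1, missing, 2])  # mean of 1 and 2
```

One caveat: map(skipmissing, args) skips missings in each argument independently, so for something like cor the two inputs could end up misaligned or with different lengths.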

Infix operators go between arguments, as in a+b, and must be binary. You’re talking about prefix unary operators, and they have lower precedence than function call syntax (seems to go ^ < :: < calls < .), so parentheses are necessary to avoid errors or mistaken function calls. For example, !isnan(x) does boolean negation !(isnan(x)), not (!isnan)(x); you can check with Meta.@lower.
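
A quick way to see the grouping, using Meta.parse here rather than Meta.@lower since the parsed expression is easier to read:

```julia
ex = Meta.parse("!isnan(x)")

ex.head       # :call
ex.args[1]    # :!
ex.args[2]    # :(isnan(x)), so the call is made first and ! is applied to it

x = 1.0
!isnan(x)     # parsed as !(isnan(x))
(!isnan)(x)   # negates the function first, then calls the result
```

For ! both forms give the same answer, but a prefix ? that wraps the function would only make sense in the second, parenthesized form.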

Roughly:

julia> foldl(+, x) / length(x)
Float16(0.0)

Ah yea, oops yea I meant a “unary ? operator”. Pity the operator precedence doesn’t work out nicely here, I suppose the corrected example would actually be (?mean)([1, missing, 2]) which isn’t as nice, but still seems decent to me.

Maybe the other way around then, ?name for a name (although that’s maybe a bit weird given the existing naming rules) and name? for a postfix operator which will come before a call name?(args...).