Why are missing values not ignored by default?

CameronBieganek · November 30, 2023, 10:38pm

We can solve the quantile?(x, 0.1) issue by making things even more controversial—all we have to do is add underscore currying:

quantile(_, 0.1)?

danielwe · November 30, 2023, 10:44pm

Or you could attach ? to the arguments you want to participate in skipmissings:

cor(x?, y?)
quantile(x?, 0.1)

Or why not combine currying and skipmissings in one fell swoop?

quantile(?, 0.1)

enabling

combine(df, [:x] => quantile(?, 0.1))
combine(df, [:x, :y] => cor(?, ?))
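
For comparison, here is a rough sketch of what the proposed quantile(?, 0.1) and cor(?, ?) might expand to with syntax that exists today. The helper names skipq and skipcor are made up for illustration and are not part of Base, Statistics, or Missings.jl; skipcor assumes that a pair should be dropped whenever either element is missing.

```julia
using Statistics

# Hypothetical stand-in for `quantile(?, p)`: curry on p and skip missings.
skipq(p) = x -> quantile(skipmissing(x), p)

# Hypothetical stand-in for `cor(?, ?)`: drop every pair in which either
# element is missing, then correlate the remaining values.
function skipcor(x, y)
    keep = .!(ismissing.(x) .| ismissing.(y))
    return cor(Float64.(x[keep]), Float64.(y[keep]))
end

q = skipq(0.5)([1, 10, missing])   # median of the non-missing values
r = skipcor([1, 2, missing, 4, 5], [2, 4, 6, missing, 10])
```

With DataFrames.jl, closures like these should be usable as combine(df, :x => skipq(0.1)) and combine(df, [:x, :y] => skipcor).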

CameronBieganek · November 30, 2023, 11:08pm

That would be nice. But given the difficulty of getting underscore currying merged, it seems unlikely we could ever get cor(?, ?) merged…

Also, if you want to directly apply mean(?) to a vector, rather than passing it to combine, it would look like this:

mean(?)(x)

which looks slightly goofy to me.

It might be better to keep currying and skip-missing as orthogonal features.

danielwe · November 30, 2023, 11:20pm

Yeah, I think the curry-and-skip would only make sense as an extra convenience if you already have both underscore currying like mean(_) and per-argument skipmissing without currying like mean(x?), so typing mean(?)(x) would technically work but make about as much sense as mean(_)(x). And on that note, a more intuitive syntax for combined curry-and-skip might be mean(_?).

Probably not much hope of having this in the base language, but if the parser can accept it there could be a macro @? in Missings.jl?

alfaromartino · November 30, 2023, 11:22pm

I think that smean, ssum, etc. are not useful. You can always define those shortcuts yourself, but it’s more of a patch and non-Julian, as someone indicated above.

But the idea of ? sounds appealing. I think the solution should handle missing, and hence include operators. One application of this is comparison operators in Boolean indexing. Eg:


df[df.col1 .< 0 .&& df.col2 .< 0.5, :] # not possible with missing

df[isless.(df.col1, 0) .&& isless.(df.col2, 0.5), :] # what you need now

Functions like isless and isequal are inconvenient, not only because they deviate from how you normally write logical comparisons, but also because they don’t cover all the possible comparisons, so you have to define your own implementations for cases like “is less than or equal” (Julia has isgreater, which is unexported and isn’t documented, not sure why).
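
As a minimal sketch of why the plain comparison cannot be used for indexing, the following uses an ordinary vector in place of a data frame column (the variable names are illustrative only). The comparison propagates missing into the mask, and a mask containing missing cannot be used as an index until missing is replaced with a Bool.

```julia
col = [-1.0, missing, 2.0]

mask = col .< 0                     # [true, missing, false]; missing propagates

# Indexing with a mask that contains missing throws an error.
threw = try
    col[mask]
    false
catch
    true
end

# Turning missing into false yields a plain Bool mask that can index.
kept = col[coalesce.(mask, false)]  # only the first element is selected
```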

So it’d be really nice to have a ? operator that works like this:


df[df.col1 .<? 0 .&& df.col2 .<? 0.5, :]

# equivalent to

df[isless.(df.col1,0) .&& isless.(df.col2, 0.5), :]

The behavior of isless is such that:


x = [1, missing, 2]

y = [0, missing, missing]

isless.(x,y)

3-element BitVector:
  0
  0
  1

so, I think it should replicate a behavior like this.

I like the symbol ? because it’s short and already used for missings in terms like Float64?.

danielwe · November 30, 2023, 11:38pm

With comparison operators, it gets complicated again. You assume that one always wants to coalesce missing to false because the result will be used in a boolean indexing operation where you’re not interested in missings. But what if my selector is more easily expressed as a negation, or a union instead of an intersection? For example, if I want the complement of your selection (but still skipping the missings), I might try df[.!(df.col1 .<? 0 .& df.col2 .<? 0.5), :] or df[df.col1 .>=? 0 .| df.col2 .>=? 0.5, :], but neither is correct.

The general solution is to wrap the entire expression in coalesce, like df[coalesce.((df.col1 .> 0) .| (df.col2 .> 0.5), false), :]. I think this is what you’d want to find a short-hand syntax for, rather than having modified comparison operators.
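
A small sketch of the difference, using plain vectors in place of the data frame columns. The function lt is a hypothetical stand-in for the proposed .<? operator (coalescing each comparison to false); it is not an existing API.

```julia
col1 = [-1.0, missing, 1.0]
col2 = [0.2, 0.2, missing]

# Hypothetical `<?`: coalesce each individual comparison to false.
lt(a, b) = coalesce(a < b, false)

# Complement of (col1 < 0 and col2 < 0.5), coalescing per comparison.
per_cmp = .!(lt.(col1, 0) .& lt.(col2, 0.5))

# Same selector, letting missing propagate and coalescing once at the end.
whole = coalesce.(.!((col1 .< 0) .& (col2 .< 0.5)), false)
```

The second row has a missing col1 and a col2 that passes its test, so whether it belongs to the complement is unknown. The per-comparison version selects it anyway, while the whole-expression version leaves it out. The third row is selected by both, because col1 already fails its test regardless of the missing col2.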

CameronBieganek · November 30, 2023, 11:40pm

I think the best way to handle filtering a dataframe is to use subset (from DataFrames) or @rsubset (from DataFramesMeta), since they drop rows where the predicate returns missing. Using subset, this would be

subset(
    df,
    :col1 => ByRow(<(0)),
    :col2 => ByRow(<(0.5));
    skipmissing=true
)

The syntax with @rsubset is pretty nice:

@rsubset(df, :col1 < 0, :col2 < 0.5)

alfaromartino · November 30, 2023, 11:49pm

It’d have the same problem as in Julia Base now, right? If I understood well:

import Base: isgreater

.!(isless.(x,3) .&& isless.(x,3) .&& (!isequal).(x,3) .&& (!isequal).(y,3))
3-element BitVector:
 0
 1
 0

(isgreater.(x,3) .&& isgreater.(x,3))
3-element BitVector:
 0
 0
 0

danielwe · December 1, 2023, 12:02am

Not entirely sure about your example code, but the point is that it’s incorrect to coalesce to false per comparison, because that false may be flipped to a true by subsequent negation or ~~intersection~~ union, so you end up selecting a row that contains missing after all.

julia> x, y = [0.0], [missing];

julia> isless.(x, 1.0) .| isless.(y, 1.0)
1-element BitVector:
 1

You need to propagate missing through your entire boolean indexing expression, and then coalesce missing to false. But as @CameronBieganek showed, DataFrames(Meta) already provides syntax for this that’s nicer than direct indexing.

tecosaur · December 1, 2023, 12:18am

I’m a big fan of ? in identifiers myself. I think f? makes the most sense for predicates, but I could see ?? being a bit like BangBang.jl’s !! (just as ! relates to !!, you could say that ?? is used to indicate filtering with the missing predicate) but used for the “missing skip” functions we’re talking about here.

github.com/JuliaLang/julia

Allow '?' in variable and function names

opened 10:51PM - 22 May 17 UTC

nsmith5

speculative parser design

Currently Julia does not support question marks inside of variable names. This w…as discussed a while ago , with support, but never pursued (#1539). The major implication of allowing this is that you can write predicate functions with a question mark. (eg. `integer?(x)` instead of `isinteger(x)`). A little more discussion can be found on [discourse](https://discourse.julialang.org/t/question-mark-in-variable-names/3836/13). I wish I could accompany this with an RFC-like pull request, but I got quite lost in the parser in my attempt. If this is pursued I really think trailing question marks should be pursued as the convention for predicates. They indicate very clearly that the function asks a question and, as previously discussed, the `is-` predicates can be awkward, leading to inconsistent usage (eg. `contains`).

alfaromartino · December 1, 2023, 12:51am

Operators won’t be covered and only a bunch of arbitrary functions will be included. So, in my opinion, maybe the best solution is to leave it as is. Otherwise, we run the risk of making code in DataFrames confusing, without actually addressing the issue.

Nonetheless, it could be good to at least suggest in the documentation of DataFrames some convention to denote functions that skip missing values, such as ❗ at the end of a function name (Tab Completion :ex). This would be similar to !, in the sense of saying “be careful that you’ll mutate one of the arguments”.

This at least unifies notation, and so you can easily recognize what the person writing the code does when you see (the symbol below should appear in red)

transform!(df, :x => mean❗ => :x_mean)
transform!(df, :x => share❗ => :x_mean)

CameronBieganek · December 1, 2023, 12:51am

Hmm, allowing ? in identifiers seems like a waste of good syntax, in my opinion.

PetrKryslUCSD · December 1, 2023, 1:00am

Not a data scientist here!

I find your example peculiar: why is the first mean NaN? And how can it be fixed with skipmissing? (None of the values are really missing, having been generated by rand…)

Edit: OK, now this I get.


julia> sum(x)
Inf

julia> sum(x) / length(x)
NaN

This I don’t get:

julia> sum(skipmissing(x))
Inf

julia> length(skipmissing(x))
ERROR: MethodError: no method matching length(::Base.SkipMissing{Vector{Float16}})
Closest candidates are:

How does it follow that the mean is zero?
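
On the length error: skipmissing returns a lazy wrapper that does not define length, so the number of non-missing entries has to be obtained some other way. A small sketch with a made-up vector (not the data from the example under discussion):

```julia
x = [1.0, missing, 3.0]

n = count(!ismissing, x)          # number of non-missing entries
s = sum(skipmissing(x))           # sum over the non-missing entries
m = s / n                         # mean of the non-missing entries

# Alternatively, materialize the wrapper first; the resulting vector has a length.
n_alt = length(collect(skipmissing(x)))
```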

Benny · December 1, 2023, 3:37am

I thought there was a typo but turns out this is one of the cases where isless isn’t quite <. Using NaNs for consistency:

julia> isless.([1, NaN, 2], [0, NaN, NaN]) |> println
Bool[0, 0, 1]

julia> (<).([1, NaN, 2], [0, NaN, NaN]) |> println
Bool[0, 0, 0]

julia> isless(2, NaN), isless(NaN, 2) # for sorting
(true, false)

julia> <(2, NaN), <(NaN, 2) # IEEE-754
(false, false)

That’s the thing though, isless already doesn’t do that for NaN or missing. The propagated missing alone cannot determine whether it should be replaced with true or false to let the operation mimic isless; you need the order of the inputs.

So unfortunately for both sides of this particular subtopic: 1) there isn’t 1 way to impute across all operations, so it’s a poor fit for 1 general ? operator, and 2) you can’t always handle missing after propagation. Implementing these 2 behaviors with higher-order functions:

julia> function missfalse(op, a, b)
         val = op(a, b)
         ismissing(val) ? false : val
       end
missfalse (generic function with 1 method)

julia> function missbigger(op, a, b)
         # isless-like, can't do after op propagates
         if ismissing(a) return false end
         if ismissing(b) return true end
         op(a, b)
       end
missbigger (generic function with 1 method)

julia> missfalse.(<, [1, missing, 2], [0, missing, missing]) |> println
Bool[0, 0, 0]

julia> missfalse.(+, [1, missing, 2], [0, missing, missing]) |> println
Integer[1, false, false]

julia> missbigger.(<, [1, missing, 2], [0, missing, missing]) |> println
Bool[0, 0, 1]

julia> missbigger.(+, [1, missing, 2], [0, missing, missing]) |> println
Integer[1, false, true]

Though since isless and < are different by design, I wouldn’t recommend modifying < to resemble isless.

CameronBieganek · December 1, 2023, 5:22am

Don’t use isless. isless is not the proper tool for handling missing data. isless was introduced into the discussion above as an attempt to handle missing propagation, but isless was never meant to be a tool for handling missing propagation. The general approach is

Let missing propagate as far as possible.
When you can’t let it propagate any farther, use one of the following tools:
- coalesce
- ismissing
- skipmissing
- DataFrames.subset
- DataFramesMeta.@subset
- etc (but not isless!)
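
The approach above can be sketched with a plain vector; the data and variable names are made up for illustration.

```julia
using Statistics

x = [1.0, missing, 3.0]

# 1. Let missing propagate through intermediate computations.
y = 2 .* x .+ 1                   # [3.0, missing, 7.0]

# 2. Handle it explicitly at the point where a concrete answer is needed.
avg    = mean(skipmissing(y))     # ignore missing entries
filled = coalesce.(y, 0.0)        # replace missing with a default value
nmiss  = count(ismissing, y)      # count the missing entries
```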

tecosaur · December 1, 2023, 5:56am

One could say the same thing about !. I’m not sure if it would actually conflict with the ternary syntax (it feels like it probably would, but on the chance it doesn’t this seems worth mentioning), but I wonder if it would be possible to have a ? unary operator like ! if it was “released to the identifiers”. It would be cool to define ?(::Type{T}) = Union{Missing, T} and ?(f::Function) = (args...) -> f(map(skipmissing, args)...).

Then one could do things like:

function myrepeat(thing::String, repeats::?Int)
    [thing for _ in 1:coalesce(repeats, 1)]
end

?mean([1, missing, 2])

and also enable predicate? naming at the same time.
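
Since ? cannot currently be defined as an operator, the same two behaviours can be approximated with ordinary names. In this sketch, Maybe and skipping are hypothetical names chosen for illustration; they are not provided by Base or Missings.jl.

```julia
using Statistics

# Hypothetical alias playing the role of `?Int`.
const Maybe{T} = Union{Missing, T}

# Hypothetical higher-order function playing the role of `?f`:
# wrap every argument in skipmissing before calling f.
skipping(f) = (args...) -> f(map(skipmissing, args)...)

function myrepeat(thing::String, repeats::Maybe{Int})
    [thing for _ in 1:coalesce(repeats, 1)]
end

once  = myrepeat("a", missing)          # falls back to a single repeat
twice = myrepeat("a", 2)
avg   = skipping(mean)([1, missing, 2]) # mean of the non-missing values
```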

Benny · December 1, 2023, 6:07am

Infix operators go between arguments a+b and must be binary. You’re talking about prefix unary operators, and they have lower precedence than function call syntax (seems to go ^ < :: < calls < .) so parentheses are necessary to avoid errors or mistaken function calls. For example, !isnan(x) does boolean negation !(isnan(x)), not (!isnan)(x), you can check with Meta.@lower.
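
One way to check the parse without Meta.@lower is to inspect the expression tree directly. This is only a sketch of the existing ! behaviour; it says nothing about how a hypothetical ? operator would be parsed.

```julia
# `!isnan(x)` parses as a call to `!` whose argument is the call `isnan(x)`.
ex = Meta.parse("!isnan(x)")
negated_call = ex == :(!(isnan(x)))

# The composed form is a different expression and needs explicit parentheses.
ex2 = Meta.parse("(!isnan)(x)")
differs = ex != ex2

# `!f` applied to a function returns a new function that negates f's result.
notnan = !isnan
a = notnan(1.0)
b = (!isnan)(NaN)
```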

bkamins · December 1, 2023, 6:44am

Roughly:

julia> foldl(+, x) / length(x)
Float16(0.0)

tecosaur · December 1, 2023, 7:27am

Ah yea, oops yea I meant a “unary ? operator”. Pity the operator precedence doesn’t work out nicely here, I suppose the corrected example would actually be (?mean)([1, missing, 2]) which isn’t as nice, but still seems decent to me.

jules · December 1, 2023, 8:51am

Maybe the other way around then, ?name for a name (although that’s maybe a bit weird given the existing naming rules) and name? for a postfix operator which will come before a call name?(args...).
