That’s an interesting thought. I’m not sure if using `?` as an identifier is a viable solution, but from a design perspective I’m much more a fan of adding a function than adding new syntax (leaving aside my appreciation of `predicate?` functions), which I find strong motivation for exploring the potential of `?` as an identifier over `?` as new syntax if possible.
We can allow `?` as a general-purpose postfix operator; it doesn’t have to be only for missings. Then Missings.jl would just be one package making use of it.
It could look like this:
- Allow `?` as a postfix operator like `'` (but without any method definition in Base, initially at least)
- In Missings.jl define `?` as a shorthand for `skipmissing` (this could also go in a new ShortMissings.jl package or whatever)
- Fix `mean` to work well with `skipmissing` (cf. this comment)
- Add overloads to `cor`, etc. to properly support things like `cor(skipmissing(a), skipmissing(b))`
- Add support for skipmissing-boolean indexing in DataFrames.jl
At this point we can do things like this:
mean(x?)
cor(x?, v?) # same behavior as polars or pandas
df[(df.x .> 0)?, :]
But let’s go further:
- In Missings.jl define `?(f::Function)` to make wrappers that skip missing values: `cor?(x,v)` would mean `MissingSkipper(cor)(x,v)`, which would eventually call `cor(x?, v?)`

It’s a nice shortcut, but especially useful for cases like this:
combine(gdf, :value => mean?)
Here’s a working prototype with `'ˢ` instead of `?`:
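using Statistics  # provides mean and cor

struct MissingSkipper{T}
    f::T
end

# Postfix 'ˢ on an array wraps it with skipmissing
var"'ˢ"(x::AbstractArray) = skipmissing(x)
# Postfix 'ˢ on a callable returns a wrapper that applies skipmissing to every argument
var"'ˢ"(f::Base.Callable) = MissingSkipper(f)

(s::MissingSkipper)(args...; kwargs...) =
    (s.f)(skipmissing.(args)...; kwargs...)

x = [1, 2, missing, 3]

julia> mean(x'ˢ)
2.0
julia> mean'ˢ(x)
2.0
# would work if cor(x::SkipMissing, y::SkipMissing) was defined:
julia> cor'ˢ(x, x)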
I see.
Is it possible that Statistics implements either (both) of the naive algorithms, but not something more numerically stable, such as
function avg(x)
    a = zero(eltype(x))
    for i in eachindex(x)
        a += (x[i] - a) / i
    end
    return a
end
?
julia> x = rand(Float16, 10^6);
julia> mean(x)
NaN
julia> avg(x)
4.92920e-01
I think `f?(x)` looks pretty clean and is certainly easier to type, but I do have some reservations about adding a third meaning to `?`. Imagine trying to teach someone Julia and saying, if you type a question mark like this
julia> ?f
it means get help for `f`. If you type it like this
d = a ? b : c
it means return `b` if `a` is true. But if you type it like this
julia> f?(x)
it means skip missing values when applying `f`.
Perhaps this isn’t too hard to reason about, but I could imagine someone constructing a pathological case where they are trying to skip missing values in a ternary operator and have `?` all over the place.
Yes - you can see the implementation in the source code of Statistics.jl.
Thanks, the links from the documentation are broken!
One does wonder why `mean(itr)` does not use the numerically stable streaming algorithm. (Your implementation can be generalized to work for non-indexable iterators.) It seems to work quite nicely:
julia> x = rand(Float16, 10^6);
julia> mean(x)
NaN16
julia> avg(x)
Float16(0.498)
With a more sane definition of the mean of an iterator,
function avg(itr)
    a = zero(eltype(itr))
    for (i, x) in enumerate(itr)
        a += (x - a) / i
    end
    return a
end
we get
julia> let x = rand(Float32, 100_000_000), sx = skipmissing(x)
mean(x), mean(sx), avg(sx)
end
(0.4999912f0, 0.16777216f0, 0.4999802f0)
With the `MissingSkipper` wrapper proposed above, the behavior of adding `skipmissing` to every argument would just be a fallback. The wrapper function can be overloaded to have the proper behavior for each function:
(s::MissingSkipper{typeof(quantile)})(x, p; kwargs...) =
    quantile(skipmissing(x), p; kwargs...)
Then `quantile?(x, p)` would work as expected.
But if `?` supports both vectors and functions as proposed, then we could also just write
quantile(xs?, [0.1, 0.2])
reduce(+, xs?)
We should add a `mean(itr::SkipMissing)` method which calls `sum(x -> coalesce(x, zero(x)), itr.x)` so that pairwise summation is used. The actual definition will be a little trickier, as `mean` accumulates values in the return type to avoid overflows (e.g. `Float64` when input values are `Int8`).
EDIT: The problem is that, contrary to arrays, we don’t know the number of elements in advance, so we can’t just call `sum`. As for iterators, we need to count elements and accumulate them at the same time.
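To make the "count and accumulate at the same time" point concrete, here is a minimal sketch of a single-pass mean over an arbitrary iterator. The name `counted_mean` and the choice of a `Float64` accumulator are assumptions made for illustration; this is not how Statistics.jl is implemented, and it uses plain sequential summation, so it does not by itself give the accuracy of pairwise summation.

```julia
# Hypothetical helper (not part of Statistics.jl): one pass over the iterator
# that accumulates the running sum and the element count together.
function counted_mean(itr)
    total = 0.0   # Float64 accumulator (an assumption): avoids overflow for e.g. Int8 input
    n = 0         # number of elements seen so far
    for x in itr
        total += x
        n += 1
    end
    n == 0 && throw(ArgumentError("mean of an empty iterator is undefined"))
    return total / n
end

m1 = counted_mean(skipmissing([1, 2, missing, 3]))  # length is unknown up front
m2 = counted_mean(Int8[100, 100, 100])              # the Int8 sum would overflow; Float64 does not
```

The count cannot be taken from `length` here, because a `SkipMissing` iterator does not know how many elements it will yield until it has been traversed.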
But the underlying issue is not `SkipMissing`; it’s the implementation of `mean` for arbitrary iterators. It affects other iterators besides just `SkipMissing`:
julia> A = rand(Float32, 100_000_000);
julia> itr = (rand(Float32) for _ in 1:100_000_000);
julia> mean(A), mean(itr)
(0.5000529f0, 0.16777216f0)
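The value 0.16777216f0 looks like the signature of a sequential `Float32` sum that stops growing at 2^24 = 16777216: from that point on the spacing between adjacent `Float32` values is 2, so adding any value in [0, 1) rounds back to the same number. The snippet below is only a sketch of that arithmetic, assuming the iterator path accumulates sequentially in `Float32`; it does not call `mean` itself.

```julia
s = 2f0^24                       # 16777216f0: Float32 spacing is 2 from here on
stuck = (s + 0.999f0 == s)       # an addend below 1 is rounded away entirely
ratio = s / 100_000_000          # "mean" obtained if the running sum saturates at 2^24
```

Here `ratio` equals the value reported for `mean(itr)` above, which is consistent with this reading.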
Why can’t we have a streaming `mean` implementation for arbitrary iterators that has better numerical accuracy? I’ve opened a GitHub issue about it:
You don’t want `cor(x?, y?)` to expand to `cor(skipmissing(x), skipmissing(y))`, but to `cor(skipmissings(x, y)...)`, where `skipmissings` is the function defined in Missings.jl. More generally, `f(x?, y, z?)` needs to expand to something like
_x, _z = skipmissings(x, z)
f(_x, y, _z)
The point is that the set of elements to skip is a joint property of all the iterators participating in the skipping, so you can’t simply apply `skipmissing` to each iterator individually. I’m not sure you can accomplish this with a postfix operator alone, but probably with a macro given parser support.
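A small illustration of why the skipped set is a joint property, using only Base functions (the variable names are made up for this example). Skipping each vector on its own can leave sequences of equal length whose elements no longer correspond to the same observations:

```julia
x = [1, missing, 3, 4]
y = [10, 20, missing, 40]

# Skipping independently: the lengths happen to agree, but the pairs are misaligned
# (3 is paired with 20, although they come from different positions).
xs_indep = collect(skipmissing(x))
ys_indep = collect(skipmissing(y))

# Skipping jointly: drop position i when either x[i] or y[i] is missing.
keep = .!(ismissing.(x) .| ismissing.(y))
xs_joint = x[keep]
ys_joint = y[keep]
```

Only the joint version keeps the pairs `(1, 10)` and `(4, 40)`, which is what a function like `cor` needs.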
What you propose is one way to do it, but it’s hard-coding the skip behavior before the call to `cor`, which is less flexible than my version.
If we expand `cor(x?, y?)` to `cor(skipmissing(x), skipmissing(y))`, then we can overload `cor(x::SkipMissing, y::SkipMissing)` to do the right thing: the equivalent of `skipmissings` as you say, or maybe something else if so instructed by a keyword argument.
Note that `skipmissing(x)` is not doing anything except putting a wrapper around `x`. So you can do it to individual arguments, send that to the function and leave the function to deal with it properly. I think in some cases `skipmissings` is not the right thing to do, so it’s good to have the flexibility of dealing with each argument as appropriate in each particular function.
Why not? It even works for functions like `quantile(x, p)`, where only one argument should get `skipmissing`, as described here.
Edit: mmm, maybe what I say works technically but violates the semantics of `skipmissing`? Then it could be solved by using a different wrapper type?
There’s a certain sort of analogy here with broadcasting. We could’ve similarly chosen to have each argument opt into a broadcast with something like `f(.x, y, .z)` instead of `f.(x, Ref(y), z)`. Or we could’ve gone even more explicit and had a way of tagging which dimensions should be “extruded” (or repeated) out. While there are things to be desired with the status quo, it does cover most of the typical use-cases.
In other words, it could make sense to do the “all arguments” thing and maybe have a simple way of “excepting” some arguments from consideration. The goal here — like broadcasting — would be a simple shorthand that gets most folks most of the way there for most usages.
I suppose what you could do without macros is make the `MissingSkipper` function wrapper a flag to change the semantics of `SkipMissing` to act jointly across arguments rather than individually. In that case, you would probably not be interested in `cor(x?, y?)`, but rather `cor?(x?, y?)`. That’s a lot of question marks, but it could work.
But @mbauman makes a good point with the comparison to broadcasting and the argument for keeping it simple. If `f?(args...)` means `f(skipmissings(args)...)`, you could always write
myquantiles(x) = quantile(x, [0.1, 0.5, 0.9])
myquantiles?(y)
I’d rather change the documentation of `skipmissing(itr)` from

Return an iterator over the elements in `itr` skipping `missing` values. The returned object can be indexed using indices of `itr` if the latter is indexable. […]

to

Return a `SkipMissing` wrapper for `itr`. The result is an iterator that skips `missing` values. Functions can dispatch on the `SkipMissing` type to implement appropriate behaviors and algorithms for handling missing values. The returned object can be indexed using indices of `itr` if the latter is indexable. […]
Then we can use `SkipMissing` (the already existing wrapper used by `skipmissing`) as a flag, and both `cor?(x, y)` and `cor(x?, y?)` would make perfect sense.
A third useful option is skipping an iteration if `missing` is found in only some of the iterables, not all.
No, I don’t think this would be useful, because we want `fun?(x, y)` to work even if the person who wrote `fun` didn’t ever think about `missing`. If `?` is included, then Julia performs some reasonable pre-processing to make it work.
Isn’t that exactly what `skipmissings` does?