A modest `missing`s 2.0 proposal

This came up on Slack the other day and I want to formalize a little proposal here. What I describe is not currently possible, either under SemVer or in the compiler.

Problem 1:

names = ["Roger", missing]
names[2] + 5

The second line should error, because no one’s name should ever really be a number. But this will return missing and propagate.
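To make Problem 1 concrete, here is a minimal sketch of what happens today. It only uses Base; the variable names `result` and `threw` are mine.

```julia
names = ["Roger", missing]

# The container's element type records that entries are strings or missing...
eltype(names)            # Union{Missing, String}

# ...but the value `missing` itself carries no such information,
# so arithmetic on it propagates instead of throwing.
result = names[2] + 5
ismissing(result)        # true

# The non-missing entry fails as one would hope: there is no +(::String, ::Int) method.
threw = try
    names[1] + 5
    false
catch err
    err isa MethodError
end
```

So the error depends on which element you happen to hit, not on the column being a column of names.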

Problem 2:

[1, 2] + missing

I think most people imagine missing as standing in for a scalar-like value: a number, a string, etc. The fact that this doesn’t error is unintuitive.

Solution:

When defining a type, abstract or not, the package writer “opts in” to missingness.

struct T
    x
end
Base.:+(t::T, y) = t.x + y  # extend Base's + rather than shadowing it

struct S
    x
end
Base.:+(s::S, y) = s.x + y

allow_missing!(S) # can now work with `missing`


T(1) + missing{T} # errors
S(1) + missing{S} # works

Now any function that accepts arguments of type S will also accept arguments of type missing{S} and they will propagate missing values.

We would define

allow_missing!(Number)
allow_missing!(AbstractString)
allow_missing!(AbstractDate)

and then we would never have to worry about adding new methods to existing functions to propagate missing again.

There are obviously a lot of issues to work out. For instance, what about the following?

f(x, y) = (x, y)
f(missing{S}, 5)

Clearly we don’t want the function to return missing. Rather, it should return (missing{S}, 5). Perhaps we can add a @propagate macro that tells the function to pass along missing values, or a @nopropagate macro if the alternative is more common.

You’re aware of the “counterfactual return type” issue that @johnmyleswhite described in the Nullable Julep, right? The problem is that in practice the type of missing{T} can only be known via inference, which isn’t always guaranteed to give a concrete type.

I find your vision interesting. I agree the current behavior is not ideal, and the behavior you desire seems better to me. However, it seems to me that following Julia’s philosophy of “nothing is special”, missing should be implemented with existing machinery (or with new machinery that is open to all language users, not only used for missing specifically).

Thanks for this link. This goes more in depth than I could hope to.

I know this was discussed at length pre 0.7. I just didn’t want this discussion to get totally lost in Slack.

I see your point. You don’t want inference to fail, get missing{Any} and then users get a MethodError and figure out at what point their missing propagation breaks down.

I do like missingness being something that the creator of a type opts into for the type as a whole, rather than method-by-method, which isn’t addressed in the Julep.

I would separate out uniform handling of missing from parametric missing{T}. Uniform handling (i.e. automatic lifting) is definitely possible with macros. I didn’t finish implementing it in Volcanito, but you could make a good start by replacing all syntactic calls with the relevant branching code. It would end up looking a lot like the automatic TVL I already implemented in Volcanito: https://github.com/johnmyleswhite/Volcanito.jl/blob/master/src/query/expression_operations/passes/tvl.jl

I don’t think this is right if you come from the DB world. In particular, the SQL standard explicitly includes non-scalar column types like Array: https://crate.io/docs/sql-99/en/latest/chapters/10.html

Thanks for this link.

I think the problem with my proposal is that we don’t want propagation to return missing a la passmissing outside of a few “end-result” functions, like +, & etc. So my proposal doesn’t really solve the problem of “which functions propagate”.

I agree it’s probably best to work on a macro that makes propagation of functions easier and more consistent. If there were some sort of @definepropagatemissing macro that automatically defined methods of a function with a given signature for the various combinations of Missing inputs, we would see more missing propagation from developers. A behavioral nudge, so to speak.
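As a rough sketch of what such a macro could look like: the name, the explicit arity argument, and the example function `ratio` are all hypothetical, and nothing like this exists in Base or Missings.jl as far as I know. It just emits one method per non-empty combination of Missing argument positions.

```julia
# Hypothetical macro: for a function `f` of `n` positional arguments, define
# a method returning `missing` for every signature with at least one Missing.
macro definepropagatemissing(f, n)
    defs = Expr(:block)
    for mask in 1:(2^n - 1)  # each non-empty subset of argument positions
        sig = [((mask >> (i - 1)) & 1) == 1 ? :(::Missing) : :(::Any) for i in 1:n]
        push!(defs.args, :($f($(sig...)) = missing))
    end
    return esc(defs)
end

# Illustrative function written without missing in mind.
ratio(x::Real, y::Real) = x / y

@definepropagatemissing ratio 2

ratio(1, 2)              # 0.5
ratio(missing, 2)        # missing
ratio(1, missing)        # missing
ratio(missing, missing)  # missing
```

Generating the all-Missing combination as well as the mixed ones keeps the new methods from being ambiguous with each other. It does nothing about keyword arguments or varargs.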

Perhaps I am an exception, but I prefer not to “imagine” these things, but look up the semantics. missing is explicitly intended to propagate.

Are you looking for nothing, which indeed errors in cases like this?

Or if neither of these serves your purposes, you can define your own type with the desired semantics — it should fit into the language like existing ones. There is no need to break the API for existing types.

Tangentially to the discussion about typed missings, I think the current missing propagation in Base works at too low a level of abstraction. It is impossible to know whether sin(missing) should error or “propagate” based only on the function sin and the input argument missing. That decision has to be made at a much higher level, by the person doing the data analysis, via e.g. skipmissing. Automatic propagation tries to automate away a problem that is fundamentally impossible to automate away.

In addition, every function written with a restriction on some argument type has to opt in to missing propagation explicitly, even if all its parts already propagate missing. In my opinion that is the big nail in the coffin for this automatic missing propagation business. For example:

mysincos(x::AbstractFloat) = sin(x), cos(x)

does not work with an assumption of missing propagation, even though sin and cos do. This, to me, pretty much screams that what is needed is some tool that composes with existing functions written without missing propagation in mind. skipmissing is one of those, explicit lifting of functions is another.
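Here is a small sketch of both tools applied to that function. The `lift` helper is a hypothetical name written out by hand for illustration (Missings.jl's passmissing plays this role in practice); everything else is Base.

```julia
mysincos(x::AbstractFloat) = sin(x), cos(x)

# Hypothetical helper (not in Base): wrap `f` so that it returns `missing`
# as soon as any positional argument is missing, and calls `f` otherwise.
lift(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

sin(missing)             # missing: Base defines this method
# mysincos(missing)      # MethodError: Missing is not an AbstractFloat
lift(mysincos)(missing)  # missing
lift(mysincos)(0.0)      # (0.0, 1.0)

# skipmissing is the other composable tool: drop the missing entries up front.
xs = [0.0, missing, 1.0]
results = [mysincos(x) for x in skipmissing(xs)]  # two tuples, one per non-missing entry
```

Neither approach requires touching the definition of mysincos, which is the point about composing with functions written without missing in mind.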

I agree that sin(missing) always propagating, even when the missing represents an unknown String, is a bit of a problem.

But ultimately I don’t think it’s friendly to users to require constant use of skipmissing and passmissing. It would substantially complicate julia code working with data and defeat the purpose of missing propagation entirely.

Parsing AbstractFloat? as Union{AbstractFloat, Missing} is a solution for making “top-level” functions work nicer with missing, but doesn’t solve your concern about a missing representing a String.
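For reference, this is what the expansion would amount to if written by hand today. The `AbstractFloat?` spelling is hypothetical syntax; the Union form below is ordinary Julia.

```julia
# Hand-written equivalent of the hypothetical `x::AbstractFloat?` signature.
mysincos(x::Union{AbstractFloat, Missing}) = sin(x), cos(x)

mysincos(0.0)      # (0.0, 1.0)
mysincos(missing)  # (missing, missing)
```

Note that the missing case comes back as a tuple of two missings rather than a single missing, because propagation happens inside sin and cos, not at the level of mysincos.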

I still think the best solution, currently, in 1.0, is to bite the bullet and add methods for missing when users point them out and there seems to be a reasonable use-case.

Users can’t rely on automatic missing propagation already, because it is almost arbitrary which functions support it. If you use a math function from Base it probably supports it, but if you use something from SpecialFunctions it might not (who knows), and if you happen to use some small “wrapper function” like what I wrote above it won’t. String functions, in general, are not missing propagating, for example. Basically, if you currently rely on automatic missing propagation, you just got lucky, it might be completely wrong, and a tiny change will break it, and then you have to go the skipmissing route anyway.
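A quick illustration of that unevenness using only Base. The helper `throws_method_error` is a name I made up for the check.

```julia
# Hypothetical helper: does calling f(x) raise a MethodError?
function throws_method_error(f, x)
    try
        f(x)
        return false
    catch err
        return err isa MethodError
    end
end

sin(missing)                             # missing: math functions in Base propagate
throws_method_error(sin, missing)        # false
throws_method_error(uppercase, missing)  # true: no uppercase(::Missing) method
```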

This is just something you fundamentally have to deal with. It doesn’t “complicate julia code”, the problem just has some inherent complexity.

That is very unlikely to happen.

In 2.0, would it be possible to define Missing <: X = true for all X? And have propagation be the default that package authors opt out of? If we want missing to propagate, then it should propagate; otherwise there isn’t a big benefit over just having a single nothing value which errors and then having lift functions for nothing. But Julia decided not to go that route.

I don’t think that is possible.

Personally, I think that would lead to more reliable code with fewer bugs. For example, the propagation of missing is documented as

missing values propagate automatically when passed to standard mathematical operators and functions

Can you write robust code based on that? How much data analysis is done solely with “standard mathematical operators”? You already pretty much have to assume non-missing propagation and deal with it.

I don’t think many other popular data science platforms take such an approach with missings. And users coming from Stata and R frequently get frustrated by Julia because missings aren’t propagated as much as they are in those languages.

It seems likely that data-focused packages like DataFramesMeta will use passmissing by default on functions, so that missing is propagated. However, this will unfortunately result in a bifurcated ecosystem where missing is treated differently than in Base.

I guess, ultimately, for 2.0 I would encourage Base devs to re-think the expectation that users put passmissing and skipmissing everywhere. I would really like to see more general propagation in Base in the future.

I think there’s really only one clear solution here: the ecosystem that wants to behave like SQL and offer seamless automatic lifting everywhere needs to be based upon macros that lower down to code with explicit lifting. It’s not a coincidence that almost all of the code in Volcanito is focused on this kind of expression-level rewriting.

Julia doesn’t let you say:

For all f, f(::Missing) = missing

But that’s not such a big deal: it’s fairly trivial to produce something like this with macros: replace all syntactic calls with the relevant branch. And there are parts of Julia (like short-circuiting Boolean operators) that will just never work unless you use macros because they can’t be extended at all.
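To give a feel for “replace all syntactic calls with the relevant branch”, here is a minimal sketch written for this post. It is not the Volcanito code, and the names `lift_expr`, `@lift` and `shout` are made up. It assumes plain positional calls only: keyword arguments, splatting and short-circuit operators are not handled.

```julia
# Leaves (symbols, literals, line-number nodes) are returned unchanged.
lift_expr(x) = x

# Rewrite every call f(a, b, ...) into a branch that evaluates the arguments
# once and returns `missing` if any of them is missing.
function lift_expr(ex::Expr)
    args = Any[lift_expr(a) for a in ex.args]
    ex.head === :call || return Expr(ex.head, args...)
    f, operands = args[1], args[2:end]
    tmps = [gensym("arg") for _ in operands]
    bindings = [:($t = $o) for (t, o) in zip(tmps, operands)]
    anymissing = foldr((t, rest) -> :(ismissing($t) || $rest), tmps; init = false)
    return quote
        let
            $(bindings...)
            $anymissing ? missing : $f($(tmps...))
        end
    end
end

macro lift(ex)
    return esc(lift_expr(ex))
end

# A function written with no thought for missing.
shout(s::AbstractString) = uppercase(s) * "!"

@lift(shout("hi"))                # "HI!"
@lift(shout(missing))             # missing, although shout has no Missing method
@lift(sqrt(abs(missing)) + 1)     # missing, nested calls are rewritten too
```

Because the rewrite is purely syntactic, it applies to any function that appears in the expression, which is exactly what a method definition cannot express.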

Yes. Even more than that, I think one of the really nice things about the new DataFrames piping functions is that you pass a function to transform via src => fun => dest. With this, you don’t even need to re-write the call, you can just use src => passmissing(fun) => dest. In DataFramesMeta we would be able to catch things like src => (t -> fun.(t)) => dest to do passmissing on broadcasted functions as well.

And there are parts of Julia (like short-circuiting Boolean operators)

Another general solution would be for the ability to declare custom unicode shortcuts. That way we can choose unicode characters and assign them \and<TAB> and \or<TAB>. That way we can take advantage of binary unicode operators without making the user remember specific commands.

EDIT: This was already done here via \And and \Or. If it’s possible to emulate short circuiting with a user-defined function, then I can make a PR to Missings.jl to make this easier.

That doesn’t make && work with missing, right? So you still end up with a parallel universe to normal Julia IIUC.
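For anyone following along, the difference between the extensible operators and the short-circuiting ones can be seen with Base alone (the variable `threw` is just for the check):

```julia
# & and | are ordinary functions, so Base defines three-valued logic for them.
missing & false   # false
missing & true    # missing
missing | true    # true

# && is syntax, not a function: it demands a Bool on the left-hand side.
threw = try
    missing && true
    false
catch err
    err isa TypeError
end
```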

Yes, I think in the context of DataFramesMeta this is best handled with a passmissing equivalent that returns false.
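Something along these lines, as a sketch. The name `missingfalse` is hypothetical and is not part of Missings.jl or DataFramesMeta.

```julia
# Hypothetical wrapper: like a lifting wrapper, but a missing argument
# yields `false`, so the result is always usable as a filter condition.
missingfalse(f) = (args...) -> any(ismissing, args) ? false : f(args...)

ages = [25, missing, 40]
over30 = filter(missingfalse(>(30)), ages)   # keeps only 40

# For a single comparison, Base's coalesce gives the same effect.
coalesce(missing > 30, false)                # false
```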

This is very much a rough draft with a bunch of problems (handling zero argument functions, fixing some escaping issues, handling keyword arguments), but this gives you a sense what a @lift macro that just automatically does lifting could look like: https://gist.github.com/johnmyleswhite/24112cc02b93d30c2fe002d6115d03b4
