Why can't we merge Missing and Nothing?

I get that Missing and Nothing serve different purposes. They both encode that there is no particular value available, but the former indicates that usage is permitted as long as it is consistent with the multivalued state space (e.g., true || missing = true) while the latter indicates that usage is not permitted (most operations on nothing fail). A slightly contrived example of these two use cases can be found in mathematics:

  • no particular real value can be assigned to sqrt(-1) because none of them solve x^2 = -1 (so one may return nothing or fail if restricted to reals).
  • no particular real value can be assigned to 0/0 because they all solve 0*x = 0 (so one may return missing or fail).

One may observe that a single floating-point value, NaN, covers both cases in the IEEE standard. I love the fact that Missing and Nothing are independent from other types (unlike NaN), but why do we need two of them? Can you find examples where merging the two types into one would cause any significant difficulty or confusion?

EDIT
Rough summary can be found at post #20.

With missing you opt into three-valued logic (as in the example you gave). I never want that because it would hide bugs in my code, so I use nothing.
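To make the three-valued logic concrete, here is a minimal sketch using only Base operators. The variable name nothing_throws is just for illustration.

```julia
# Three-valued logic with `missing`: the result is only `missing`
# when the unknown operand could actually change the outcome.
@assert (true | missing) === true
@assert (false | missing) === missing
@assert (false & missing) === false
@assert (true & missing) === missing

# Comparisons and arithmetic propagate `missing` instead of failing.
@assert (missing > 3) === missing
@assert (1 + missing) === missing

# `nothing` does not take part in any of this: the same kind of operation throws.
nothing_throws = try
    true | nothing
    false
catch err
    err isa MethodError
end
@assert nothing_throws
```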

The whole Missing story could kind of be a package (it used to be), with the exception that it is used in some cases like find on arrays which might have missing values, and you would need to do type piracy to support that. Nothing, on the other hand, is quite a bit more fundamental and is ingrained much more deeply in the language (it is even part of the iteration protocol, and it is the implicit return value).
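A small sketch of the two places mentioned here where nothing shows up in the language itself. The function do_side_effect is a hypothetical helper, not something from Base.

```julia
# Iteration protocol: `iterate` returns `nothing` once the collection is exhausted.
first_step = iterate([10])                   # (element, next state)
second_step = iterate([10], first_step[2])   # nothing: no elements left

# Implicit return value: a bare `return` yields `nothing`.
function do_side_effect(io, msg)   # hypothetical helper
    println(io, msg)
    return
end
```

Calling do_side_effect(devnull, "hi") gives back nothing, just like print does.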

I do not know the age of the third participant, it is missing.
print is called for its side-effect, it returns nothing.

Since you understand the fact that missing is different from nothing and why this is so (see docs), I assume you’re asking for a concrete example:
I have a bunch of temperatures and the date I measured them at. I want to analyze these. But I messed up the measurement in a couple of those, and I don’t have the temperatures for those dates (say the thermometer broke). My analysis includes checking if the temperature is larger than some threshold and if it was taken after a certain date. Without missing you wouldn’t be able to run that error-free:

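using Dates
# Temperatures, one of which was lost (missing)
temp = [2,5,3,7,missing,1]
# One measurement date per temperature: 2001-01-01, 2002-01-01, ...
dt = Date.(2000 .+ (1:length(temp)))
# Three-valued logic: `missing | true` is true, so every entry evaluates to true
[(t > 3) | (d > Date(2000)) for (t,d) in zip(temp, dt)]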

@kristoffer.carlsson came closest to a satisfying answer. In many cases you could merge them into NotAvailable:

  • I do not know the age of the third participant, so it is notavailable.
  • print is called for its side-effect, its return value is notavailable.

My (perhaps naive) take on it was that perhaps it should not be the type, but the way you use it, that determines whether it behaves as Nothing or Missing. Just like you have skipmissing to tell how Missing should be treated.
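For reference, this is roughly what choosing the treatment at the call site looks like today with Base functions. The data is made up for illustration.

```julia
temps = [2, 5, 3, 7, missing, 1]

# Propagate: a reduction that touches `missing` is itself `missing`.
total_unknown = sum(temps)               # missing

# Skip: ignore the unknown entries.
total_known = sum(skipmissing(temps))    # 18

# Substitute: replace unknown entries with a default.
filled = coalesce.(temps, 0)             # [2, 5, 3, 7, 0, 1]
```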

There is an acronym for that, “DWIM” (do what I mean).

Nahh, scratch that. I would rather have the caller actively choosing whether something should behave like Nothing or Missing. Transferring this choice of interpretation to the callee sounds like trouble. The type is a contract.

The simplest answer is that there were enough people like me that wanted something to behave like missing does, and enough people like Kristoffer that said having nothing behave this way would be terrible (this is probably the most succinct encapsulation of the difference in desired behavior).

But you want a specific example where we need both:

1 + missing # missing

m = match(r"foo", "bar") # nothing

1 + m # error

I would not want an unmatched regular expression to return missing (or notavailable), because the result is known. There’s no match. And in case I do something stupid later, like trying to add a number to it, I want to be told I’m doing something stupid rather than letting it propagate.
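A sketch of how the "no match" case is typically handled with an explicit branch. The function first_number is hypothetical and only there to show the pattern.

```julia
# Hypothetical helper: extract the first run of digits as an Int,
# or return `nothing` when there is no match.
function first_number(s::AbstractString)
    m = match(r"\d+", s)
    m === nothing && return nothing   # explicit branch for "no match"
    return parse(Int, m.match)
end

first_number("order 42")    # 42
first_number("no digits")   # nothing
```

Forgetting the branch and using the result of match directly in arithmetic fails immediately, which is the behavior described above.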

Say I develop a package to generate data from some source (e.g., scanning webpages). When a data point cannot be fetched, should I mark it using Missing or Nothing? It seems the generator of data, not the user of this data, decides which operations are allowed on it?

My take on it would be that the user should either generate the data with a specific behavior in mind (parsing the desired type), or generate it using Nothing (the default) and then convert to Missing where appropriate. Can this conversion be done fast? Is this the most appropriate solution?

I’d say that it depends on whether you generally expect to get the data point. If this is a simple query with an a priori unknown result (which it sounds like), you should return nothing, just like findfirst and companions do for, say, arrays.

That’s largely up to you. If you want to mark it for trying later, for example, nothing, or a custom object with timestamps, URIs, HTTP status codes, and similar data could be a reasonable choice.

If you want to treat it the same way as a non-response in a survey, missing could work.

I hear that one should generally use nothing (or other appropriate types) to mark failure in data capture. Even non-response in a survey could be given this type initially, as one would then be able to distinguish cases for which we are still hoping for a response (no data point is a failure) from cases where we have accepted that the response will never become available (missing). Ultimately the data capture must be rejected or finalized for long-term storage (in which case all unavailable data points must be missing).

In conclusion, the natural process for maturation of a newly generated dataset is from Union{T,Nothing} to Union{T,Nothing,Missing} to Union{T,Missing}. Any comments? I personally feel that this should be written down somewhere (anyone up for a blogpost?) with examples showing the cleanest way to make it happen.
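One possible way to write the three stages down, as a rough sketch rather than an established recipe. The data and the variable names raw, triaged and final are made up.

```julia
# Stage 1: raw capture, failed fetches are marked with `nothing`.
raw = Union{Float64,Nothing}[21.5, nothing, 19.0, nothing]

# Stage 2: one failure is accepted as permanently unavailable (missing),
# the other is still pending a retry (nothing).
triaged = Union{Float64,Nothing,Missing}[21.5, missing, 19.0, nothing]

# Stage 3: finalize by turning every remaining `nothing` into `missing`
# and narrowing the element type explicitly.
final = Vector{Union{Float64,Missing}}(replace(triaged, nothing => missing))
```

The conversion is a single pass over the data via replace, so it should not be a bottleneck for typical datasets.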

The classic distinction between missing and nothing is epistemological vs. ontological missingness. That is, missing represents something that has a value (but you don’t know it), whereas nothing represents something that simply has no value.

There are some more details here, although they are mixed with implementation design:

https://julialang.org/blog/2018/06/missing

The way nothing is used in many Julia protocols illustrates this: if you call findfirst(x, a) and x does not appear in a at all then what index can you return? There is no correct index, so findfirst returns nothing. If on the other hand, we knew that x occurs somewhere in a but we don’t know where, then you would want to represent that with missing.
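A minimal illustration of the findfirst case with made-up data:

```julia
a = [10, 20, 30]

idx = findfirst(==(20), a)     # 2
none = findfirst(==(99), a)    # nothing: there is no correct index

# Callers branch on `nothing` before indexing.
value = none === nothing ? "not found" : a[none]
```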

Let me illustrate what bothers me about having to distinguish “has a value, but it is unknown” and “has no value” in programming:

  1. You may not know. If there is no information about siblings, you do not know whether to set NameOfSister to missing (“has a value, but it is unknown”) or to nothing (“has no value”).

  2. No standard for optional arguments because the types are loaded with meaning:
    # Input, if you have a coupon
    buy(..., coupon::Union{String,Nothing}=nothing)
    # Input, if you know the DNA of your mom.
    healthanalysis(..., mom::Union{DNA,Missing}=missing)

  3. Introduction of missing values into code that may not be written to handle them (kudos to @kristoffer.carlsson). Would you dare to define the generally intractable isconvex(function) or the generally undecidable ishalting(callable) as a Union{Bool,Missing} even though they are definitely of the “has a value, but it might be unknown” type? I never want that because the three-valued logic would hide bugs in my code.

What is more appropriate in my opinion is to distinguish “value not available (if it exists, it is unknown)” and “definitely has a value, but it is unknown”. The former would be the standard go-to type for all items in the enumeration above, whereas the second would be a specialization you opt into when needed. In practice, this is how I see Nothing and Missing already being used today.

You should not think of nothing and missing as solutions that cover all possible scenarios. They are intended to work in some very common ones, but you are free to design your own more elaborate extensions or alternatives — what’s nice about Julia is that they will be given equal treatment (compiler optimizations, etc).

API design and data encoding are both hard problems. You should think of Julia not as a solution, but as a toolbox to iteratively build a solution.

As for your particular questions: it is hard to say more without context, but

  1. I would not define NameOfSister at all. Perhaps a siblings accessor, the elements of which I could then query with name and gender, which could return missing.

  2. For optional arguments, I would use nothing if that would take me to a different branch (explicitly, in the code), missing if I wanted to rely on generics. But that’s just my own style and I don’t follow it consistently.

  3. I would not define isconvex and ishalting at all, because I don’t know how to implement them (for generic objects). In fact, I would not define function return types at all, that’s what we have a compiler for.
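To illustrate the style described in point 2, here is a sketch with two hypothetical functions, price and bmi. Neither comes from the thread; they only show the two patterns side by side.

```julia
# `nothing` selects a different branch, explicitly, in the code.
function price(base::Real; coupon::Union{String,Nothing}=nothing)
    coupon === nothing && return base
    return coupon == "HALF" ? base / 2 : base
end

# `missing` relies on generics: it simply flows through the arithmetic.
bmi(weight_kg, height_m) = weight_kg / height_m^2

price(100)                   # 100
price(100; coupon="HALF")    # 50.0
bmi(missing, 1.8)            # missing
```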

It is pretty customary to represent an optional never-nothing field by Union{Nothing, Typ}, while representing partial data (e.g. unfetched data) as Union{Missing, Typ}. For optional fields that can be nothing, a common idiom is Maybe{Typ}. The third customary way of representing non-present data is by sentinel values. The built-in sentinel value for gc-managed types is nullpointer.

However, nullpointer is not officially encouraged for this purpose and pretty annoying, because it only works for gc-controlled types (not for bitstypes), which leaks abstractions and can break on upgrades, and you need to check isassigned before access, and you need to manipulate pointers to “unassign” a field or array entry. Nevertheless, nullpointer is the most performant way and is emitted by inner constructors that don’t set all fields, as well as Array{T,N}(undef, sz). An alternative to nullpointers is to use an explicit sentinel value like const _nil = Typ(args...) and then check if foo.bar === _nil (you want triple equality here).
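A short sketch of both techniques from this paragraph. The type Tag, the constant NIL_TAG and the function is_unset are hypothetical names; the sentinel type is mutable on purpose so that === compares identity rather than contents.

```julia
# Uninitialized slots of a non-bitstype array are undefined references.
labels = Vector{String}(undef, 2)
labels[1] = "Ada"
isassigned(labels, 1)   # true
isassigned(labels, 2)   # false: reading labels[2] would throw UndefRefError

# Explicit sentinel: one unique instance, compared by identity.
mutable struct Tag           # hypothetical type
    name::String
end
const NIL_TAG = Tag("")      # hypothetical sentinel instance

is_unset(t::Tag) = t === NIL_TAG   # identity check, not value comparison
```

With this setup is_unset(NIL_TAG) is true while is_unset(Tag("")) is false, even though both tags carry the same name.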

Given these customs, the answer would be missing or a more specific type like struct FetchError code::Int32 end that encodes information about the error (e.g. network error, parsing error, successfully fetched and parsed but pruned at later times).
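As a sketch of the more specific option, assuming made-up fetch results and the FetchError struct as spelled above:

```julia
# Error marker carrying a status code.
struct FetchError
    code::Int32
end

# Hypothetical fetch results: values, known failures, and unknown values.
results = Union{Float64,FetchError,Missing}[1.5, FetchError(404), missing, 2.5]

# Failures stay distinguishable from merely unknown values.
failure_codes = [r.code for r in results if r isa FetchError]   # one entry: 404
fetched = [r for r in results if r isa Float64]                 # [1.5, 2.5]
```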

If in doubt, use nothing, is what I would suggest. Even if you theoretically could have NameOfSister as missing it would still likely hide bugs by allowing you to run 1+NameOfSister and false & NameOfSister without getting an error.

I’m unclear in any of your cases what’s gained by having only one option.

This is true - is it important to distinguish between them? If so, you need 2 types, if not, pick one and document it.

This is surely better than types that have no meaning at all. If nothing (heh) is meant by nothing, why would it be any more sensible to use as a default argument? Again, if that’s your only option, you’re no better off using it than you are under the current state of things, and you foreclose the possibility of a type that propagates for those of us that need it.

This seems like a bug, and you should raise an issue in the offending package. Or, if missing should not be passed to your method, don’t define methods that allow for missing. Users that accidentally generated missings upstream will be hit with MethodErrors (this happens to me all. the. time), and then will need to make decisions about how to handle them.
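A sketch of what "don’t define methods that allow for missing" looks like in practice. The function celsius_to_fahrenheit is a hypothetical example.

```julia
# Hypothetical function that deliberately accepts only real numbers.
celsius_to_fahrenheit(t::Real) = t * 9 / 5 + 32

celsius_to_fahrenheit(20)    # 68.0

# `missing` is not a `Real`, so the call fails loudly with a MethodError.
rejected = try
    celsius_to_fahrenheit(missing)
    false
catch err
    err isa MethodError
end

# The caller then decides how to handle it, e.g. by propagating explicitly.
convert_or_missing(t) = ismissing(t) ? missing : celsius_to_fahrenheit(t)
```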

Again, I don’t see how this problem is solved by only having nothing. In fact, it seems like I am (or people like me are) more likely to monkey patch something to handle nothing in a way that propagates and then silently break a bunch of code that expects nothing to have a specific meaning.

Since this thread is now a wall of text and many things were not said clearly enough, I made a monologue to summarize our discussion so far:

Me in post #1: I get that Nothing and Missing serve different purposes, but can we somehow fulfill these different use cases with just one type?
You in post #2: No.
Me in post #7: Ahh, I get it. The type is a behavioral contract. I accept the premise of two distinct types.
Me in post #9: But can we then distill concrete guidelines for when exactly one should use one or the other type for a specific purpose?
Me in post #15: The classic interpretation, to use Missing when something has a value (but it is unknown) and Nothing when it has no value, is definitely flawed for the listed reasons.

Many good inputs to the discussion so far…
