Why is missing defined as a singleton?

Currently type Missing is defined as struct Missing end
and the singleton instance of that type is defined as
const missing = Missing()

But what if we defined the following:

struct Missing
   why::Int
end
Missing() = Missing(1)
const missing = Missing()

missing would no longer be a singleton, but now we could capture
the reason why a value is missing. For example

const NI   = Missing(1)    # no information
const INV  = Missing(11)   # invalid
const DER  = Missing(111)  # derived
const OTH  = Missing(112)  # other
const NINF = Missing(1121) # negative infinite
const PINF = Missing(1122) # positive infinite
const UNC  = Missing(113)  # uncoded
const MSK  = Missing(12)   # masked
const NA   = Missing(13)   # not applicable
const UNK  = Missing(14)   # unknown
const ASKU = Missing(141)  # asked but unknown
const NASK = Missing(142)  # not asked
const NAVU = Missing(143)  # not available
const NAV  = Missing(1431) # temporarily unavailable
const QS   = Missing(144)  # sufficient quantity
const TRC  = Missing(145)  # trace
const NP   = Missing(2)    # not present

myvalue = ASKU
if ismissing(myvalue) ...

Back to the original question: why has missing been defined as a singleton?

From what I read here, when you have a vector x of type Vector{Union{Missing,Float64}}, a second array of the same size is created, recording whether the value at position x[i] is missing or not. Using your approach would imply storing the reason somewhere as well, with the corresponding cost.
The answer to your question is mentioned explicitly:

One of Julia’s strengths is that user-defined types are as powerful and fast as built-in types. To fully take advantage of this, missing values had to support not only standard types like Int, Float64 and String, but also any custom type. For this reason, Julia cannot use the so-called sentinel approach like R and Pandas to represent missingness, that is reserving special values within a type’s domain. For example, R represents missing values in integer and boolean vectors using the smallest representable 32-bit integer (-2,147,483,648), and missing values in floating point vectors using a specific NaN payload (1954, which rumour says refers to Ross Ihaka’s year of birth). Pandas only supports missing values in floating point vectors, and conflates them with NaN values.
In order to provide a consistent representation of missing values which can be combined with any type, Julia 0.7 will use missing, an object with no fields which is the only instance of the Missing singleton type. This is a normal Julia type for which a series of useful methods are implemented. Values which can be either of type T or missing can simply be declared as Union{Missing,T}. For example, a vector holding either integers or missing values is of type Array{Union{Missing,Int},1}:
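The example that followed the quoted passage is not reproduced in this thread. As a minimal sketch of the kind of vector it describes (the variable names are mine, and Base.isbitsunion is an internal, unexported helper that may change between Julia versions):

```julia
# A literal mixing integers and missing gets a Union element type.
v = [1, missing, 3]

# Element type of the vector: Union{Missing, Int}.
T = eltype(v)

# Internal helper (an assumption that it stays available): true when the
# elements are stored inline, with the "which member of the Union is it"
# information kept in a separate per-element tag.
inline = Base.isbitsunion(T)

# Missing entries are detected with ismissing and skipped with skipmissing.
second_is_missing = ismissing(v[2])
total = sum(skipmissing(v))

println(T, "  inline=", inline, "  total=", total)
```

The separate per-element tag is the "second array" mentioned above: it only records which member of the Union each slot holds, so there is nowhere in it to say why a value is missing.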

@longemen3000 thank you for the link to the blog article. It seems an opportunity was lost. Where the 2nd array indicates that the value is missing, the corresponding value in the 1st array is uninitialized. That is a pity. That slot in the 1st array could have been used to indicate why the data is missing. Maybe this design could be revisited at some point in the future.

In my opinion, there are some problems with storing a payload that way:

  1. Supporting missing for any type implies supporting missing on structs of arbitrary bit length (as of Julia 1.3, a type must be at least 8 bits wide, but that could change in the future).

  2. The missing propagation proposed here sounds a lot like NaN payload propagation. NaNs are generated by numerical errors such as 0.0/0.0 (in Julia, log(-3.0) throws a DomainError rather than returning NaN). There it makes sense to create a NaN with a payload representing the error and propagate it. As far as I know, Julia does not do anything special to propagate NaN payloads, and I don’t know of any popular library that implements it, apart from sentinel values. Furthermore, NaN payloads can only be used with floats.

  3. From what I understand, the missing concept is different from the NaN concept. The first is a notion of statistical missingness, whereas the presence of a NaN means there was a numerical error somewhere in your program.
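To make the "NaN with a payload" idea in point 2 concrete, here is a small sketch. The helper names nan_with_payload and payload are hypothetical (nothing like them exists in Base), and the bit layout is just the usual IEEE 754 double format, not necessarily the exact pattern R uses:

```julia
# Bits of a quiet NaN: exponent all ones plus the top mantissa ("quiet") bit.
const QNAN_BITS = 0x7ff8000000000000
# The remaining 51 mantissa bits are free to carry a payload.
const PAYLOAD_MASK = 0x0007ffffffffffff

# Hypothetical helper: build a NaN whose low mantissa bits hold `p`.
nan_with_payload(p::Integer) =
    reinterpret(Float64, QNAN_BITS | (UInt64(p) & PAYLOAD_MASK))

# Hypothetical helper: read the payload bits back out of a Float64.
payload(x::Float64) = reinterpret(UInt64, x) & PAYLOAD_MASK

x = nan_with_payload(1954)
println(isnan(x), " ", Int(payload(x)))
```

Whether a payload survives arithmetic such as x + 1.0 depends on the hardware and is not guaranteed here, which is part of why this trick is limited to floats and does not generalize to arbitrary types.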

That said, an alternative implementation is still possible, where the payload is stored in the remaining 7 bits of each entry of the hidden array (using 1 bit to represent true or false).

You could probably use this package to represent reason for missingness?

Not really: you can just define your own type for missingness with a payload, implement the relevant methods, and it will be as efficient as an implementation in Base would be. You can of course take advantage of existing functions (e.g. ismissing) by defining methods for them. That’s a great thing about Julia: there is no “secret sauce”, user code is on equal footing with “built-in” constructs.

That said, I think the main reason to be cautious about missing values with a payload is that you have to define semantics for all operations, e.g. Missing(1) + Missing(2). I don’t think there is a canonical way of doing this, so I think that not doing this in Base was the right choice: the details should be worked out in a package.
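A rough sketch of what such a user-defined type could look like. Everything here is an assumption for illustration: ReasonedMissing is a made-up name rather than anything in Base or Missings.jl, and the rule chosen for + is just one arbitrary option among many, which is exactly the problem:

```julia
# Hypothetical payload-carrying missing type (not part of Base).
struct ReasonedMissing
    why::Int
end

# Opt in to the existing generic function by adding a method.
Base.ismissing(::ReasonedMissing) = true

const NI   = ReasonedMissing(1)    # no information
const ASKU = ReasonedMissing(141)  # asked but unknown
const NASK = ReasonedMissing(142)  # not asked

# One possible, arbitrary propagation rule: keep the reason when both
# operands agree, otherwise fall back to "no information".
Base.:+(a::ReasonedMissing, b::ReasonedMissing) = a.why == b.why ? a : NI
# Combining with an ordinary number keeps the reason.
Base.:+(a::ReasonedMissing, ::Number) = a
Base.:+(::Number, b::ReasonedMissing) = b

println(ismissing(ASKU), " ", ASKU + NASK, " ", ASKU + 1.0)
```

Every other operation (comparison, multiplication, reductions, and so on) would need a similar decision, and two packages could reasonably pick different rules.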

Yes, there is no canonical way to propagate missing values, but I believe that the current implementation would return missing in all cases. In my example above that would equate to the default Missing(1) # No Information, which seems like a reasonable approach to me. But I can see why someone would like to be more sophisticated with respect to the propagation of missing values.

Your idea is interesting! It’s been discussed before. Having missing values that represent “don’t know”, “not asked” etc. would be really useful, particularly for the analysis of household surveys. I understand this is also really important with medical data.

Stata has this functionality with its .r and .d missing types, which are encoded as sentinel values.

I think this would be a good feature to have, but in the implementation it would be tough to keep performance.

Additionally, your implementation with sentinel values is a bit odd. If it were to be implemented, it would probably be via an abstract type:

abstract type AbstractMissing end
struct Missing <: AbstractMissing end
struct DontKnowMissing <: AbstractMissing end

Actually I had in mind the extension to typed missing values when we implemented missing. It’s just not been a priority, since it sounds essential to get the basic support right first.

There are two ways to implement “typed” or “flavoured” missing values in a fully generic way (i.e. without sentinels): with one type for each kind of missing value as @pdeffebach suggested, or as a single TypedMissing type holding an Int field indicating the kind (as an enum) as @longemen3000 suggested. The advantage of the former is that the compiler can optimize the storage and computation of such values, so that they would be as efficient as Missing. The drawback is that it’s going to recompile functions for each kind of missing value, which is costly if you use several kinds.
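The two designs can be sketched side by side. All names below (AbstractTypedMissing, NotAsked, AskedUnknown, MissingKind, the Kind... constants) are hypothetical and chosen only for illustration; TypedMissing is the name used above, but no such type exists in Base:

```julia
# Design 1: one singleton type per kind of missingness.
abstract type AbstractTypedMissing end
struct NotAsked     <: AbstractTypedMissing end
struct AskedUnknown <: AbstractTypedMissing end
Base.ismissing(::AbstractTypedMissing) = true

# Design 2: a single type holding the kind as an enum field.
@enum MissingKind KindNoInformation=1 KindAskedUnknown=141 KindNotAsked=142
struct TypedMissing
    kind::MissingKind
end
Base.ismissing(::TypedMissing) = true

# Design 1: the kind lives in the type, so instances carry no data,
# but each kind is a distinct type that methods get specialized on.
println(sizeof(NotAsked), " ", typeof(NotAsked()) == typeof(AskedUnknown()))

# Design 2: one concrete type for all kinds; the kind is a runtime value.
a = TypedMissing(KindNotAsked)
b = TypedMissing(KindAskedUnknown)
println(typeof(a) == typeof(b), " ", a.kind, " ", b.kind)
```

With the first design, a column mixing several kinds has an element type like Union{NotAsked,AskedUnknown,Float64}, which grows with the number of kinds; with the second it stays Union{TypedMissing,Float64} regardless of how many kinds are in use.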

Such a type (or types) could be defined in any package (Missings.jl would be a good home). Though currently there’s a limitation, as Julia Base uses x === missing instead of ismissing(x) in some places for performance reasons (see this issue), so Julia will have to be adapted a bit for all operations to work (notably == on arrays).

FWIW, there’s at least one standard providing a list of possible kinds/reasons for missingness: http://www.hl7.org/fhir/v3/NullFlavor/cs.html
