From what I read here, when you have an x = Vector{Union{Missing,Float64}}, a second array of the same size is created, recording whether the value at position x[i] is missing or not. Using the approach you propose would imply storing the extra information somewhere, with the corresponding cost.
The answer to your question is mentioned explicitly:
One of Julia’s strengths is that user-defined types are as powerful and fast as built-in types. To fully take advantage of this, missing values had to support not only standard types like Int, Float64 and String, but also any custom type. For this reason, Julia cannot use the so-called sentinel approach like R and Pandas to represent missingness, that is reserving special values within a type’s domain. For example, R represents missing values in integer and boolean vectors using the smallest representable 32-bit integer (-2,147,483,648), and missing values in floating point vectors using a specific NaN payload (1954, which rumour says refers to Ross Ihaka’s year of birth). Pandas only supports missing values in floating point vectors, and conflates them with NaN values.
In order to provide a consistent representation of missing values which can be combined with any type, Julia 0.7 will use missing, an object with no fields which is the only instance of the Missing singleton type. This is a normal Julia type for which a series of useful methods are implemented. Values which can be either of type T or missing can simply be declared as Union{Missing,T}. For example, a vector holding either integers or missing values is of type Array{Union{Missing,Int},1}:
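A minimal sketch of such a vector, for reference (this is not the blog post's own example, and the variable names are just for illustration):

```julia
# A vector literal mixing integers and `missing` gets a Union element type.
v = [1, missing, 3]

t = eltype(v)              # Union{Missing, Int64} on a 64-bit system
mask = ismissing.(v)       # elementwise check for missingness
total = sum(skipmissing(v))  # 4, missing entries are skipped
shifted = v .+ 1           # `missing` propagates through arithmetic
```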
@longemen3000 thank you for the link to the blog article. It seems an opportunity was lost. Where the 2nd array indicates that the value is missing, the corresponding value in the 1st array is uninitialized. That is a pity. That slot in the 1st array could have been used to indicate why the data is missing. Maybe this design could be revisited at some point in the future.
In my opinion, there are some problems with storing a payload using that method:
Supporting missings for any type implies supporting missings on structs with a variable number of bits (as of Julia 1.3, a type must be at least 8 bits long, but that can change in the future).
The missing propagation proposed here sounds a lot like NaN propagation. NaNs are generated on numerical errors, like log(-3.0). There it makes sense to create a NaN with a payload representing the error and propagate it. As far as I know, Julia does not have NaN propagation, and I don’t know of any implementation of NaN propagation in popular libraries apart from sentinel values. Furthermore, NaN propagation can only be used with floats.
From what I understand, the missing concept is different from the NaN concept. The first is a notion of statistical missingness, whereas the presence of a NaN definitely indicates a numerical error somewhere in your numerical program.
That said, an alternative implementation is still possible, where the payload lives in the remaining 7 bits of each entry of the hidden array (using 1 bit to represent true or false).
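To make that last idea concrete, here is a rough sketch of a container that keeps one tag byte per element, with the low bit as the missing flag and the other 7 bits as a reason code. TaggedVector and its helper functions are hypothetical names for illustration; this is not how Base actually stores Union{Missing,T} arrays.

```julia
# Hypothetical container: values plus one tag byte per element.
# Bit 0 of each tag is the "is missing" flag; bits 1-7 hold a reason code (0-127).
struct TaggedVector{T}
    values::Vector{T}
    tags::Vector{UInt8}
end

# Start with every element present (all tags zero).
TaggedVector(values::Vector{T}) where {T} =
    TaggedVector{T}(values, zeros(UInt8, length(values)))

ismissingat(v::TaggedVector, i) = (v.tags[i] & 0x01) != 0
reasonat(v::TaggedVector, i) = Int(v.tags[i] >> 1)

# Mark element i as missing and record why.
function setmissing!(v::TaggedVector, i, reason::Integer)
    0 <= reason <= 127 || throw(ArgumentError("reason must fit in 7 bits"))
    v.tags[i] = (UInt8(reason) << 1) | 0x01
    return v
end

tv = TaggedVector([1.5, 0.0, 2.5])
setmissing!(tv, 2, 3)   # element 2 is missing, with reason code 3
```

The value slot of a missing element is simply ignored here; the reason lives entirely in the tag byte.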
Not really — you can just define your own type for missingness with a payload, implement the relevant methods, and it will be equally efficient as an implementation in Base would be. You can of course take advantage of existing functions (eg ismissing, etc) by defining methods for them. That’s a great thing about Julia: there is no “secret sauce”, user code is on equal footing with “built-in” constructs.
That said, I think the main reason to be cautious about missing values with a payload is that you have to define semantics for all operations, eg Missing(1) + Missing(2). I don’t think there is a canonical way of doing this, so I think that not doing this in Base was the right choice: the details should be worked out in a package.
Yes, there is no canonical way to propagate missing values, but I believe that the current implementation would return missing in all cases. In my example above that would equate to the default Missing(1) # No Information, which seems like a reasonable approach to me. But I can see why someone would like to be more sophisticated with respect to the propagation of missing values.
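As a rough illustration of the points above, a user-defined missing type with a payload might look like the sketch below. PayloadMissing, the convention that code 1 means "no information", and the rule chosen for + are all assumptions made for this example; other rules are just as defensible, which is exactly the problem with picking one in Base.

```julia
# Hypothetical type, not part of Base or Missings.jl.
struct PayloadMissing
    code::Int   # assumed convention: 1 means "no information"
end

no_information() = PayloadMissing(1)

# Opt in to the existing generic predicate.
Base.ismissing(::PayloadMissing) = true

# One possible rule (an arbitrary choice): keep the reason when both operands
# agree, otherwise fall back to "no information".
Base.:+(a::PayloadMissing, b::PayloadMissing) =
    a.code == b.code ? a : no_information()

# Combining with a number keeps the payload.
Base.:+(a::PayloadMissing, ::Number) = a
Base.:+(::Number, b::PayloadMissing) = b

same = PayloadMissing(2) + PayloadMissing(2)    # PayloadMissing(2)
mixed = PayloadMissing(2) + PayloadMissing(3)   # PayloadMissing(1)
withnum = 1.0 + PayloadMissing(3)               # PayloadMissing(3)
```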
Your idea is interesting! It’s been discussed before. Having missing values that represent “don’t know”, “not asked”, etc. would be really useful, particularly for the analysis of household surveys. I understand this is also really important with medical data.
Stata has this functionality with the .r and .d missing types, which are encoded as sentinel values.
I think this would be a good feature to have, but it would be tough to maintain performance in the implementation.
Additionally, your implementation with sentinel values is a bit odd. If it were to be implemented, it would probably be via an abstract type:
abstract type AbstractMissing end
struct Missing <: AbstractMissing end
struct DontKnowMissing <: AbstractMissing end
Actually I had in mind the extension to typed missing values when we implemented missing. It’s just not been a priority, since it sounds essential to get the basic support right first.
There are two ways to implement “typed” or “flavoured” missing values in a fully generic way (i.e. without sentinels): with one type for each kind of missing value as @pdeffebach suggested, or as a single TypedMissing type holding an Int field indicating the kind (as an enum) as @longemen3000 suggested. The advantage of the former is that the compiler can optimize the storage and computation of such values, so that they would be as efficient as Missing. The drawback is that it’s going to recompile functions for each kind of missing value, which is costly if you use several kinds.
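The two designs can be sketched side by side as follows. All names here (AbstractMissing, NotAsked, DontKnow, MissingKind, TypedMissing, label) are hypothetical and only serve the illustration; nothing like this exists in Base.

```julia
# Design 1: one singleton type per kind of missingness.
# Methods are selected by dispatch, and compiled separately for each kind.
abstract type AbstractMissing end
struct NotAsked <: AbstractMissing end
struct DontKnow <: AbstractMissing end

label(::NotAsked) = "not asked"
label(::DontKnow) = "don't know"

# Design 2: a single type with a field holding the kind as an enum.
# Only one method is compiled; the kind is checked at run time.
@enum MissingKind NoInformation = 1 NotAskedKind = 2 DontKnowKind = 3

struct TypedMissing
    kind::MissingKind
end

label(m::TypedMissing) =
    m.kind == NotAskedKind ? "not asked" :
    m.kind == DontKnowKind ? "don't know" : "no information"

# A column under each design:
col1 = Union{Float64,NotAsked,DontKnow}[1.0, NotAsked(), DontKnow()]
col2 = Union{Float64,TypedMissing}[1.0, TypedMissing(NotAskedKind)]
```

With design 1 the element type of a column grows with every kind you use, while design 2 keeps a two-member Union no matter how many kinds exist.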
That or those type(s) could be defined in any package (Missings.jl would be a good home). Though currently there’s a limitation as Julia Base uses x === missing instead of ismissing(x) in some places for performance reasons (see this issue), so Julia will have to be adapted a bit for all operations to work (notably == on arrays).
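A small sketch of why that limitation matters, assuming a hypothetical user-defined type SurveyRefused that opts in to ismissing: code written with x === missing will not see it, while code written with ismissing(x) will.

```julia
# Hypothetical user-defined missing-like type.
struct SurveyRefused end
Base.ismissing(::SurveyRefused) = true

# Two ways a library function might count missing entries.
count_by_identity(xs) = count(x -> x === missing, xs)   # only sees Base's `missing`
count_by_predicate(xs) = count(ismissing, xs)           # sees any type that opts in

answers = Any[1, missing, SurveyRefused()]

n_identity = count_by_identity(answers)    # 1
n_predicate = count_by_predicate(answers)  # 2
```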
FWIW, there’s at least one standard providing a list of possible kinds/reasons for missingness: NullFlavor - FHIR v4.0.1
I’d like to bump this 3-year-old comment for visibility.
There have been multiple instances where I’ve wanted functions to dispatch on special cases in missing. Variables that are missing due to being out-of-sample (such as for unbalanced panels) are one such case.
If
abstract type AbstractMissing end
struct Missing <: AbstractMissing end
and all functions currently defined on ::Missing were instead defined on ::AbstractMissing, other code I’ve written would have been much simpler. As things are, I need to either commit type piracy (a big no-no) or create a new type and manually extend many other functions that currently handle missings.
Can we put this on the todo list for future versions of Julia? It should be a non-breaking change given how multiple dispatch works.
I would be happy to make a new type: struct MyMissingType <: AbstractMissing end
and then extend methods for only the few relevant special cases for this type.
Using Holy Traits and defining functions on abstract types makes extension much simpler.
If, for example, I wanted skipmissing to work with MyMissingType as it does with the default Missing, this would be easy if the helper functions for skipmissing were defined on AbstractMissing. I would have to do nothing at all.
But because there is no AbstractMissing type, I would need to reimplement all functions currently defined on Missing to also work with MyMissingType. This is time-consuming and results in a large amount of boilerplate. In practice, I do not reimplement all functions, which means that MyMissingType does not support all the operations that Missing does. Lazy.jl can help automate this, but in my experience it is not terribly reliable and is not as clean as simply subtyping an abstract type. These problems are acceptable if I am writing code I will not share with others, but they are not so great if I intend to write a general-use package. More generally, it is not consistent with Julia’s multiple dispatch paradigm (and its “unreasonable effectiveness”) for me to write an independent missingness type hierarchy when Base Julia already supports missingness. It’s much better to subtype an existing abstract type.
The power of multiple dispatch is one of the things I love about Julia. Unfortunately, when Base Julia creates concrete types that are not subtypes of abstract types (or does not define generic functions in terms of the abstract type), a lot of the power of multiple dispatch is lost.
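To illustrate the argument, here is a self-contained sketch using a hierarchy defined entirely in user code. Base does not define AbstractMissing, so PlainMissing stands in for Missing, and the helper names (is_missing_like, skip_missing_like, describe) are made up for this example.

```julia
# Hypothetical hierarchy; Base.Missing is NOT a subtype of anything like this.
abstract type AbstractMissing end
struct PlainMissing <: AbstractMissing end   # stand-in for Base's Missing
struct OutOfSample <: AbstractMissing end    # e.g. gaps in an unbalanced panel

# Generic helpers written once, against the abstract type.
is_missing_like(x) = x isa AbstractMissing
skip_missing_like(itr) = Iterators.filter(!is_missing_like, itr)

# Default behaviour for every kind, plus one special case.
describe(::AbstractMissing) = "missing"
describe(::OutOfSample) = "out of sample"

panel = Any[1.0, PlainMissing(), 2.0, OutOfSample()]

# OutOfSample is skipped without writing any method for it.
panel_total = sum(skip_missing_like(panel))   # 3.0
```

Only describe needed a method for the new type; everything written against AbstractMissing works for it automatically.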