From what I read here, when you have an x of type Vector{Union{Missing,Float64}}, a second array of the same size is created, recording whether the value at position x[i] is missing or not. Using your approach would imply storing the extra information somewhere, with the associated cost.
The answer to your question is mentioned explicitly:
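The extra storage can be observed roughly as follows. This is a minimal sketch: Base.isbitsunion and Base.summarysize are internal or semi-internal helpers, and the exact byte counts depend on the Julia version, so only the direction of the difference is checked here.

```julia
# Element type that can hold either a Float64 or `missing`.
T = Union{Missing,Float64}
x = T[1.0, missing, 3.0]

# For such "isbits unions" the array stores the values inline
# plus a hidden selector byte per element.
isinline = Base.isbitsunion(T)

n = 1_000
plain = Vector{Float64}(undef, n)
withmissing = Vector{T}(undef, n)

# The difference is expected to be on the order of one byte per element.
overhead = Base.summarysize(withmissing) - Base.summarysize(plain)
```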
One of Julia’s strengths is that user-defined types are as powerful and fast as built-in types. To fully take advantage of this, missing values had to support not only standard types like Int, Float64 and String, but also any custom type. For this reason, Julia cannot use the so-called sentinel approach like R and Pandas to represent missingness, that is reserving special values within a type’s domain. For example, R represents missing values in integer and boolean vectors using the smallest representable 32-bit integer (-2,147,483,648), and missing values in floating point vectors using a specific NaN payload (1954, which rumour says refers to Ross Ihaka’s year of birth). Pandas only supports missing values in floating point vectors, and conflates them with NaN values.
In order to provide a consistent representation of missing values which can be combined with any type, Julia 0.7 will use missing, an object with no fields which is the only instance of the Missing singleton type. This is a normal Julia type for which a series of useful methods are implemented. Values which can be either of type T or missing can simply be declared as Union{Missing,T}. For example, a vector holding either integers or missing values is of type Array{Union{Missing,Int},1}:
@longemen3000 thank you for the link to the blog article. It seems an opportunity was lost. Where the 2nd array indicates that the value is missing, the corresponding value in the 1st array is uninitialized. That is a pity. That slot in the 1st array could have been used to indicate why the data is missing. Maybe this design could be revisited at some point in the future.
In my opinion there are some problems with storing a payload that way:
Supporting missing values in any type implies supporting them in structs of arbitrary bit length (as of Julia 1.3, there is a minimum length of 8 bits to make a type, but that could change in the future).
The missing propagation proposed here sounds a lot like NaN propagation. NaNs are generated by numerical errors, like log(-3.0). There it makes sense to create a NaN with a payload representing the error and to propagate it. As far as I know, Julia does not have this kind of NaN propagation, and I don’t know of any popular library that implements NaN propagation other than through sentinel values. Furthermore, NaN propagation can only be used with floats.
From what I understand, the missing concept is different from the NaN concept. The first is a notion of statistical missingness, whereas the presence of a NaN definitely indicates a numerical error somewhere in your program.
That said, an alternative implementation is still possible, where the payload lives in the remaining 7 bits of each byte of the hidden array (using 1 bit to represent whether the value is missing).
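The bit layout suggested above could look like the following sketch. It only illustrates packing a flag and a 7-bit reason code into one byte; the names are hypothetical, and this is not how Julia's runtime actually uses or exposes its hidden selector bytes.

```julia
# Hypothetical layout: top bit = "is missing", low 7 bits = reason code (0..127).
const MISSING_BIT = 0x80
const REASON_MASK = 0x7f

# Pack the flag and the reason code into a single UInt8.
pack(ismiss::Bool, reason::Integer) =
    (ismiss ? MISSING_BIT : 0x00) | (UInt8(reason) & REASON_MASK)

# Recover the two pieces from a packed byte.
ismissingflag(b::UInt8) = (b & MISSING_BIT) != 0x00
reasoncode(b::UInt8) = b & REASON_MASK

tag = pack(true, 5)   # missing, with reason code 5
```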
Not really: you can just define your own type for missingness with a payload, implement the relevant methods, and it will be just as efficient as an implementation in Base would be. You can of course take advantage of existing functions (e.g. ismissing) by defining methods for them. That’s a great thing about Julia: there is no “secret sauce”, user code is on an equal footing with “built-in” constructs.
That said, I think the main reason to be cautious about missing values with a payload is that you have to define semantics for all operations, e.g. Missing(1) + Missing(2). I don’t think there is a canonical way of doing this, so I think that not doing this in Base was the right choice: the details should be worked out in a package.
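To make the semantics question concrete, here is a minimal sketch of a user-defined missing type with a payload. PayloadMissing and its reason codes are hypothetical, and the rule chosen for + is just one of many possible conventions, not a canonical one.

```julia
# Hypothetical type, not part of Base or Missings.jl.
struct PayloadMissing
    reason::Int   # illustrative codes, e.g. 1 = no information, 2 = not asked
end

# Opt in to the generic predicate so code written against `ismissing` accepts it.
Base.ismissing(::PayloadMissing) = true

# One possible rule for `+`: a number plus a payload missing keeps the payload;
# two payload missings keep the payload only when the reasons agree,
# and otherwise fall back to plain `missing`.
Base.:+(m::PayloadMissing, ::Number) = m
Base.:+(::Number, m::PayloadMissing) = m
Base.:+(a::PayloadMissing, b::PayloadMissing) = a.reason == b.reason ? a : missing

v = Union{PayloadMissing,Float64}[1.5, PayloadMissing(2), 3.0]
nmissing = count(ismissing, v)
```

Every other operation (comparison, reductions, and so on) would need a similar decision, which is the part that seems better suited to a package.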
Yes, there is no canonical way to propagate missing values, but I believe that the current implementation would return missing in all cases. In my example above that would equate to the default Missing(1) # No Information, which seems like a reasonable approach to me. But I can see why someone would like to be more sophisticated with respect to the propagation of missing values.
Your idea is interesting! It’s been discussed before. Having missing values that represent “don’t know”, “not asked” etc. would be really useful, particularly for the analysis of household surveys. I understand this is also really important with medical data.
Stata has this functionality with its .r and .d missing types, which are encoded as sentinel values.
I think this would be a good feature to have, but it would be tough to implement without losing performance.
Additionally, your implementation with sentinel values is a bit odd. If it were to be implemented it would probably be via an abstract type
abstract type AbstractMissing end
struct Missing <: AbstractMissing end
struct DontKnowMissing <: AbstractMissing end
Actually I had in mind the extension to typed missing values when we implemented missing. It’s just not been a priority, since it sounds essential to get the basic support right first.
There are two ways to implement “typed” or “flavoured” missing values in a fully generic way (i.e. without sentinels): with one type for each kind of missing value as @pdeffebach suggested, or as a single TypedMissing type holding an Int field indicating the kind (as an enum) as @longemen3000 suggested. The advantage of the former is that the compiler can optimize the storage and computation of such values, so that they would be as efficient as Missing. The drawback is that it’s going to recompile functions for each kind of missing value, which is costly if you use several kinds.
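The two designs can be sketched side by side as follows. All type and enum names here are hypothetical, chosen only for illustration; neither design exists in Base or Missings.jl as far as this thread indicates.

```julia
# Design 1: one singleton type per kind of missing value.
abstract type AbstractMissing end
struct DontKnow <: AbstractMissing end
struct NotAsked <: AbstractMissing end
Base.ismissing(::AbstractMissing) = true

# Design 2: a single type whose field records the kind as an enum.
@enum MissingKind NoInformation DontKnowKind NotAskedKind
struct TypedMissing
    kind::MissingKind
end
Base.ismissing(::TypedMissing) = true

# Design 1 puts the kind in the element type, so the Union grows with each kind.
a = Union{DontKnow,NotAsked,Float64}[1.0, DontKnow(), NotAsked()]

# Design 2 keeps the Union small and stores the kind as data.
b = Union{TypedMissing,Float64}[1.0, TypedMissing(DontKnowKind), TypedMissing(NotAskedKind)]
```

In the first design the kind is visible to the compiler through dispatch (a[2] isa DontKnow), while in the second it is a runtime value (b[2].kind), which matches the trade-off between specialization and recompilation described above.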
That or those type(s) could be defined in any package (Missings.jl would be a good home). Though currently there’s a limitation as Julia Base uses x === missing instead of ismissing(x) in some places for performance reasons (see this issue), so Julia will have to be adapted a bit for all operations to work (notably == on arrays).
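The limitation can be seen with a small sketch. FlavouredMissing is a hypothetical user-defined type: the ismissing predicate can be extended by adding a method, whereas an identity comparison against missing cannot.

```julia
# Hypothetical custom missing type.
struct FlavouredMissing end
Base.ismissing(::FlavouredMissing) = true

x = FlavouredMissing()

viapredicate = ismissing(x)    # dispatches to the method defined above
viaidentity = x === missing    # identity check against the Base singleton
```

Code paths that test x === missing will therefore treat such a value as an ordinary, non-missing object.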