Has anyone in the Julia community tackled the issue of distinguishing missing data from undefined data, e.g. in the design of the DataFrames NA or the generic Nullable, or in some other context?
Specifically, I’m thinking of something like this. Suppose I have raw data from some study that is missing some values, and I compute some statistic on it. In a typical implementation, where a single NA value indicates both missingness and undefinedness, you might see data like this:
time  result
1     1
2     2
3     3
4     NA
Now let’s say I want to compute some statistic on the result column, such as a rolling mean. The statistic is undefined whenever the missing value is among its inputs:
time  result  rolling mean
1     1       1
2     2       1.5
3     3       2
4     NA      NA
My problem with this is that the NA in the column result means something different from the NA in the column rolling mean. From a reproducibility and verification standpoint, the first NA really means “The data were known to be missing when collected; there was no error, mistake, undefined result, or other unexpected case that caused this value to be populated as NA. This is a known unknown.”
By contrast, the second NA really means “We don’t know anything about this value. We just have no way of reasoning about it. If we copied this table from some other source and loaded it into a database, we couldn’t say whether the NA was here because there was some error in the copying process and some value couldn’t be parsed, or whether it was intended to say ‘there is no applicable result for this value.’”
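To make the distinction concrete, here is a minimal sketch of what I have in mind, using only base Julia. The types KnownMissing and Undefined and the function expanding_mean are names I made up for illustration; none of them exist in DataFrames.jl or in Base. The values 1, 1.5, 2 in my table are a running mean over all rows so far, so that is what the sketch computes.

```julia
# Hypothetical sentinel types, invented for this example only.
struct KnownMissing end   # recorded as absent at collection time (a "known unknown")
struct Undefined end      # produced because a computation had no defined result

const KNOWN_MISSING = KnownMissing()
const UNDEFINED = Undefined()

# Running mean over rows 1..i. Once a KnownMissing input has been seen,
# the output is Undefined instead of repeating the input's marker.
function expanding_mean(xs)
    out = Vector{Union{Float64, Undefined}}(undef, length(xs))
    total = 0.0
    tainted = false
    for (i, x) in enumerate(xs)
        if x isa KnownMissing
            tainted = true
        else
            total += x
        end
        out[i] = tainted ? UNDEFINED : total / i
    end
    return out
end

result = Union{Int, KnownMissing}[1, 2, 3, KNOWN_MISSING]
means = expanding_mean(result)
```

With this, the fourth entry of result is a KnownMissing while the fourth entry of means is an Undefined, so a reader of the table can tell which marker came from data collection and which came from the computation.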
I often encounter a similar situation when working with raw data from data vendors. They sometimes use the word NULL or an empty string to mean something like “default case applies” or “false.” For example, you might see
id  flag
1   NULL
2   NULL
3   TRUE
4   NULL
or
id  flag
1
2
3   TRUE
4
So in this case, both NULL and “” serve double duty, leaving it unclear whether they are there because “We don’t know what the value was supposed to be” or “We know what the value was supposed to be and it was supposed to be false / absence of evidence / etc.”
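As a rough illustration of the ambiguity, here is a sketch in base Julia where the reading of NULL and “” has to be chosen explicitly at parse time. The BlankPolicy enum, the Unknown type, and parse_flag are hypothetical names of my own, not part of any package, and I am assuming the vendor spells its values TRUE and FALSE.

```julia
# Hypothetical policy: how should NULL or an empty string be read?
@enum BlankPolicy BlankIsFalse BlankIsUnknown

# Hypothetical marker for "we do not know what the value was supposed to be".
struct Unknown end

function parse_flag(raw::AbstractString, policy::BlankPolicy)
    s = strip(raw)
    if s == "TRUE"
        return true
    elseif s == "FALSE"
        return false
    elseif s == "" || s == "NULL"
        # The same raw text yields different values depending on the policy.
        return policy == BlankIsFalse ? false : Unknown()
    else
        error("unrecognized flag value: $raw")
    end
end

flags = ["NULL", "NULL", "TRUE", "NULL"]
as_default = [parse_flag(f, BlankIsFalse) for f in flags]
as_unknown = [parse_flag(f, BlankIsUnknown) for f in flags]
```

The same four raw values produce either three false entries or three Unknown entries, and nothing in the file itself tells me which reading the vendor intended.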
I’m wondering if anyone working on DataFrames.jl or on Julia in general has written some sort of motivated case for defining, or not defining, a “Known to be unknown” value or type as distinct from a plain “Unknown Unknown” value/type, which is what Nullable{T} seems to be.