Missingness versus Undefinedness

Has anyone in the Julia community tackled the issue of distinguishing missing data from undefined data, e.g. in the design of the DataFrames NA or the generic Nullable, or in some other context?

Specifically, I’m thinking of something like this. Suppose I have raw data from some study that is missing some values, and I compute some statistic on it. In a typical implementation, where a single NA value indicates both missingness and undefinedness, you might see data like this:

time  result
1     1
2     2
3     3
4     NA

Now let’s say I want to compute some statistic. The result is undefined when the missing value is included in the inputs.

time  result  rolling mean
1     1       1
2     2       1.5
3     3       2
4     NA      NA

My problem with this is that the NA in the column result means something different from the NA in column rolling mean. From a reproducibility and verification standpoint, the first NA really means “The data were known to be missing when collected; there was no error, mistake, undefined result, or other unexpected case that caused this value to be populated as NA. This is a known unknown.”

By contrast, the second NA really means “We don’t know anything about this value. We just have no way of reasoning about it. If we copied this table from some other source and loaded it into a database, we couldn’t say whether the NA was here because there was some error in the copying process and some value couldn’t be parsed, or whether it was intended to say ‘there is no applicable result for this value.’”
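
One way to make the distinction concrete is to give the two meanings separate marker types. What follows is only a minimal sketch: `KnownMissing`, `Undefined`, and `expanding_mean` are hypothetical names invented for illustration, and none of them exist in Base or DataFrames.jl. The statistic is computed as a cumulative mean so that it matches the numbers in the table above.

```julia
# Hypothetical marker types, for illustration only.
struct KnownMissing end   # recorded as absent when the data were collected
struct Undefined end      # a computation had no valid answer to give

const KNOWN_MISSING = KnownMissing()
const UNDEFINED = Undefined()

# Cumulative mean over the rows seen so far. Once any marker appears in the
# input, every later statistic is reported as UNDEFINED, never KNOWN_MISSING.
function expanding_mean(xs)
    out = Vector{Union{Float64, Undefined}}(undef, length(xs))
    total = 0.0
    tainted = false
    for (i, x) in enumerate(xs)
        if x isa KnownMissing || x isa Undefined
            tainted = true
        else
            total += x
        end
        out[i] = tainted ? UNDEFINED : total / i
    end
    return out
end

result = Any[1, 2, 3, KNOWN_MISSING]
stats = expanding_mean(result)
# The first three entries are 1.0, 1.5, 2.0; the fourth is UNDEFINED.
```

With this split, the input column still says “known to be missing” while the derived column says “no defined result,” so a reader of the second column cannot mistake a propagated gap for an original one.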

I encounter a similar situation often when working with raw data from data vendors. They sometimes use the word NULL or an empty string to mean something like “default case applies” or “false.” For example you might see

id   flag
1    NULL
2    NULL
3    TRUE
4    NULL

or

id   flag
1
2
3    TRUE
4

So in this case, both NULL and “” serve double duty, confusing whether they are there because “We don’t know what the value was supposed to be” or “We know what the value was supposed to be and it was supposed to be false / absence of evidence / etc.”
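
A hedged sketch of how one might avoid the double duty at load time: make the interpretation of a blank an explicit, per-vendor choice instead of a silent default. The names `Unknown`, `parse_flag`, and the keyword `blank_means` are hypothetical and exist only for this illustration.

```julia
# Hypothetical marker: the vendor gave us nothing we can interpret.
struct Unknown end

# `blank_means` encodes the vendor's documented convention, if there is one.
# Pass `false` when the vendor states that NULL or "" means false; leave the
# default when no such convention is documented.
function parse_flag(raw::AbstractString; blank_means = Unknown())
    s = uppercase(strip(raw))
    s == "TRUE"  && return true
    s == "FALSE" && return false
    (s == "" || s == "NULL") && return blank_means
    throw(ArgumentError("unrecognized flag value: $(repr(raw))"))
end

raw_flags = ["NULL", "", "TRUE", "NULL"]

# Vendor documents that a blank means false:
with_convention = [parse_flag(r; blank_means = false) for r in raw_flags]

# No documented convention, so blanks stay explicitly unknown:
without_convention = [parse_flag(r) for r in raw_flags]
```

The point of the keyword is that the ambiguity is resolved once, in a visible place, and anything that is neither a recognized value nor a blank raises an error instead of quietly becoming another NA.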

I’m wondering if anyone working on DataFrames.jl or on Julia in general has written a motivated case for defining, or not defining, a “Known to be unknown” value or type as distinct from a plain “Unknown Unknown” value or type, which is what Nullable{T} seems to be.

There has been discussion of this and of related ways of considering the epistemology of absence, and of which real-world distinctions in unawareness, ambiguity, and uncertainty carry the most computationally useful power. I recall reading a cogent take on this by @johnmyleswhite and a few others some years ago. As to the more general question, people have spent energy on it, and there is a small boatful of academic papers looking at various aspects of it with respect to data, comprehension, inference, et cetera. Even Boolean truth values are amenable to some kind of meta-interpretational relevance-valence: (False xor True) rox (Missing/Unobserved/Unknown/Unknowable aor DoNotCare/Irrelevant/Nonparticipating).

Do you have a link to the @johnmyleswhite cogent take (or, less preferably, any of the academic papers)? Thanks!

I do not recall the thread – someone else may.

Perhaps Jeffrey has this talk in mind: John Myles White, “What needs to be done to move JuliaStats forward” (YouTube)

I would actually not go down that route and instead defer to the ideas I gathered from a meeting with some of the C# developers about their experiences introducing missing values into a language: https://github.com/JuliaLang/julia/pull/19034#issuecomment-265931467

Although I’m very sensitive to the complexity behind these issues, I’m broadly ok with the idea that NULL should behave in Julia like it does in SQL. I believe that’s what we’re aspiring to now. My current thinking is here: https://github.com/JuliaLang/Juleps/pull/21

In the database world I had to deal with this for years, and frequently things are done poorly, confusing different states such as the empty string (i.e. known to not exist), undefined or missing (not yet initialized), known to be unknown, or invalid. I always ran into this in Spain, with databases that insisted that everyone have two last names, and with Spaniards in the US, where databases insist that people have a middle name (or initial), which is not common in Spain.

I do hope that @johnmyleswhite’s proposal can be included in time for v1.0, as it seems to be the best thought-out approach that I’ve seen so far for Julia.

@johnmyleswhite and @ScottPJones, this is all great and exactly what I was hoping for (a reasoned treatment of the considerations). Thanks! (And @jsarnoff, thanks for kicking it off.)