I read the excellent write-up on the new Missing type and the value missing that will be more fully adopted in Julia 0.7, replacing the use of Nullable arrays (yeah!). I see that there are many concerns with the use of special sentinel values to stand for missing in R and other languages: these values can get caught in value comparisons unintentionally or, worse, affect other mathematical operations (or impose a performance hit on functions that must special-case the sentinel value).
Given that, I cautiously ask something that might be very wrong: what would be so terrible about using NaN for missing in floating-point columns of a DataFrame? Yes, it’s a form of sentinel value. Yes, it might very rarely occur as the result of calculations (how rarely?). But it doesn’t require a special type or type union when the eltype is a floating-point type. It is already supported. Any function that supports Float64 (or Float32, etc.) will continue to work. It propagates. It can be tested for. One can filter it out. Its performance hit is either negligible or one we’re already used to, because we’ve already got it.
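To make the trade-off concrete, here is a minimal sketch comparing the two approaches, assuming Julia 0.7 or later (where missing and skipmissing live in Base). The variable names are only illustrative.

```julia
x = [1.0, NaN, 3.0]                  # NaN standing in for "not recorded"

total = sum(x)                       # NaN propagates through the reduction
clean = filter(!isnan, x)            # NaN can be tested for and filtered out

# Comparisons with NaN return a plain false, not "unknown"
nan_equal = (NaN == NaN)
n_positive = count(v -> v > 0, x)    # the NaN entry silently counts as "not positive"

# A legitimate calculation can also produce NaN, which is then
# indistinguishable from the "not recorded" entry above
undefined = 0.0 / 0.0

# The same data using missing
y = [1.0, missing, 3.0]
total_m = sum(y)                     # propagates as missing
total_skip = sum(skipmissing(y))     # sums only the values that are present
cmp = (missing == missing)           # missing, rather than false
```

So NaN does propagate and filter as described, but it cannot say whether a value was never recorded or was the result of an undefined calculation, and comparisons quietly answer false.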
A big argument against this naive idea (does it have enough merit to even warrant being an idea?) is that each type would need its own sentinel value, which makes testing for missing data across a row of heterogeneous types really strange. Missing should be missing whether string, Int, Float, or category. But with this hack, Int would need its own sentinel: integers have no NaN or Inf (-Inf is itself a floating-point value), so it would have to be something like typemin(Int). String is really tricky. “”, an empty string, is not necessarily missing: for some applications an empty string could reasonably signify missing, while in others it could be a perfectly valid value. Categorical data would need something that means “can’t use me” regardless of the defined categories. Bool would need yet another sentinel value. And on and on… So, maybe I’m out of luck on this.
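A small sketch of why per-type sentinels get awkward, again assuming Julia 0.7 or later. The sentinel choice (typemin(Int)) and the variable names are hypothetical, picked only for illustration.

```julia
# Hypothetical sentinel for a "missing" Int; illustrative only
int_sentinel = typemin(Int)

ages = [34, int_sentinel, 51]
bad_total = sum(ages)            # no error and no propagation: just a huge negative number
wrapped = int_sentinel - 1       # integer arithmetic wraps around to typemax(Int)

# -Inf is not available as an Int sentinel: it is a Float64
inf_type = typeof(-Inf)

# With missing, one test covers every column type in a heterogeneous row
row = (1.5, missing, "a", missing, true)
n_missing = count(ismissing, row)

ages_m = [34, missing, 51]
good_total = sum(skipmissing(ages_m))
```

Unlike NaN, an integer sentinel does not even propagate; it just produces a wrong number, which is arguably the worst of the outcomes listed above.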
Maybe I’ve talked myself out of this (and convinced no one). It just seems very “low level” or CS-y to require declarations of Union{Float64, Missing} (I do understand why allowing Missing should be optional, though not all posters concur). On the other hand, every time a new type is created, or its usage becomes much more prevalent, hundreds of new methods must be added to Base for mathematical, logical, and other kinds of functions. This creates a big burden for the maintainers of Base and many other packages.
R is sort of a language, but in other ways it is more of a statistics “application” that has a scripting language grafted on. That results in some serious awkwardness for R as a general-purpose programming language. But that is very desirable in some ways: a domain expert in epidemiology (not me) may be capable of writing some code and wants to do some analysis and modeling to support her post-doc work, but has no formal training in CS. She’s likely to pick up R. It seems very desirable that Julia should be just as accessible to her to accomplish her work, certainly as accessible as Python or R.
Union{Float64, Missing} as an explicit type declaration doesn’t seem so accessible. Perhaps under the hood it has to be, but it seems the DataFrames package (and perhaps Plots and other packages) could be more “app-like” in providing many default behaviors and encodings that “just work” with missing data. Perhaps that is indeed the intent behind Union{Int, Missing}, Union{String, Missing}, etc., in that DataFrames could have a simple declaration that “turns on” Missing (usemissing here is a made-up name, not an existing function):
using DataFrames
DataFrames.usemissing(true)
and
using Plots
Plots.usemissing(true)
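For what it’s worth, the union type often does not have to be typed out by hand. A minimal sketch using only Base, assuming Julia 0.7 or later; it makes no claim about what DataFrames or Plots do internally.

```julia
# No type annotation written: the element type is inferred from the literal
v = [1.0, missing, 3.0]
T = eltype(v)                                      # Union{Missing, Float64}

# An existing Float64 vector can be widened to accept missing later
w = [1.0, 2.0, 3.0]
w2 = convert(Vector{Union{Missing, Float64}}, w)   # makes a widened copy
w2[2] = missing

# And missing values can be replaced when plain numbers are needed again
filled = coalesce.(v, 0.0)                         # a plain Float64 vector
```

The explicit Union{Float64, Missing} spelling only shows up here in the conversion step, which a package could plausibly hide behind a friendlier function.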
This has been a bit of a non-technical, off-base ramble. It just seems that strong typing tends to position Julia into more and more of a geeky corner, even though so much of Julia is accessible and elegant, and for so many use cases explicit type declarations are not even needed. This approach to Missing technically solves many underlying issues, but for Julia learners it risks becoming just another syntactic gotcha; another source of potential errors and unexpected error messages; another concept unrelated to epidemiology (or some other domain that heavily uses math, coding, statistics, and modeling). I don’t know that there is any magic answer here, but I think Julia learners are a very important category of people to welcome. We are all Julia learners to various degrees.