Missing or NaN

I read the excellent write-up on the new Missing type and the value missing that will be more fully adopted in Julia 0.7, replacing the use of Nullable arrays (yeah!). I see that there are many concerns with R’s and other languages’ use of special sentinel values to stand for missing, because these values can get caught in value comparisons unintentionally or, worse, even affect other mathematical operations (or cause a performance hit for functions that must have special-case tests for the sentinel value).

Given that, I cautiously ask something that might be very wrong. What would be so terrible about using NaN for missing in floating point columns of a DataFrame? Yes, it’s a form of sentinel value. Yes, it might very rarely occur as the result of calculations (how rarely?). But it doesn’t require a special type or type union when eltype is Float. It is already supported. Any function that supports Float64 (32, etc…) will continue to work. It propagates. It can be tested for. One can filter it out. Its performance hit is either negligible or one we’re already used to (because we’ve already got it).
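
For concreteness, the NaN behaviours listed above (propagating, testing, filtering) look roughly like this in plain Julia. This is a minimal sketch using only Base; the sample vector is made up for illustration.

```julia
# Hypothetical column where NaN stands in for a missing reading
x = [1.0, NaN, 3.0]

# NaN propagates through arithmetic and reductions
total = sum(x)              # NaN

# It can be tested for, element by element
mask = isnan.(x)            # [false, true, false]

# ...and filtered out before reducing
clean = filter(!isnan, x)   # [1.0, 3.0]
clean_total = sum(clean)    # 4.0

# But NaN never compares equal to anything, itself included
nan_equal = (NaN == NaN)    # false
```

The last line is the kind of comparison surprise the write-up warns about with sentinel values.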

A big argument against this naive idea (does it have enough merit to even warrant being an idea?) is that each type will need its own sentinel value, which makes testing for missing data across a row of heterogeneous types really strange. Missing should be missing whether string, Int, Float, or category. But with this hack Int would need -Inf. String is really tricky. “”, an empty string, is not necessarily missing. For some applications an empty string could reasonably signify missing. In other applications the empty string could be a perfectly valid value. Categorical data would need something that means “can’t use me” regardless of the defined categories. Bool would need yet another sentinel value. And on and on… So, maybe I’m out of luck on this.

Maybe I’ve talked myself out of this–and convinced no one. It just seems very “low level” or CS-y to require declarations of Union{Float64, Missing} (I do understand why allowing Missing should be optional, though not all posters concur). On the other hand, every time a new type is created–or its usage becomes much more prevalent–hundreds of new methods must be added to Base for every mathematical, logical, and other kind of function. This creates a big burden for the maintainers of Base and many other packages.

R is sort of a language, but in other ways it is more of a statistics “application” that has a scripting language grafted on. That results in some serious awkwardness for R as a general purpose programming language. But that is very desirable in some ways: a domain expert in epidemiology (not me) may be capable of writing some code and wants to do some analysis and modeling to support her post-doc work, but has no formal training in CS. She’s likely to pick up R. It seems very desirable that Julia should be just as accessible to her to accomplish her work–certainly as accessible as Python or R.

Union{Float64, Missing} as an explicit type declaration doesn’t seem so accessible. Perhaps under the hood it has to be, but it seems the DataFrames package (and perhaps Plots and other packages) could be more “app-like” in providing many default behaviors and encodings that “just work” with missing data. Perhaps that is, indeed, the intent behind Union{Int, Missing}, Union{String, Missing}, etc., in that DataFrames could perhaps have a simple declaration that “turns on” Missing:

using DataFrames
DataFrames.usemissing(true)

and

using Plots
Plots.usemissing(true)

This has been a bit of a non-technical, off-base ramble. It just seems that strong typing tends to position Julia into more and more of a geeky corner even though so much of Julia is accessible and elegant–and for so many use cases explicit type declarations are often not even needed. This approach to Missing technically solves many underlying issues, but raises the risk, for Julia learners, of becoming just another syntactic gotcha; another source of potential errors and unexpected error messages; another concept unrelated to epidemiology (or some other domain that heavily uses math, coding, statistics, modeling). I don’t know that there is any magic answer here… but I think Julia learners are a very important category of people to welcome. We are all Julia learners to various degrees.

2 Likes

Not rarely. Pretty much any operation which doesn’t have a true answer (or limit) creates a NaN, like 0/0. Those show up all of the time in code that has bugs, and so keeping NaN separate from missing is crucial to keeping code safe.

One bug I’ve seen in code which uses NaN for missing is that in the analysis you always do “drop missing” for the mean/variance/etc. tests. If you use NaN and “drop NaNs”, you can pass all of your tests even when half of your simulations prematurely exit due to 0/0 -> NaN issues. You meant “drop NaNs” to drop all of the test cases with missing data, but in reality your tests drop the cases with missing data AND every case where your algorithm fails, so your algorithm “succeeds” by essentially a form of accidental rejection sampling that drops every tough case. Ouch. While at face value this can look okay and still pass strict tests, it can lead to some very subtle biases in the solution your algorithm produces, since you are doing a biased rejection sampling which you didn’t intend.

This happened to me once as an undergrad using R. I just dropped NaNs and NAs all willy-nilly and then I couldn’t find out why my simulation was giving distributions that were mostly correct but incorrect at the edges. Turns out the NaNs were meaningfully different from the NAs and were indicative of a bug, and the real fix was to only drop NAs and fix the bug causing NaNs. Luckily, I don’t think my code suffered from the “bug” that NA + NaN = NaN while NaN + NA = NA (or vice versa). But I learned my lesson.
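
The failure mode described above can be reproduced in a few lines. This is a sketch with made-up numbers, assuming Julia 0.7 or later with the Statistics standard library; `results` is a hypothetical vector of simulation outputs in which one entry was never observed (missing) and one is the product of a 0/0 bug (NaN).

```julia
using Statistics

# Hypothetical simulation outputs: one value genuinely missing, one from a 0/0 bug
results = [1.0, 2.0, missing, 0/0, 3.0]

# Dropping only the missing values keeps the NaN, so the bug stays visible
observed = collect(skipmissing(results))
mean_observed = mean(observed)           # NaN

# Treating NaN as "missing" too silently discards the failed run
survivors = filter(!isnan, observed)
mean_survivors = mean(survivors)         # 2.0
```

With separate values, the NaN mean is a loud signal that something upstream failed; with a shared sentinel, the second computation is the only one available.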

8 Likes

Compared to what? Keep in mind that one of the key features of Julia is that it makes generic code very easy to write. The few other languages that provide something similar tend to be very complex (think C++).

So, for example, your domain expert in epidemiology can code up a likelihood function for a model just assuming that the input data is Union{Missing, Real}. Then, if someone is interested in e.g. identification, they can use ForwardDiff, which would feed ForwardDiff.Dual numbers in place of Float64, and hopefully it will “just work”.
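
As a rough illustration of that kind of generic code, here is a sketch of a normal log-likelihood that skips missing observations and never mentions Float64. The function name `loglik` and the data are hypothetical, and ForwardDiff itself is not loaded here; the point is only that nothing in the body is tied to one number type, which is what would let other Real subtypes flow through.

```julia
# Log-likelihood of i.i.d. Normal(mu, sigma) data, ignoring missing observations.
# Generic in the number type of the parameters (any subtype of Real).
function loglik(mu::Real, sigma::Real, data::AbstractVector{<:Union{Missing,Real}})
    ll = zero(float(mu))
    for y in skipmissing(data)
        z = (y - mu) / sigma
        ll += -log(sigma) - log(2 * oftype(z, pi)) / 2 - z^2 / 2
    end
    return ll
end

data = [0.5, missing, -0.5]
ll64 = loglik(0.0, 1.0, data)               # Float64 parameters
llbig = loglik(big"0.0", big"1.0", data)    # BigFloat parameters, same code
```

Here BigFloat stands in for "some other number type"; whether a given autodiff package works on this function unchanged is something to check rather than assume.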

This is a great advantage, and should be well worth the extra effort of learning about a new concept. That said, learning Julia does entail learning some CS concepts compared to, say, R. The manual is written in a way to provide a gentle introduction, and hopefully as the language matures there will be many more sources such as books and blog posts.

2 Likes

I welcome the introduction of Missing, although I must admit that NaNs have been good enough for what I do (econometrics, financial calculations). I still have not decided whether to switch over or not.

Union{Float64, Missing} etc. is pretty daunting for many users. We clearly need a bunch of convenience routines for easily setting up and manipulating arrays with missings, without having to go through DataFrames. Do we already have that?

1 Like

String’s not the only tricky one. -Inf doesn’t exist for integers, only for floating point numbers. You can make it:

> typemin(Int)
-9223372036854775808

But then when you add that to something else you can get overflow:

> typemin(Int) - 10
9223372036854775798

So you can say anything with a sufficiently large norm is a missing, but then if you add enough values you’ll accidentally create fake missing values. Even worse, arithmetic between two missings defined like this will give you a non-missing value, since for example

> (typemin(Int) - 10) - typemin(Int)
-10

So I don’t think there is a clear definition of a sentinel for integers.
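
For contrast, a short sketch of how the same arithmetic behaves with missing (Julia 0.7 or later), where there is no bit pattern to overflow into or out of:

```julia
# A sentinel built from typemin(Int) wraps around and stops looking "missing"
sentinel = typemin(Int)
wrapped = sentinel - 10           # wraps to a large positive number
cancelled = wrapped - sentinel    # -10, an ordinary-looking value

# missing is a separate value of its own type, so arithmetic cannot escape it
a = missing - 10                  # missing
b = missing - missing             # missing
```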

What basically happened here, and in your stream of consciousness, is that you ended up thinking that you need an extra bit to represent missingness in order to do it correctly, which is what the blog post explains is being done:

The second improvement consists in using a compact memory layout for Array object whose element type is a Union of bits types, i.e. immutable types which contain no references (see the isbits function). This includes Missing and basic types such as Int, Float64, Complex{Float64} and Date. When T is a bits type, Array{Union{Missing,T}} objects are internally represented as a pair of arrays of the same size: an Array{T} holding non-missing values and uninitialized memory for missing values; and an Array{UInt8} storing a type tag indicating whether each entry is of type Missing or T.

This layout consumes slightly more memory than the sentinel approach, as the type tag part occupies one byte for each entry. But this overhead is reasonable: for example, the memory usage of an Array{Union{Missing,Float64}} is only 12.5% higher than that of an Array{Float64}. Compared with the sentinel approach, it has the advantage of being fully generic (as we have seen above). Actually, this mechanism can be used in other situations, for example with Union{Nothing,Int} (which is the element type of the array returned by indexin in Julia 0.7).

The rest of missing is just an interface over this idea. I think the interface can and will get improvements over time, but the general idea of how it should be handled at the compiler level and the array-representation level is solved: the rest is just writing some nice Julia code to make the abstraction easier to use.
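
The 12.5% figure in the quoted passage is simple arithmetic: one tag byte per element divided by the element size. A small sketch (the helper name `tag_overhead` is made up for illustration) that also shows the indexin case the post mentions:

```julia
# One UInt8 type tag per element, relative to the size of the element itself
tag_overhead(T) = sizeof(UInt8) / sizeof(T)

overhead_f64 = tag_overhead(Float64)   # 0.125, i.e. 12.5%
overhead_i32 = tag_overhead(Int32)     # 0.25, smaller elements pay relatively more

# indexin returns an array whose element type is a Union with Nothing
idx = indexin([1, 5], [1, 2, 3])       # position of each element, or nothing
```

This only restates the layout described in the blog post; it does not measure actual allocations.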

Also, there were/are some ideas to make Float64? equivalent to Union{Float64,Missing}.

I think there might be cases where NaN is just sufficient as a missing value.
My package NormalizeQuantiles.jl (GitHub: oheil/NormalizeQuantiles.jl), which implements quantile normalization, uses NaN as a general placeholder for any missing number. But there are some constraints on the algorithms:

  • number of missing values in the data should be small and of random nature

and the result of the algorithm is always an Array{Float64,2}, which makes it straightforward to use NaN.

So a special case of a valid usage of NaN as “missing” somehow proves the need for Missing, otherwise it wouldn’t be a special case :grin:

You would lose the ability to distinguish between a (floating point) value that is not representable and one that is not available. This is the same as in an R data.frame with NaN and NA.

I would claim that she is currently likely to choose R because the statistics packages are broad and often top notch, the scope of the environment (packages) is much larger, all her colleagues use R (or Python), and there are books which cover all standard operations. And that’s fine. I don’t see accessibility as the reason. On the contrary, for me Julia is more coherent and the syntax has an almost textbook-like quality (as you say, explicit typing is optional). She might choose Julia in the future if she needs the performance, has higher demands for code quality, and the Julia environment (packages) has grown a bit more.

What don’t you like in Missing? If you use R you also must know about NAs? And such NAs can be tricky, see e.g. this post invalidated again by this with NA^0 giving 1 or NA || TRUE giving TRUE. Without CS training a bit unexpected… :wink: – For me Missing is a ‘dead-easy’ concept, either we have a value or it is missing; much simpler than the former Nullable types (we agree here).
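
For comparison, Julia’s missing follows the same three-valued logic for the non-short-circuiting operators | and &, so a result like “unknown OR true is true” is not unique to R. A minimal sketch (Julia 0.7 or later):

```julia
# When the other operand already decides the answer, missing does not propagate
t = true | missing      # true
f = false & missing     # false

# Otherwise the result is unknown
u = true & missing      # missing
v = false | missing     # missing
```

Note that these are the elementwise-style operators; the short-circuiting || and && expect a Bool on the left and are not a drop-in equivalent here.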

1 Like

Same scenario here in cancer research. R packages are widely used; in this sense they are accessible. But it’s not always well understood what the package is doing in detail; in this sense accessibility has failed.

Special details like NA/NaN are just ignored in general. If you ask (bio-)informaticians or the typical R-using biologist about NAs, there is no awareness of them. They will stumble upon them using mean(big.list), solve it using mean(big.list,na.rm=TRUE), and that’s it. I am not sure if this is a good or a bad thing.

For Julia I guess it is a big thing just because the type concept is predominant.

Missing values are documented at the very beginning of the R intro manual, in the second (and first substantive) chapter. They show up similarly in most books about R, and most functions which do something special about NAs talk about them (eg mean). I don’t know about bioinformaticians, but pretty much anyone who reads any documentation about R encounters the concept, it is very hard to avoid.

I would say that the difficulty about NA in R is not user awareness, but

  1. semantics (even seasoned R users run into bugs which involve obscure corner cases),
  2. the need to special-case implementations (eg most of mean.default is branches dealing with NA, before calling the internal C routine, this gets worse with multi-argument functions),
  3. lack of lifting mechanisms for NA, but this is rarely noticed because very few R programmers use structures, and R does not have generic vectors.

After a lot of exploration and dedicated hard work by developers, I would say that Julia’s Missing solves all of these problems very nicely. Since using Julia means learning about the type system anyway, I think the small extra price of understanding Union{Missing,T} is worth it.

4 Likes

Maybe I am missing something (;-), but could we not just include a number of “aliases” like
MFloat64 = Union{Float64,Missing}
by default. Using MFloat64 as a type if data can be missing does not seem so daunting, and everything should just work, right …

Precisely, this is the kind of stuff we need.

And perhaps also
(1) zerosM(dims) = zeros(MFloat64,dims) and a few other array constructors
(2) a simple way of dropping all rows of a matrix if there is a missing in the row, which would be something like doing v = any(ismissing.(x),dims=2) followed by x[.!v,:], but hopefully there is a smarter way
(3) …please add your favourite
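
Both suggestions can be sketched with what Base already provides (Julia 1.0 or later). The names `MFloat64`, `zerosM`, and `dropmissingrows` are hypothetical, taken from or modelled on this thread rather than any package. Note that the mask returned by any(...; dims=2) is an n×1 matrix, so it is flattened with vec before being used as a row index.

```julia
# Alias and constructor as proposed in this thread (not part of Base)
const MFloat64 = Union{Float64,Missing}
zerosM(dims...) = zeros(MFloat64, dims...)

# Keep only the rows of a matrix that contain no missing entry
function dropmissingrows(x::AbstractMatrix)
    hasmissing = vec(any(ismissing, x; dims=2))  # one Bool per row
    return x[.!hasmissing, :]
end

x = zerosM(3, 2)
x[2, 1] = missing
y = dropmissingrows(x)    # 2×2 result, built from rows 1 and 3 of x
```

The result keeps the Union element type; converting it back to a plain Float64 array would be a separate step.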

I think there was some talk about having Int64? === Union{Missing, Int64}. But it may not be the best use of that punctuation.

You are right that the language has all the facilities needed to make the new approach a bit more accessible: type aliases!

Thanks, all. Everyone has been very constructive.

It’s pretty clear that the sentinel idea can’t work unless the only data you care about is Float, because NaN is, at best, “ok”. For other types there is no easy way.

I thought I was clear that the underlying Union{Real, Missing} (etc., for other types) is substantively an improvement over sentinel values used by other languages, as the great writeup on Missing in Julia 7.0 made very clear.

I heartily agree that Julia is often much more accessible. It is just painful when I have to go back to np.array([[…], […]]) in Python. I can’t even do it right without referring to a cheatsheet. I would really like to see more researchers, analysts, engineers, and other quantitative domain experts enjoy working in Julia.

Of the many helpful comments I thought that using type aliases was most apropos. We need several, but a variety of package authors–including the maintainers of Base–might agree on a set of these that could become convention, as there is great flexibility in Julia to do things “your way”.

The choice of these aliases is somewhat arbitrary as obviousness is in the eye of the beholder. They might be:
Mreal
Mstring
Mbool
Mcategory for Union{Missing, CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}}} ==> wow, this one is messy and NOT general
etc.

There might be a convenience function in the Missings package like:

Missings.setmissing("all")
Missings.setmissing("number")

for just a few–or another complication/opaqueness is introduced–to create the conventional type aliases when one wishes to use missing values with various types. Said convenience function for categorical data could be in CategoricalArrays–this user knows she wants categorical data. It could be an argument of the constructor for a new set of category values as in:

cv = CategoricalArray(v, missing=true)

Note this is only to make it easier to use missing regardless of whether using DataFrames, or JuliaDB, or just arrays. There is no regret for the passing of NullableArrays or for the performance improvements that make Union{Real, Missing} more appealing. Anyone “skilled in the art” can just give these “simplifications” a pass and use the explicit type declarations needed. The great thing is having a solution superior to what we had in <= 0.6.x.

How can I help contribute something actually helpful without creating some kind of mess?

Wouldn’t it be much nicer with something general, like T? that works for any T, as has been mentioned in the past? I’m not a fan of a bunch of arbitrary MFloat etc.

Are there any other candidates for what T? could mean?

2 Likes

Union{T, Nothing} would also be quite useful (it’s the return type of a few Base functions like findfirst).
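
A quick sketch of that pattern with findfirst (Julia 0.7 or later), where nothing signals “not found” rather than “unknown”:

```julia
v = [10, 20, 30]

hit = findfirst(isequal(20), v)    # 2
miss = findfirst(isequal(99), v)   # nothing

# Callers are expected to handle the nothing case explicitly
label = miss === nothing ? "not found" : "found at index $miss"
```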

1 Like

That might not be so easy… The “great writeup on Missing in Julia 7.0” by which you probably mean the First-Class Statistical Missing Values Support in Julia 0.7 blog post linked above by Chris already covered some/most of the issues mentioned here. I.e. “We are fully aware that Union{Missing,T} is quite verbose for those using missing values in daily work. The T? syntax has been discussed as a compact alternative…”. But then “it is not clear yet whether it would be more appropriate to attribute this syntax to Union{Missing,T} or to Union{Nothing,T}. It is therefore currently reserved waiting for a decision.”

As hinted in the blog under ‘Acknowledgements’ this was a multi-year effort by many and at this time I believe we just have to wait for 1.0 proper and then, should there be rough missing edges, point those out, try to help etc.

1 Like

Not quite as nice, but I tend to include in all my code touching missing values the alias: Maybe{T} = Union{Missing,T}. If you wanted maximum compactness with this approach, you could even do M{T} = Union{Missing,T}. Then the aliases discussed above go from MReal to M{Real}, which isn’t bad at all IMO.
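
A sketch of how that alias reads in practice. `Maybe` here is just the user-defined alias from this post, not something exported by Base, and `countobserved` is a made-up helper.

```julia
# Parametric alias: Maybe{T} is exactly Union{Missing,T}
const Maybe{T} = Union{Missing,T}

same = Maybe{Float64} === Union{Missing,Float64}   # true

# The alias works anywhere a type does, including method signatures
countobserved(x::AbstractVector{<:Maybe{Real}}) = count(!ismissing, x)

n = countobserved([1.5, missing, 2.5])   # 2
m = countobserved([1, 2, 3])             # 3, plain vectors also match
```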

4 Likes

I don’t think it’s quite true. Or if it is, it’s a user problem (not reading the documentation) instead of a language issue. What I like in the Union{T,Nothing} etc. syntax is that it forces you to think about the missing etc. values. I think it gives more control to the user, which should ultimately be the goal of a language.

1 Like

Yes, my statement was a little bit offhand. Clearly they know about NA, but there is no awareness of the special issues and differences which come with it, like the difference between NA and NaN. Talking only about me now (to not offend anybody): only through working with Julia did I start to develop some awareness of this, first when I found out that NaN is a Float but there is nothing similar for Int. Before, there was only NA or NaN, it didn’t matter which, or it was null; fine with me, nothing to bother with. That’s what I meant by “no awareness”.