Missing or NaN Data in GLM (e.g., in DataFrame, @formula)


#1

If a series contains missing or NaN values, R's lm command ignores the affected observations. This is convenient when one runs several models whose sets of usable observations differ, depending on whether a series with missing values is included. (R is somewhat inconsistent, in that functions other than lm, like mean, do not skip missing values by default.)

How should end users deal with missing values in series? This is especially pertinent when they want to use formulas on data frames. Is the expectation that they create a new DataFrame for each lm?

advice appreciated.


#2

Missings.jl offers some convenience functions for dealing with missing values. I think the general consensus is that you are responsible for handling them on your own. So if you try to take the mean of something that has missing values, the result is missing by default, but you can use the provided functionality (like skipmissing) when passing arguments to easily remove the missings.
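As a minimal sketch of the skipmissing pattern (assuming a current Julia version, where skipmissing lives in Base and mean in the Statistics standard library; the variable names are only for illustration):

```julia
using Statistics

x = [1, missing, 3]

# Without any handling, the missing value propagates into the result.
m_all = mean(x)

# skipmissing wraps x in an iterator that omits the missing entries.
m_skip = mean(skipmissing(x))
```

Here m_all is missing, while m_skip is the mean of the two observed values.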


#3

it’s considerably more painful for new users. Fortunately, in the dataframe glm context, completecases can fix part of the problem. Unfortunately, it still leaves the “other” missing value, NaN, to be dealt with. (and completecases does not work for matrices.)
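For a plain matrix, a row filter can be sketched with Base functions alone. This is only a sketch, not something GLM or DataFrames provides; the names X, keep, and Xc are made up for illustration, and it assumes Julia 1.1 or later for eachrow.

```julia
# A small matrix with one NaN and one missing entry.
X = [1.0 2.0;
     NaN 3.0;
     4.0 missing]

# Flag rows in which no entry is missing or NaN.
# ismissing is checked first, so isnan is never called on missing.
keep = [!any(v -> ismissing(v) || isnan(v), row) for row in eachrow(X)]

# Keep only the complete rows.
Xc = X[keep, :]
```

Only the first row survives, because the second contains NaN and the third contains missing.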

even then, every time a new model is run, it requires creating a new data frame, because the variable that contains the missing values may or may not be in the model. I presume one needs to write a function that takes an AbstractDataFrame, copies it, and returns another sanitized DataFrame. Otherwise, I do not see how one can use the formula interface.
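A per-model sanitizer along these lines might look like the following sketch. It assumes a recent DataFrames.jl, where dropmissing accepts a column selection; the helper name sanitize and the example columns are hypothetical.

```julia
using DataFrames

# Hypothetical helper: return a copy with only the rows that are
# complete in the columns a given model actually uses.
sanitize(df::AbstractDataFrame, cols) = dropmissing(df, cols)

df = DataFrame(X = [1, 2, 3, 5, 10],
               Y = [2, 4, missing, 3, 5],
               Z = [missing, 1, 1, 1, 1])

df_xy  = sanitize(df, [:X, :Y])      # drops only the row where Y is missing
df_all = sanitize(df, [:X, :Y, :Z])  # also drops the row where Z is missing
```

Because the column list is passed in, a missing value in Z does not cost an observation in a model that uses only X and Y.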

of course, it can be handled, but it is a roadblock compared to other stats packages I know of.

If the GLM calculations are aware of missing observations, this becomes much more convenient.


#4

uggh, no. this does not do it. the formula interface needs to be known to the data frame sanitizer.

so, I am out of my league here.

this is really problematic for using Julia in the social sciences, where missing values are very common, and data frames can be considered a basic necessary data type. all the nice formula stuff goes out of the window.

this shortcoming puts julia at a big disadvantage relative to R, Stata, etc., in this domain.


#5

I agree with iwelch on this. Regression tables report the number of observations, so if a researcher adds variables in steps and the number of observations falls, it is easy for the reader to understand what is going on. Whether the population is still the same between regressions, once an added variable with missing values has been included, is up for debate, but this is a research question that arises transparently and explicitly from the regression table.

Some people don’t just explain the use of completecases() or dropna() etc, but consider them to be the only logical thing to do if you have missing values across different observations in different columns. I believe it’s because of different practices across disciplines. In my area this is definitely not the obvious thing to do most of the time.


#6

In Julia NaN values are not missing values. They correspond to an invalid number or to the result of an invalid computation such as 0/0. Missing values should be represented using missing, and observations with a missing value in one of the variables are skipped automatically by GLM. For example:

using DataFrames, GLM
data = DataFrame(X=[1,2,3,5,10], Y=[2,4,missing,3,5])
ols = lm(@formula(Y ~ X), data)
nobs(ols) # Gives 4

If you have a dataset with NaN to represent missing values, apply replace(x, NaN => missing) to its columns after importing it.
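A minimal sketch of that replacement on a single vector (the names x and y are only for illustration; for a data frame column one would assign the result back to the column):

```julia
x = [1.0, NaN, 3.0]

# replace matches with isequal, so NaN is found even though NaN == NaN is false.
# The element type of the result widens to allow missing.
y = replace(x, NaN => missing)
```

The original vector x is left unchanged; y holds missing in the second position.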


#7

I think asking users to replace NaN with missings is ok.

mea culpa on the data. see, I concluded from

julia> lm( data[:,1] ~ data[:,2] )
ERROR: MethodError: no method matching ~(::Array{Union{Missing, Int64},1}, ::Array{Union{Missing, Int64},1})

that lm was not set up for missing data. incorrect.

so, somehow there is magic in the formula/dataframe interface to lm that would be nice to back-propagate into the plain lm.


#8

The problem with this approach is that in Julia ~ is a bitwise not. Julia has non-standard evaluation of expressions, but it is done through macros, which is why we use @formula.

If you are working with a dataframe, it’s not clear why you wouldn’t simply write lm(@formula(y ~ x), data). Or if you want to work with objects themselves, you can always write

lm([df.x1 df.x2], df.x3)

There are benefits to this distinction. It means that code is clearer. For example, lm(y ~ x) and lm(y ~ x, data) would mean two very different things if you had vectors y and x defined, while also having data.y and data.x defined.

One thing you frustratingly can’t do is use a Vector as your X in an lm call. Perhaps there should be an additional method that is

fit(model, X::AbstractVector, y) = fit(model, reshape(X, :, 1), y) # now X is a matrix

EDIT: For anyone frustrated by this, remember that X[:,:] will turn any Array{T, 1} into an Array{T, 2}.
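A quick sketch of both conversions, using only Base (the variable names are illustrative):

```julia
x = [1.0, 2.0, 3.0]    # a Vector, i.e. Array{Float64, 1}

X1 = x[:, :]           # indexing with two colons copies into a 3×1 Matrix
X2 = reshape(x, :, 1)  # reshape gives a 3×1 Matrix that shares memory with x
```

Either result can be passed where a design matrix is expected. Note that this adds no intercept column.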


#9

Just a general comment: in R, NA and NaN are also separate things. What is worse, you are not guaranteed which one you will get as a result of operating on them, as the R documentation explains:

Computations involving NaN will return NaN or perhaps NA : which of those two is not guaranteed and may depend on the R platform (since compilers may re-order computations).

R handles this by implementing is.na function in the following way:

The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x , containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN , and FALSE otherwise.

It is easy enough to get the same in Julia:

is_na(x::Number) = ismissing(x) || isnan(x)
is_na(x) = ismissing(x)

if you wanted it.
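A short usage sketch of these two methods, repeated here so the snippet runs on its own (the vector v is made up for illustration):

```julia
is_na(x::Number) = ismissing(x) || isnan(x)
is_na(x) = ismissing(x)   # fallback for missing and for non-numeric values

v = [1.0, NaN, missing, 4.0]

flags = is_na.(v)          # elementwise check via broadcasting
clean = filter(!is_na, v)  # keep only the entries that are neither missing nor NaN
```

The fallback method matters because isnan is not defined for missing or for strings, so dispatch has to route those values away from the Number method.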


#10

the problem is that the solution is not what one would expect if one has tried lm with ordinary numeric vectors and matrices. once someone knows that the solution is just to place the data into a data frame, it is just fine.

there is something to be said for getting what one expects…


#11

One problem with that approach is that many people expect a variety of different things, which leads to an exponential growth of interfaces to support. Using ?lm and then seeing that the way to get a regression with arrays is lm(X, y) doesn’t seem like a high barrier to entry.