Loess with missing values

I want to use Loess smoothing with missing values in y (and then fill those with the loess estimate). I hoped I would be able to do it with pairwise from StatsBase, but did not manage.

The code below works but feels a bit cumbersome:

using Loess
using Missings  # provides allowmissing and disallowmissing

function loess4Miss(x, y; span=0.3)  
  ok = findall(@. !ismissing(x) & !ismissing(y))
  model = loess(disallowmissing(x[ok]), disallowmissing(y[ok]); span=span, degree=1)
  predict(model, x)
end

x = rand(100);
y = sin.(x*π) + 0.03*randn(100)
y = allowmissing(y)
y[10] = missing
loess4Miss(x,y)

Is there a better solution, e.g. with pairwise?
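
Note that loess4Miss returns the smoothed estimate at every x. To fill only the gaps and keep the observed values, one option is `coalesce` from Base. A minimal sketch, where the numbers in `yhat` are made up and merely stand in for the loess predictions:

```julia
# Observed series with gaps, and a stand-in for the smoother's predictions.
# In the setting above, yhat would come from loess4Miss(x, y).
y    = [1.0, missing, 3.0, missing]
yhat = [1.1, 2.0, 2.9, 4.2]   # made-up values, for illustration only

# coalesce returns its first non-missing argument, so observed values win
# and only the gaps are taken from yhat.
filled = coalesce.(y, yhat)
```

Here `filled` keeps 1.0 and 3.0 from `y` and takes the second and fourth entries from `yhat`.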

Could you please open an issue for this in Loess.jl? I also think that requiring an AbstractFloat eltype is a bit cumbersome.

As a small comment: I think your code will not work when x has missing values, as predict will fail (not tested).

What I typically do is the following:

using DataFrames, Loess
function loess4Miss(x, y; span=0.3)  
  df = dropmissing(DataFrame(x=x, y=y, copycols=false), :y)
  model = loess(df.x, df.y; span=span, degree=1)
  predict(model, x)
end

(in general working on multiple columns is usually easier when using DataFrames.jl)
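
Regarding the caveat about missing values in x: one possible workaround is to predict only where x is observed and leave `missing` elsewhere. The sketch below uses a hypothetical helper, `predict_where_observed`, which is not part of Loess.jl, and a stand-in predictor instead of a fitted model:

```julia
# Hypothetical helper (not part of Loess.jl): apply `predictfn` only where x is
# observed and leave `missing` in the remaining positions.
function predict_where_observed(predictfn, x)
    out = Vector{Union{Missing, Float64}}(missing, length(x))
    ok = .!ismissing.(x)                  # positions with an observed x
    out[ok] = predictfn(Float64.(x[ok]))  # predict on the observed subset only
    return out
end

# Stand-in predictor for illustration; with a fitted model this would be
# something like `xs -> predict(model, xs)`.
x = [0.1, missing, 0.3]
yhat = predict_where_observed(xs -> 2 .* xs, x)
```

The result has the same length as `x`, so it stays aligned with the original rows.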

Thanks! Only the df2 (the “2”) is a typo.

I also found skipmissings (from Missings.jl), which is essentially very similar and works without a DataFrame, but it would also remove the cases where x is missing:

  function loess4Miss(x, y; span=0.3)
    xok, yok = collect.(skipmissings(x, y))
    model = loess(xok, yok; span=span, degree=1)
    Loess.predict(model, x)
  end

Yes I think so too. Since x was a timeseries or index, that did not matter here.

skipmissings is similar indeed.

Note that if you have y values with missing x and x values with missing y, you should really use the DataFrames version, because otherwise you can completely screw up your pairings.

Sorry, I don’t get it. I thought skipmissings removes casewise.

I believe the problem you refer to only happens with a separate skipmissing for each variable (of course).
cf. Skipmissing no working in cor function - #3 by pdeffebach
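
To make the pairing issue concrete, here is a small sketch using only Base Julia with made-up data. It contrasts a separate `skipmissing` per vector with casewise removal via a joint mask:

```julia
x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, missing, 40.0]

# Separate skipmissing per vector: both results have length 3, but the
# elements no longer line up (3.0 is now paired with 20.0).
xs = collect(skipmissing(x))
ys = collect(skipmissing(y))

# Casewise removal: keep a position only if both x and y are present there.
keep = .!ismissing.(x) .& .!ismissing.(y)
xc = Float64.(x[keep])
yc = Float64.(y[keep])
```

With the joint mask only the pairs (1.0, 10.0) and (4.0, 40.0) survive, so the pairing is preserved.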

Yes correct. That’s what I meant.

But I also read somewhere that skipmissings (with s) will be discontinued? So should one use the DataFrame way? It looks a bit like a detour, but maybe I just need to get used to it. Any “general” advice is welcome, e.g. from @bkamins. Btw, thanks for all your helpful comments and blogs.

There is a discussion about a better design, but skipmissings will probably stay to avoid breaking changes.


My main comment is that I do not think you have to use DataFrames.jl, but I would recommend using some table-aware storage type, so that rows stay synchronized out of the box across several vectors that are logically connected (which is essentially what you need here). Such a design is, in my opinion, conceptually cleaner.

(in general I think that is why the “data frame” concept became so popular everywhere: it makes thinking about such cases easier)
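
As an illustration of row synchronization without DataFrames.jl, here is a sketch that uses a plain vector of NamedTuples as a simple row-oriented table. The data are made up, and this is just one of several possible table-like containers:

```julia
x = [1.0, 2.0, 3.0]
y = [10.0, missing, 30.0]

# A vector of NamedTuples is a simple row-oriented table: each row carries its
# own x and y, so filtering rows cannot misalign the two columns.
rows = [(x = xi, y = yi) for (xi, yi) in zip(x, y)]
complete = filter(r -> !ismissing(r.y), rows)

# Extract the synchronized columns again, e.g. to pass them to a model.
xc = [r.x for r in complete]
yc = [r.y for r in complete]
```

Because the filter acts on whole rows, `xc` and `yc` always have the same length and order.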

Could this be done in StatsModels.jl? We can add missing_action, and it can be useful for all kinds of models. Just like R has na.action.

Something like:

glm(@formula(y ~ x, missing_action = :skip), df, Binomial(), LogitLink());

Then we can implement StatsModels API in Loess.

loess(@formula(y ~ x, missing_action = :skip), df);

Or, perhaps like this:

loess(@formula(y ~ x), df; missing_action = :skip);

Is the na.action mechanism really useful in R? I’ve never seen people really using it, as the default behavior of skipping missing values seems to be enough for everyone.

In Julia, adding a skipmissing::Bool argument like it exists for several functions would be enough IMO. See for example Propagation of missing values · Issue #496 · JuliaStats/GLM.jl · GitHub.
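
A rough sketch of what such a keyword could look like, using a hypothetical `ols_slope` function that exists only for illustration and is not the API of GLM.jl, StatsModels.jl, or Loess.jl:

```julia
# Hypothetical function, only to illustrate the keyword pattern; it is not the
# API of GLM.jl, StatsModels.jl, or Loess.jl.
function ols_slope(x, y; skipmissing::Bool = false)
    if skipmissing
        # Casewise deletion: keep a pair only if both values are present.
        keep = .!ismissing.(x) .& .!ismissing.(y)
        x, y = x[keep], y[keep]
    end
    mx, my = sum(x) / length(x), sum(y) / length(y)
    # With skipmissing = false, any missing input propagates to the result.
    return sum((x .- mx) .* (y .- my)) / sum((x .- mx) .^ 2)
end

x = [1.0, 2.0, missing, 4.0]
y = [2.0, 4.0, 6.0, missing]

propagated = ols_slope(x, y)                      # missing propagates
skipped    = ols_slope(x, y; skipmissing = true)  # uses the two complete pairs
```

By default the missing values propagate to the result; with `skipmissing = true` only the complete pairs are used.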

skipmissing should be enough in most cases. I think it would be useful to have it in a consistent way, perhaps through StatsModels.
