Loess with missing values

I want to use Loess smoothing on data with missing values in y (and then fill those gaps with the loess estimate). I hoped I would be able to do it with pairwise from StatsBase, but did not manage.

The code below works but feels a bit cumbersome:

using Loess
using Missings  # provides allowmissing / disallowmissing

function loess4Miss(x, y; span=0.3)  
  ok = findall(@. !ismissing(x) & !ismissing(y))
  model = loess(disallowmissing(x[ok]), disallowmissing(y[ok]); span=span, degree=1)
  predict(model, x)
end

x = rand(100);
y = sin.(x*π) + 0.03*randn(100)
y = allowmissing(y)
y[10] = missing
loess4Miss(x,y)

Is there a better solution, e.g. with pairwise?

Can you please open an issue for this in Loess.jl? I also think requiring an AbstractFloat eltype is a bit cumbersome.

As a small comment: I think your code will not work when x has missing values, as predict will fail (not tested).

What I typically do is the following:

using DataFrames
function loess4Miss(x, y; span=0.3)  
  df = dropmissing(DataFrame(x=x, y=y, copycols=false), :y)
  model = loess(df.x, df.y; span=span, degree=1)
  predict(model, x)
end

(In general, working on multiple columns is usually easier with DataFrames.jl.)

Thanks! - only the df2 (the “2”) is a typo.

I also found skipmissings, which is essentially very similar and works without a DataFrame, but it would also remove rows where x is missing:

  function loess4Miss(x, y; span=0.3)
    xok, yok = collect.(skipmissings(x, y))
    model = loess(xok, yok; span=span, degree=1)
    Loess.predict(model, x)
  end

Yes I think so too. Since x was a timeseries or index, that did not matter here.
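
For the case where x itself can contain missings, one possible sketch is to fit on the rows where both values are observed and to predict only where x is observed, leaving the other positions missing. The name loess_fill is made up for illustration; loess and predict are the same calls used above, and the conversion to Float64 is an assumption based on the AbstractFloat requirement mentioned earlier. This has not been checked against every Loess.jl version, and prediction may still fail if a row with missing y lies outside the range of the x values used for fitting.

```julia
using Loess

# Hypothetical helper: smooth y over x, tolerating missings in either vector.
function loess_fill(x, y; span=0.3)
    # Rows usable for fitting: both x and y are observed.
    fit_idx = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    model = loess(Float64.(x[fit_idx]), Float64.(y[fit_idx]); span=span, degree=1)

    # Predict only where x is observed; all other positions stay missing.
    out = Vector{Union{Missing,Float64}}(missing, length(x))
    pred_idx = findall(!ismissing, x)
    out[pred_idx] = predict(model, Float64.(x[pred_idx]))
    return out
end
```

The result has the same length as x, so it can be used to fill the gaps in y position by position.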

skipmissings is similar indeed.

Note that if you have some y values with missing x and some x values with missing y, you should really use the DataFrames version, because otherwise you can completely screw up your pairings.

Sorry, I don’t get it. I thought skipmissings removes casewise.

I believe the problem you refer to only happens with a separate skipmissing for each variable (of course).
cf. Skipmissing no working in cor function - #3 by pdeffebach
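
A small toy example (the data are made up) of the difference being discussed: a separate skipmissing per vector shifts the pairs, while skipmissings drops a position whenever either vector is missing there.

```julia
using Missings  # provides skipmissings (with s)

x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, missing, 40.0]

# Separate skipmissing: each vector drops only its own missings,
# so the surviving elements no longer come from the same rows.
xs = collect(skipmissing(x))   # [1.0, 3.0, 4.0]
ys = collect(skipmissing(y))   # [10.0, 20.0, 40.0]
# Here 3.0 (row 3) ends up next to 20.0 (row 2).

# Joint skipmissings: rows 2 and 3 are dropped from both vectors.
xj, yj = collect.(skipmissings(x, y))   # [1.0, 4.0] and [10.0, 40.0]
```

In this example the separately filtered vectors happen to have the same length, so nothing would even raise an error downstream.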

Yes correct. That’s what I meant.

But I also read somewhere that skipmissings (with s) will be discontinued? So should one use the DataFrame way? It looks a bit like a detour, but maybe I just need to get used to it. Any “general” advice is welcome, e.g. from @bkamins. Btw, thanks for all your helpful comments and blogs.

There is a discussion about a better design, but skipmissings will probably stay to avoid breaking changes.


My major comment is that I do not think you have to use DataFrames.jl, but I would recommend using some table-aware storage type. When several vectors are logically connected, such a type keeps their rows synchronized out of the box, which is essentially what you need here. In my opinion such a design is conceptually cleaner.

(In general, I think that is why the “data frame” concept became so popular everywhere: it makes thinking about such cases easier.)
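
One way to picture row synchronization without DataFrames.jl is a plain vector of NamedTuples, using only Base Julia. This is just an illustrative sketch with made-up data, not a recommendation of a particular storage type.

```julia
x = [1.0, 2.0, missing, 4.0]
y = [10.0, missing, 30.0, 40.0]

# Each element is one row, so an x value always travels with its y value.
rows = [(x = xi, y = yi) for (xi, yi) in zip(x, y)]

# Dropping rows where y is missing cannot misalign the two columns.
kept = filter(r -> !ismissing(r.y), rows)

xs = [r.x for r in kept]   # 1.0, missing, 4.0 (x may still be missing)
ys = [r.y for r in kept]   # 10.0, 30.0, 40.0
```

Filtering on both fields instead (`!ismissing(r.x) && !ismissing(r.y)`) gives the same casewise deletion as skipmissings.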
