Interpolation for missing data:

Hi everyone:

some background:
I have a questions about data interpolation. I have some GPS station data which is missing data. This can occur for various reason (Software updates, hardware swaps, etc…). However, these gaps in data affect my analysis and visualizations.

For example, if you look at the figure below, you’ll notice a sharp spike in a few of the regions right around the 2005 mark. There are others too, as you can see, but the most egregious one is my focus right now, in the hope that correcting it will improve the others.

I’ll try to be as succinct as I can in explaining the process. First, I take a windowed average for each GPS station individually. Second, I group the GPS stations into unique bins based on their geographic location. Third, I average the windowed-average values of all GPS stations contained within each bin.
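For concreteness, the three steps might look roughly like this with DataFrames.jl. This is only a sketch: the column names (`:station`, `:lat`, `:lon`, `:decyr`, `:value`), the trailing-window average, and the 1°×1° binning are all my assumptions, not OP's actual code.

```julia
using DataFrames, Statistics

# (1) trailing windowed average of a vector (window width w)
function windowed_mean(v::AbstractVector, w::Int)
    [mean(@view v[max(1, i - w + 1):i]) for i in eachindex(v)]
end

function regional_means(df::DataFrame, w::Int)
    # (1) per-station windowed average
    df = transform(groupby(df, :station),
                   :value => (v -> windowed_mean(v, w)) => :win_avg)
    # (2) assign each station to a geographic bin (here: 1°×1° cells)
    df.bin = tuple.(floor.(Int, df.lat), floor.(Int, df.lon))
    # (3) average the windowed values across stations within each bin and epoch
    combine(groupby(df, [:bin, :decyr]), :win_avg => mean => :bin_avg)
end
```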

Here is the problem:
For a few of the regions, at specific epochs, there are only a handful of unique data points; sometimes only 1 station is active in a bin. Of the 371 stations used in my study, if only 1 station contributes to the average and its value happens to be an outlier compared to other regions, it creates a rather chaotic plot instead of being “smoothed out” by other stations in the bin.
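One way to tame those single-station epochs, independent of any interpolation, is to count the contributing stations per bin and epoch and drop (or flag) averages backed by too few of them. A hedged sketch, again with assumed column names (`:bin`, `:decyr`, `:win_avg`) and an arbitrary threshold:

```julia
using DataFrames, Statistics

# Keep a bin/epoch average only if at least `min_stations` stations contribute
function robust_bin_means(df::DataFrame; min_stations::Int = 3)
    g = combine(groupby(df, [:bin, :decyr]),
                :win_avg => mean => :bin_avg,
                nrow => :n_stations)          # station count per bin/epoch
    filter(:n_stations => (n -> n >= min_stations), g)
end
```

The dropped epochs then become explicit gaps, which the interpolation step can fill instead of propagating a lone outlier.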

My attempted solution:
I want to interpolate the missing values to create a smoother plot. Right now I am trying to use the Interpolations.jl package, but the result looks “off”. I don’t expect it to look “natural” or anything like that, but I was wondering if anyone has advice or insight on a more robust interpolation?

My current workflow (simplified):
(1) compare missing data/dates to a full list of dates:

using DelimitedFiles  # provides readdlm

all_years = readdlm("/path/to/file/julia/mjd_decyr.txt")[:, 1]
all_years = sort(all_years)

(2) join data to identify gaps for interpolation:

using DataFrames

# leftjoin takes two tables, so wrap the vector of epochs in a DataFrame;
# epochs with no matching observation get `missing` values
df_full = leftjoin(DataFrame(decyr = all_years), gps_df, on = :decyr, makeunique = true)
sort!(df_full, :decyr)  # leftjoin does not guarantee row order

(3)

# set up the interpolation from the existing (non-missing) observations;
# Interpolations.jl requires the knots to be sorted, and Line() extrapolates
# the end slope linearly beyond the data range
interp_function = LinearInterpolation(df.decyr, df.value, extrapolation_bc = Line())

# evaluate at every epoch, gaps included
interpolated_values = interp_function.(df_full.decyr)

From there I feed those values back into the DataFrame and write to a file.

Here is the pre-interpolated data with the gaps:

And here is the interpolated plot:

The biggest issue is the end of my plot, which is completely skewed. Secondary is that the interpolation created a “step” in the data, which I can live with, given that that is the extent of what the Interpolations.jl package can do.
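The skewed end of the plot is typical of `Line()` extrapolation: beyond the last observation, Interpolations.jl extends the final segment’s slope indefinitely. One hedged option is flat extrapolation, so the tails hold the last observed value instead. A minimal sketch with made-up numbers (`xs`/`ys` stand in for the sorted, non-missing decyr/value columns):

```julia
using Interpolations

xs = [2001.0, 2001.5, 2002.0, 2004.0, 2004.5]
ys = [1.0, 1.2, 1.1, 1.4, 1.3]

# Flat() clamps to the boundary values outside the data range,
# instead of extending the end slope like Line()
itp = LinearInterpolation(xs, ys, extrapolation_bc = Flat())

itp(2003.0)  # interpolated inside the gap
itp(2006.0)  # outside the data: held at the last observed value
```

Alternatively, only evaluate the interpolant inside `extrema(xs)` and leave the epochs outside the observed range as gaps.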

But, if anyone has better implementations they wouldn’t mind sharing, I would be extremely grateful!

Thank you in advance! :slight_smile:

Not sure if it suits here, but if you have multiple variables for a given location, you may want to impute the missing values based on some sort of proximity to the other locations, using an ML imputation model rather than interpolation.


Look for smoothing interpolation that handles gaps, for example Dierckx.jl (https://github.com/kbarbary/Dierckx.jl), a Julia package for 1-d and 2-d splines, or DIVAnd.jl (https://github.com/gher-uliege/DIVAnd.jl), which performs an n-dimensional variational analysis of arbitrarily located observations.
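A smoothing spline from Dierckx.jl fitted only to the observed points bridges gaps without the hard “step” of piecewise-linear interpolation. A sketch with made-up values; the smoothing factor `s` trades fidelity for smoothness and is something to tune, not a recommendation:

```julia
using Dierckx

# Observed (non-missing) epochs and values; the gap between 2002 and 2004
# is simply absent from the fit
xs = [2001.0, 2001.5, 2002.0, 2004.0, 2004.5, 2005.0]
ys = [1.0, 1.2, 1.1, 1.4, 1.3, 1.5]

spl = Spline1D(xs, ys; s = length(xs))  # s = 0 would interpolate exactly
evaluate(spl, 2003.0)                   # evaluate inside the gap
```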

A good way to get ideas would be to take the pre-interpolated data from the OP (the BMRY data), dump it into a CSV file, and link to it from this thread. People on this thread could then try out their suggestions.

As often repeated in this forum, the easier it is to get something to work on for the community, the better the response.


Solid idea. I will edit my OP to contain the BMRY (pre-interpolated) data, as well as the years file for the gaps.

EDIT: looks like I can’t upload CSVs here. I’ll have to link it through JuliaHub, it seems. I can’t do it at this very moment, but I’ll try to do it later tonight!

Maybe the Spatiotemporal example from the Impute.jl documentation (Spatiotemporal · Impute.jl) is of interest?
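For the simple 1-d case, Impute.jl can fill `missing` runs directly, which sidesteps building an explicit interpolation object. A minimal sketch (the same call also works column-wise on tables/DataFrames):

```julia
using Impute

# A vector with an interior gap marked by `missing`
v = [1.0, missing, missing, 4.0, 5.0]

# Impute.interp fills interior gaps linearly between the
# surrounding observed values; leading/trailing missings stay
filled = Impute.interp(v)
```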

Am I looking at this really wrong? In the pre-interpolated plots, the windowed average lines look similar but the points are clearly different between “vertical” and “east”. But the interpolated plots’ points look identical, except the “east” graph shifts most of the points left to cover the gap (hence the step) and an odd line shows up on the right end. Comparing the two plots, it looks like the interpolated plot’s data is the “east” data from the pre-interpolated plot.

Assuming that the interpolated plot is actually the “east” data before and after interpolation, it’s consistent with linear interpolation if df_full.decyr is actually out of order; move the shifted points back, then move the odd line and the few trailing data points into the gap, there’s the linear interpolation for the “east” data. It’s hard to tell without a MWE where that shift happened.
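If `df_full.decyr` really is out of order, a plain `sort!` before interpolating and plotting would rule this out, since a line plot connects points in row order rather than time order. A minimal sketch with made-up values:

```julia
using DataFrames

# leftjoin does not guarantee row order; out-of-order epochs make a
# line plot zig-zag even though the (x, y) pairs are correct
df_full = DataFrame(decyr = [2001.0, 2004.0, 2002.0], value = [1.0, 4.0, 2.0])
sort!(df_full, :decyr)  # restore time order before plotting
```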

You may want to try the short code from Daniel Mühlbachler’s archived Whittaker.jl package (in old Julia, but quick to fix).

FWIW, for this input data:

using Random
t = 2001:0.003:2007
N = length(t)
v = 5*cospi.((t .- 2000)/2) + 5*rand(N)
v[2002.6 .< t .< 2003.4] .= NaN

the second order whittaker2() function produces these results:

See interesting Python examples of the method here.


Actually the smoothness criterion from Whittaker is a special case of the one used in DIVAnd.

using Plots
using DIVAnd
using LinearAlgebra
using SparseArrays
using Statistics


using Random
t = collect(2001:0.003:2007)
N = length(t)
v = 5*cospi.((t .- 2000)/2) + 5*rand(N)
v[2002.6 .< t .< 2003.4] .= NaN
# Treat non NaN points as data:
td=t[.!isnan.(v)]
vd=v[.!isnan.(v)]


# Mask, also controls the end-point behaviour
mask = trues(size(t))
#mask[1]=false
#mask[end]=false

# Grid metrics
pm = ones(size(t)) / (t[2]-t[1]);

# Two parameters controlling the smoothness and the fit to the data
len=2.0
epsilon2=0.3

# Analysis works on anomalies, so subtract mean (or even better trends)
meany=mean(vd)

# Analyse
fi,s=DIVAndrun(mask,(pm,),(t,),(td,),vd.-meany,len,epsilon2)

# Add the mean
fi.=fi.+meany

# Normal smoothing interpolation
scatter(td, vd, ms=1)
plot!(t, fi, lw=2)

and results look quite similar :smiley: