Interpolation for missing data:

Hi everyone:

Some background:
I have a question about data interpolation. I have GPS station data with gaps, which can occur for various reasons (software updates, hardware swaps, etc…). These gaps affect my analysis and visualizations.

For example, if you look at the figure below, you’ll notice a sharp spike in a few of the regions right around the 2005 mark. There are others too, as you can see, but the most egregious one is my focus right now, with the hope that correcting it will improve the others.

I’ll try to be as succinct as I can in explaining the process. First, I take a windowed average for each GPS station individually. Second, I group the GPS stations into unique bins based on their geographic location. Third, I take the average of the windowed-average values for the GPS stations contained within each bin.
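In code terms, steps 2 and 3 look roughly like this (a simplified sketch; :bin and :win_avg are placeholder columns for the geographic bin label and the step-1 windowed average, not my real names):

using DataFrames, Statistics

# steps 2-3: per (bin, epoch), average the stations' windowed means,
# and count how many distinct stations contributed
binned = combine(groupby(df, [:bin, :decyr]),
                 :win_avg => mean => :bin_avg,
                 :stat => (s -> length(unique(s))) => :nstations)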

Here is the problem:
For a few of the regions, at specific epochs, there are only a handful of unique data points; sometimes only 1 station is active in a bin. Of the 371 stations used in my study, if only 1 station contributes to the average, and its value happens to be an outlier compared to other regions, it creates a rather chaotic plot instead of being “smoothed out” by the other stations in the bin.
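To locate where this happens, I can count the active stations per bin and epoch (using the nstations column from the sketch above):

# (bin, epoch) pairs where a single station drives the regional average
problem_epochs = filter(:nstations => ==(1), binned)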

My attempted solution:
I want to interpolate the missing values to create a smoother plot. Right now I am trying to use the Interpolations.jl package, but the result looks “off”. I don’t expect it to look “natural” or anything like that, but I was wondering if anyone has advice or insight on a more robust interpolation?

My current workflow (simplified):
(1) compare missing data/dates to a full list of dates:

all_years = readdlm("/path/to/file/julia/mjd_decyr.txt")[:, 1]
all_years = sort(all_years)

(2) join data to identify gaps for interpolation:

df_full = leftjoin(DataFrame(decyr = all_years), gps_df, on = :decyr, makeunique=true)

(3)

# set up interpolation function using existing data
interp_function = LinearInterpolation(df.decyr, df.value, extrapolation_bc=Line())

# Interpolate
interpolated_values = interp_function.(df_full.decyr)

From there I feed those values back into the DataFrame and write to a file.

Here is the pre-interpolated data with the gaps:

And here is the interpolated plot:

The biggest issue is the end of my plot, which is completely skewed. Secondary is that the interpolation created a “step” in the data, which I can live with, given that is the extent of what the Interpolations.jl package can do.
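My guess (untested) is that the skewed tail comes from extrapolation_bc=Line() extending the slope of the last observed segment past the final datum; if so, a flat boundary condition might behave better:

# hold the last observed value instead of extrapolating the final slope
interp_function = LinearInterpolation(df.decyr, df.value, extrapolation_bc=Flat())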

But, if anyone has better implementations they wouldn’t mind sharing, I would be extremely grateful!

Thank you in advance! :slight_smile:

Not sure if this fits here, but if you have multiple variables for a given location, you may want to impute the missing values based on some sort of proximity to the other locations, using an ML imputation model rather than interpolation.


Look for smoothing interpolation with gaps: for example Dierckx.jl (https://github.com/kbarbary/Dierckx.jl), a Julia package for 1-d and 2-d splines, or DIVAnd.jl (https://github.com/gher-uliege/DIVAnd.jl), which performs an n-dimensional variational analysis of arbitrarily located observations.
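For instance, a Dierckx.jl smoothing spline can be fit through the observed points only and then evaluated across the gaps (a toy sketch; the smoothing factor s needs tuning for real data):

using Dierckx

t = collect(2001:0.01:2007)                         # full date grid
keep = .!(2003.0 .< t .< 2003.5)                    # pretend this window is a gap
spl = Spline1D(t[keep], sin.(t[keep]); k=3, s=0.1)  # cubic smoothing spline
smoothed = spl(t)                                   # evaluated across the gap too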

A good way to get ideas would be to take the pre-interpolated data from the OP (the BMRY data), dump it into a CSV file, and link to it from this thread. The people on this thread could then try out their suggestions.

As often repeated in this forum, the easier it is to get something to work on for the community, the better the response.


Solid idea. I will edit my OP to contain the BMRY (pre-interpolated) data, as well as the years file for the gaps.

EDIT: looks like I can’t upload CSVs here. I’ll have to link it through JuliaHub, it looks like. I can’t do it at this very moment, but I’ll try to do it later tonight!

Maybe the example from the Impute.jl docs (Spatiotemporal · Impute.jl) is of interest?
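A minimal example of the basic idea (not the spatiotemporal version from that page):

using DataFrames, Impute

df = DataFrame(decyr = [2004.0, 2004.1, 2004.2, 2004.3],
               value = [1.0, missing, missing, 2.5])
Impute.interp(df)   # fills interior missings linearly, column by column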

Am I looking at this really wrong? In the pre-interpolated plots, the windowed average lines look similar but the points are clearly different between “vertical” and “east”. But the interpolated plots’ points look identical, except the “east” graph shifts most of the points left to cover the gap (hence the step) and an odd line shows up on the right end. Comparing the two plots, it looks like the interpolated plot’s data is the “east” data from the pre-interpolated plot.

Assuming that the interpolated plot is actually the “east” data before and after interpolation, it’s consistent with linear interpolation if df_full.decyr is actually out of order; move the shifted points back, then move the odd line and the few trailing data points into the gap, there’s the linear interpolation for the “east” data. It’s hard to tell without a MWE where that shift happened.
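A quick check, using the names from the OP’s snippet:

# linear interpolation (and the plot) assume ascending dates
issorted(df_full.decyr) || sort!(df_full, :decyr)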

You may want to try the short code from Daniel Mühlbachler’s archived Whittaker.jl package (in old Julia, but quick to fix).
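Roughly, that smoother solves (W + λD'D)z = Wy, where W is a diagonal 0/1 weight matrix that masks the gaps and D takes second differences. A self-contained sketch of my own (not the package’s exact code):

using SparseArrays

# minimal second-order Whittaker smoother; y has NaN gaps, lambda sets smoothness
function whittaker2(y::AbstractVector, lambda::Real)
    n = length(y)
    w = Float64.(.!isnan.(y))               # 0/1 weights: NaN entries get weight 0
    y0 = replace(y, NaN => 0.0)             # gap values are masked by w anyway
    D = spdiagm(n - 2, n, 0 => ones(n - 2), 1 => fill(-2.0, n - 2), 2 => ones(n - 2))
    W = spdiagm(0 => w)
    (W + lambda * (D' * D)) \ (W * y0)      # solve (W + lambda*D'D) z = W*y
end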

FWIW, for this input data:

using Random
t = 2001:0.003:2007
N = length(t)
v = 5*cospi.((t .- 2000)/2) + 5*rand(N)
v[2002.6 .< t .< 2003.4] .= NaN

the second-order whittaker2() function produces these results:

See interesting Python examples of the method here.


Actually the smoothness criterion from Whittaker is a special case of the one used in DIVAnd.

using Plots
using DIVAnd
using LinearAlgebra
using SparseArrays
using Statistics


using Random
t = collect(2001:0.003:2007)
N = length(t)
v = 5*cospi.((t .- 2000)/2) + 5*rand(N)
v[2002.6 .< t .< 2003.4] .= NaN
# Treat non-NaN points as data:
td=t[.!isnan.(v)]
vd=v[.!isnan.(v)]


# Mask, also controls the end-point behaviour
mask = trues(size(t))
#mask[1]=false
#mask[end]=false

# Grid metrics
pm = ones(size(t)) / (t[2]-t[1]);

# Two parameters controlling the smoothness and data-fitting
len=2.0
epsilon2=0.3

# Analysis works on anomalies, so subtract mean (or even better trends)
meany=mean(vd)

# Analyse
fi,s=DIVAndrun(mask,(pm,),(t,),(td,),vd.-meany,len,epsilon2)

# Add the mean
fi.=fi.+meany

# Normal smoothing interpolation
scatter(td, vd, ms=1)
plot!(t, fi, lw=2)

and results look quite similar :smiley:


Yes, the whittaker2() function produces almost identical results to DIVAnd, except for minor differences at the beginning and end:

Btw, in your code snippet, the packages LinearAlgebra and SparseArrays do not seem to be used (Statistics is needed for mean, though).


Update for those interested:

I think the issue was that I was trying to do too much at once with my script. But! Look at the difference in the “smoothness” here. It’s much better.

Thank you for all the input and feedback!

Here is the relevant code I ended up using. Much of the issue was that, originally, some of the columns were getting dropped, making further processing difficult, so I ended up filling in missing data where I could and substituting 0.0 for data that I couldn’t replace or interpolate.

for data in east_data
    df = df_loop(data)
    # sort dates
    sorted_indices = sortperm(df.decyr)
    df = df[sorted_indices, :]

    # drop rows with missing dates/values first, then convert to Float64
    # (Float64.() would error on missing values)
    df = dropmissing(df, [:decyr, :value])
    df.decyr = Float64.(df.decyr)
    df.value = Float64.(df.value)

    df_full = DataFrame(decyr = all_years)

    #join df_full with df on :decyr
    df_full = leftjoin(df_full, df, on = :decyr, makeunique=true)

    # set up interpolation function using existing data
    interp_function = linear_interpolation(df.decyr, df.value, extrapolation_bc=Line())

    # interpolate
    interpolated_values = interp_function.(df_full.decyr)

    # Use existing values where available; otherwise, use interpolated values
    df_full.value = coalesce.(df_full.value, interpolated_values)

    # fill in 'missing' columns 
    if "stat" in names(df)  # names(df) returns strings, so compare against a string
        unique_stat = unique(df.stat)
        if length(unique_stat) == 1
            df_full.stat = fill(unique_stat[1], nrow(df_full))
        else
            df_full.stat = fill(missing, nrow(df_full))
        end
    else
        # If :stat is not present, infer from the filename
        station = split(basename(data), "_")[1]
        df_full.stat = fill(station, nrow(df_full))
    end

    # fill in lon and lat for missing
    if "lon" in names(df)
        unique_lon = unique(df.lon)
        df_full.lon = length(unique_lon) == 1 ? fill(unique_lon[1], nrow(df_full)) : fill(missing, nrow(df_full))
    else
        df_full.lon = fill(missing, nrow(df_full))
    end

    if "lat" in names(df)
        unique_lat = unique(df.lat)
        df_full.lat = length(unique_lat) == 1 ? fill(unique_lat[1], nrow(df_full)) : fill(missing, nrow(df_full))
    else
        df_full.lat = fill(missing, nrow(df_full))
    end

    # write
    station = df_full.stat[1]
    if ismissing(station) || station == ""
        # Extract station name from filename if :stat is missing or empty
        station = split(basename(data), "_")[1]
    end
    sort!(df_full, :decyr)
    output_file = "/home/rob/Documents/julia2/ave_data/interp_east/$(station)_oneyear_interp_east.csv"
    CSV.write(output_file, df_full)
end