Rolling Median vs Rolling Mean:

Hi all;

I have a question that I am hoping someone smarter than me (not a difficult thing to be) can help me out with.

I am smoothing some GPS/GNSS data, and I am wondering why, at many of the “excursions”, the absolute value of the rolling median is larger than that of the rolling mean. Intuitively, at least, I would expect the opposite.

Here is the code I used for the rolling mean:

function moving_window(data, bandwidth)

    result = Float64[]                   # concretely typed instead of Vector{Any}
    moving_sum = sum(data[1:bandwidth])  # sum over the first full window
    push!(result, round(10 * (moving_sum / bandwidth); digits = 5))

    # slide the window one sample at a time, updating the running sum
    for i in (bandwidth + 1):length(data)
        moving_sum += data[i] - data[i - bandwidth]
        push!(result, round(10 * (moving_sum / bandwidth); digits = 5))
    end

    return result
end
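As a quick cross-check of the hand-rolled mean, RollingFunctions.jl (which I also tried below for the median) has a rollmean that should agree with this, up to the ×10 scaling, the rounding, and floating-point drift in the running sum:

moving_window(agmt_df.value, 31) ≈ round.(10 .* rollmean(agmt_df.value, 31); digits = 5)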

I tried two different packages for the rolling median, RollingFunctions.jl and FastRunningMedian.jl, both of which produced the same output.

Here is what I wrote for those:

window7 = 7;
window15 = 15;
window31 = 31;

# didn't provide expected result - trying different package to compare:
#=
w7_result = round.(rollmedian((10 .* agmt_df.value), window7); digits = 5)
w15_result = round.(rollmedian((10 .* agmt_df.value), window15); digits = 5)
w31_result = round.(rollmedian((10 .* agmt_df.value), window31); digits = 5)
=#

w7_result = round.(running_median((10 .* agmt_df.value), window7, :none); digits = 5)
w15_result = round.(running_median((10 .* agmt_df.value), window15, :none); digits = 5)
w31_result = round.(running_median((10 .* agmt_df.value), window31, :none); digits = 5)

Here is a comparison of the plots with various window sizes:

[Plots: 30-day, 15-day, and 7-day window comparisons]

Any insights would be greatly appreciated!

Thank you :slight_smile:
-Rob

Try debugging this with a test data set that is small enough to be computed by hand, and check that things match.
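For example, something like this is small enough to verify by eye (a sketch, assuming your moving_window function and FastRunningMedian are already loaded):

using FastRunningMedian

# one obvious spike in an otherwise flat series
data = [1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0]

moving_window(data, 3)
# -> [10.0, 40.0, 40.0, 40.0, 10.0]   (window means ×10, per the scaling in the function)

running_median(10 .* data, 3, :none)
# -> [10.0, 10.0, 10.0, 10.0, 10.0]   (the single spike never reaches the median)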

Can you say more about why you’d expect this – is that a property of the data?

A median can be larger than the mean in many cases, e.g. [1, 1, 0, -2, -2]. It seems like most of the points you see where the median is larger are ones where there’s a sudden drop in the average, and the median updates more slowly.
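You can check that directly (Statistics is in the standard library):

using Statistics

v = [1, 1, 0, -2, -2]
mean(v)    # -0.4
median(v)  #  0.0, which is larger than the mean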

It’s not a property of the data by any means. I guess I am used to seeing that “smoother” trend when using a median vs a mean, and it just didn’t seem to be the case here.

It makes more sense that my assumption/expectation was wrong, rather than two separate packages being incorrect and producing the same result.

But here is the data I’m using, plotted behind the mean and medians, just so you can see what the distribution looks like:

[Plots: data with the 30-day rolling median and the 30-day rolling mean overlaid]

I’m thinking I should instead remove the extreme outliers (maybe with a density-based algorithm), and then try the rolling mean and medians again. :thinking:
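Something as simple as a scaled-MAD cutoff might even be enough before reaching for a full density-based method; a rough sketch of what I have in mind (the function name and the k = 5 cutoff are just placeholders, not from any of the packages above):

using Statistics

# flag points more than k scaled MADs from the median
function mad_inliers(values; k = 5)
    m = median(values)
    mad = 1.4826 * median(abs.(values .- m))   # ≈ σ for normally distributed data
    return abs.(values .- m) .<= k * mad       # boolean mask of points to keep
end

# e.g. keep = mad_inliers(agmt_df.value); then run the rolling mean/median on agmt_df.value[keep]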

Does this fine-scale variation have meaning for you, or are you smoothing way too little here for what you’re actually interested in?

It’s a balance of not wanting to over-smooth and completely remove the anomaly I’m researching. Since I have various lengths of data for different GPS stations, I need to choose a window that makes sense. For this station in particular, I could probably get away with a window of 50-75 days, but, there are some stations that don’t have 1800+ days of data.

This is also a preliminary examination of the data before I make an animation. The smoothing is to reduce flicker in the animation so I can better evaluate the propagation of the anomaly across the US.

For this kind of stuff I’ve used ApproxFun Chebyshev polynomials in Turing models with priors that impose smoothness to extremely good effect.

using ApproxFun, Turing, Distributions, LinearAlgebra

@model function smooth(x, y, t0, tend)
    # overall size of coefficients on the scale of about +- 5, but this is just the base
    # prior, we modify below
    c ~ MvNormal(zeros(10), 5.0^2 * I)
    e ~ Gamma(5.0, 5.0/4.0)  # sd of error around 5

    f = Fun(Chebyshev(t0 .. tend), c)   # Chebyshev expansion with coefficients c
    fpm = Derivative() * f              # first derivative
    fpmpm = Derivative() * fpm          # second derivative
    ifpmsq = Integral() * (fpm^2)       # antiderivative of (f')^2
    ifpmpmsq = Integral() * (fpmpm^2)   # antiderivative of (f'')^2

    # modify coefficient priors in a dependent way,
    # encourage low mean square derivative behavior
    Turing.@addlogprob! logpdf(Exponential(5.0^2), (ifpmsq(tend) - ifpmsq(t0)) / (tend - t0))

    # encourage low mean square 2nd deriv behavior
    Turing.@addlogprob! logpdf(Exponential(2.0^2), (ifpmpmsq(tend) - ifpmpmsq(t0)) / (tend - t0))

    preds = f.(x)
    y ~ MvNormal(preds, e^2 * I)
    return preds
end
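Sampling from it is then the usual Turing workflow; roughly something like this (a sketch, with t, obs, t0, and tend standing in for your time vector, observations, and interval endpoints):

chain = sample(smooth(t, obs, t0, tend), NUTS(), 1000)

# the model returns the smoothed predictions, so in recent Turing versions
# generated_quantities(smooth(t, obs, t0, tend), chain) recovers them per sample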

You can tune the priors in terms of the mean squared first- and second-derivative behavior, or add further conditions that encode your prior knowledge. The prior is expressed in terms of change per unit time, so it is independent of both the number of data points and the span of the data (but not of the units, unless you non-dimensionalize the data first).