I have a question that I am hoping someone smarter than me (not a difficult thing to be) can help me out with.
I am smoothing out some GPS/GNSS data, and I am wondering why, at many of the “excursions”, the rolling median is larger in absolute value than the rolling mean. I would, at least intuitively, expect to see the opposite.
Here is the code I used for the rolling mean:
function moving_window(data, bandwidth)
    result = Float64[]
    # sum over the first full window
    moving_sum = sum(data[1:bandwidth])
    # window mean, scaled by 10 and rounded to 5 digits
    push!(result, round(10 * (moving_sum / bandwidth); digits = 5))
    # slide the window one sample at a time: add the newest point, drop the oldest
    for i in (bandwidth + 1):length(data)
        moving_sum += data[i] - data[i - bandwidth]
        push!(result, round(10 * (moving_sum / bandwidth); digits = 5))
    end
    return result
end
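For reference, a minimal way to exercise that function on a synthetic series; data here is just a stand-in, not the actual station data:

data = cumsum(randn(200))            # stand-in for a daily GPS displacement series
smoothed = moving_window(data, 30)   # trailing 30-day window
length(smoothed)                     # == length(data) - 30 + 1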
and I tried two different packages for a rolling median, RollingFunctions.jl and FastRunningMedian.jl, both of which produced the same output.
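One quick way to rule the packages out entirely is to cross-check them against a naive rolling median built directly on Statistics.median. This is only a sketch; rolling_median is an illustrative helper, not a function from either package:

using Statistics

# brute-force rolling median over the same trailing windows as moving_window
function rolling_median(data, bandwidth)
    [median(@view data[(i - bandwidth + 1):i]) for i in bandwidth:length(data)]
end

# if this agrees with the package output (up to how each package aligns or tapers
# the window edges), the packages are fine and the behavior really is in the data
rolling_median(data, 30)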
Can you say more about why you’d expect this – is that a property of the data?
A median can be larger than the mean in many cases; e.g. for [1, 1, 0, -2, -2] the median is 0 while the mean is -0.4. It seems like most of the points you see where the median is larger are ones where there’s a sudden drop in the average, and the median updates more slowly.
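A quick check of that example (Statistics is in the standard library):

using Statistics
v = [1, 1, 0, -2, -2]
median(v)   # 0.0
mean(v)     # -0.4, so the median sits above the mean here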
It’s not a property of the data by any means. I guess I am used to seeing that “smoother” trend when using a median vs a mean, and it just didn’t seem to be the case here.
It makes more sense that my assumption/expectation was wrong, rather than two separate packages being incorrect and producing the same result.
But here is the data I’m using, plotted behind the mean and medians, just so you can see what the distribution looks like:
I’m thinking I should instead remove the extreme outliers first (maybe with a density-based algorithm; see the sketch below this post), and then try the rolling mean and median again.
It’s a balance of not wanting to over-smooth and completely remove the anomaly I’m researching. Since I have varying lengths of data for different GPS stations, I need to choose a window that makes sense. For this station in particular I could probably get away with a window of 50-75 days, but there are some stations that don’t have 1800+ days of data.
This is also a preliminary examination of the data before I make an animation. The smoothing is to reduce the flicker of the animation so I can better evaluate the propagation of the anomaly across the US.
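Not something from the thread, but a minimal sketch of one possible neighbor-count filter in the spirit of a density-based method (a DBSCAN-style core-point test); eps, min_neighbors, window, and is_inlier are made-up names for illustration, and the thresholds would need tuning per station:

# keep a point if at least min_neighbors other samples within +-window indices
# lie within eps of its value; crude, but density-flavored
function is_inlier(data, i; eps = 0.5, min_neighbors = 3, window = 15)
    lo = max(firstindex(data), i - window)
    hi = min(lastindex(data), i + window)
    count(j -> j != i && abs(data[j] - data[i]) <= eps, lo:hi) >= min_neighbors
end

# data is the station's displacement series (a Vector{Float64})
filtered = [x for (i, x) in pairs(data) if is_inlier(data, i)]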
For this kind of stuff I’ve used ApproxFun Chebyshev polynomials in Turing models with priors that impose smoothness to extremely good effect.
using ApproxFun, Turing, LinearAlgebra

@model function smooth(x, y, t0, tend)
    # overall size of coefficients on the scale of about +-5, but this is just the
    # base prior; we modify it below
    c ~ MvNormal(zeros(10), 5.0^2 * I)
    e ~ Gamma(5.0, 5.0 / 4.0)          # sd of error around 5
    f = Fun(Chebyshev(t0 .. tend), c)
    fpm = Derivative() * f             # first derivative
    fpmpm = Derivative() * fpm         # second derivative
    ifpmsq = Integral() * (fpm^2)      # antiderivative of (f')^2
    ifpmpmsq = Integral() * (fpmpm^2)  # antiderivative of (f'')^2
    # modify coefficient priors in a dependent way:
    # encourage low mean-square derivative behavior
    Turing.@addlogprob!(logpdf(Exponential(5.0^2), (ifpmsq(tend) - ifpmsq(t0)) / (tend - t0)))
    # encourage low mean-square 2nd-derivative behavior
    Turing.@addlogprob!(logpdf(Exponential(2.0^2), (ifpmpmsq(tend) - ifpmpmsq(t0)) / (tend - t0)))
    preds = f.(x)
    y ~ MvNormal(preds, e^2 * I)
    return preds
end
You can tune the priors in terms of the mean squared derivative and second-derivative behavior, or add additional conditions reflecting your prior knowledge. The prior is expressed in terms of the change per unit time, so it is independent of both the number of data points and the span of the data (but not independent of units, unless you first non-dimensionalize the data).
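For completeness, one way a model like this might be run; the NUTS settings and the use of generated_quantities to pull out the fitted curve are assumptions on my part, not something from the post above:

# x: observation times (e.g. days since the first epoch), y: displacements
model = smooth(x, y, minimum(x), maximum(x))
chain = sample(model, NUTS(), 500)

# posterior draws of the fitted curve evaluated at the observation times
curves = generated_quantities(model, chain)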