Interpolation/smoothing/decimating of unevenly spaced data

I have experimental data on two channels p1(t), p2(t), both taken on different and unevenly spaced time points (approx. each second, with jitter and some data lacking). The data size is some 10^5 points each. I need f(t) = p1(t)-p2(t) , and I actually don’t need that many datapoints, 10 times less should be more that enough.

I’m thinking of using e.g. Interpolations.jl to interpolate both data sets to integer time values and then use e.g. RollingFunctions.jl to get moving average and decimate.

Any better ideas? Actually I have never used any of these packages

I think the approach is reasonable. If your data is coming from a live “stream” you may also take a look at OnlineStats.jl or packages that model time series more explicitly. If the ultimate goal is to subtract the signals, the approach with Interpolations.jl seems perfectly fine (assuming the signals are smooth).

juliohm, thank you. The data are a bit noisy, smoothing is one of the reasons to use moving average - just I don’t know of any package for smoothing unevenly spaced data, thus first interpolating onto an evenly spaced grid. No, the data are not from a live stream.

SavitzkyGolay.jl is probably exactly what I need for the second part, instead of RollingFunctions.jl

Could you share some pictures of your data?

The project is not exactly public, but well, here is a data subset:


The noise is not actually seen on this scale, what we see is some internal discretization behavior of the sensors.

I know (especially after looking onto the plot), I could just take the datapoints nearest to my new 10-seconds grid, and that would be OK in principle. Or use Savitzky-Golay, ignoring mis-alignment of the data.

For that kind of data, linear interpolation and some basic filter (ex: running median) might be enough, maybe running the latter first.

rafael.guerra, thank you, running median is a good idea. Not difficult to implement - but is there anything ready-to-use? I mean, for each 10-seconds-node take the median of the data within ±5 … ±7 … ±10 seconds span - and that’s all.

The following running median code was adapted from @Tim.Holly’s moving mean using CartesianIndices.

You can try it:

using Statistics

function moving_median(A::AbstractArray, m::Int)
    out = similar(A)
    R = CartesianIndices(A)
    Ifirst, Ilast = first(R), last(R)
    I1 = m÷2 * oneunit(Ifirst)
    for I in R
        out[I] = median(view(A, max(Ifirst, I-I1):min(Ilast, I+I1)))
    end
    return out
end

Here’s a simple running median that works for unevenly spaced data, unlike the indexing solution:

julia> using Statistics, StructArrays, FlexiJoins, IntervalSets, DataPipes

# values of p measured at arbitrary times t
julia> data = StructArray(t=1000*rand(10^5), p=rand(10^5));

# the target times: uniform spacing of 10 units
julia> T = 0:10:1000

# join data to T within +- 5 units of time, and find the p median within each of these windows
julia> @p innerjoin((;T, data), by_pred(t -> t ± 5, ∋, x -> x.t); groupby=:T) |>
          map((t=_.T, p=median(_.data.p)))
101-element StructArray(::Vector{Int64}, ::Vector{Float64}) with eltype NamedTuple{(:t, :p), Tuple{Int64, Float64}}:
 (t = 0, p = 0.4829278551580626)
 (t = 10, p = 0.4985642165220812)
 (t = 20, p = 0.5088587711021509)
 (t = 30, p = 0.48814021971874477)
 (t = 40, p = 0.5024018246441164)
...

This gets you a uniform time series that you can do further computations with.

1 Like

The indexing solution does work for unevenly spaced data, but the window is defined in number of points, not spatial dimensions.

1 Like

needed

using IntervalSets

on my setup.

Thanks to everyone, but I ended up writing my own solution, being unable to read the one-liner by aplavin (or rather reluctant to read docs to StructArrays, FlexiJoins, IntervalSets) :blush:

using Random, Statistics

# create test data set
Random.seed!(18122022) 
xdata = sort((20:100) .+ 10 .* rand.()) 
deleteat!(xdata, 40:60) # create a gap
ydata = xdata .+ 3 .* rand.()

# will use integer indexes: x = spacing * i
# this will simplify operations between channels
span_idx(x, spacing, mode) = Int(round(x/spacing, mode))

# the span of indices s such that xs[begin] <= spacing*s.start , spacing*s.stop <= xs[end]
getspan(xs, spacing) = span_idx(xs[begin], spacing, RoundUp) : span_idx(xs[end], spacing, RoundDown)

# get x-points within idx*spacing ± spacing/2
function in_span(xs, idx, spacing)
    x0 = idx*spacing
    start = searchsorted(xs, x0-spacing/2).start
    stop  = searchsorted(xs, x0+spacing/2).stop
    start > stop && return nothing
    return start:stop
end

# median values for each span
function med(xs, ys, spacing)
    span = getspan(xs, spacing)
    ymed = Vector{Float64}(undef, length(span))
    for i in eachindex(span)
        ix = in_span(xs, span[i], spacing)
        ymed[i] = isnothing(ix) ? NaN : median(ys[ix]) # set to NaN if there are no x-points within the window
    end
    return (;span, ydata=ymed, spacing)
end

(span, ys, spacing) = med(xdata, ydata, 10)
xs = spacing .* span

julia> ys
8-element Vector{Float64}:
  31.27409597805495
  41.07615877946975
  51.24513527900122
  59.62578121779753
 NaN
  85.00400046877309
  93.13812424712954
 102.08977513815768

julia> xs
30:10:100