Data stitching

I’m looking for a package or an algorithm, alternatively an explanation of why what I want is impossible or bad:

I have a set of data x1, y1 ≈ f.(x1) and one x2, y2 ≈ a*f.(x2) .+ b for some natural black box function f and constants a, b. I would like to merge these two datasets by scaling and shifting y2, essentially finding a and b. The physical motivation is, for instance, spectrometry of the same features using different integration times.

If enough of x2 has “close neighbors” in x1, this is easy to do decently manually, but I find surprisingly little information on automating the process. My current approach is to minimize wrt p the sum over each possible pair of points (x1[i], y1[i]), (x2[j], y2[j]) of the error function

(p[1]*y2+p[2] - y1)^2/(1 + (x2 - x1)^2/w^2)

where the points are weighted by the lorentzian 1/(1 + Δx^2/w^2) of width w chosen as some multiple of mean(diff(x)). Results are not great so far.

Has anyone here seen anything similar to this? Image stitching seems adjacent, but not a perfect match.

MWE:

using Plots, Optim
f(x) = exp(-x^2)
x1 = randn(50) .- 1.0
x2 = randn(30) .+ 1.0
y1 = f.(x1)
y2 = 1.5*f.(x2) .+ 0.5
p1 = plot([x1, x2], [y1, y2]; seriestype=:scatter) # don't line up
w = 0.5*sum(diff(sort(vcat(x1, x2))))/(length(x1)+length(x2))
error_fun(p) = sum(
    (p[1]*y2 .+ p[2] - y1)^2/(1 + (x2 - x1)^2/w^2) for (x1, y1) in zip(x1, y1), (x2, y2) in zip(x2, y2)
)
result = optimize(error_fun, [1.0, 0.0]).minimizer
p2 = plot(vcat(x1, x2), vcat(y1, result[1]*y2 .+ result[2]); seriestype=:scatter) # should line up
plot!(p2, f; seriestype=:path)
plot(p1, p2)

stitch

1 Like

My usual thing when faced with unknowns is to formulate a Bayesian inference problem.

I’m going to assume that you don’t have an explicit form for the f() function. In that case, at least what you have is some noisy measures of f() in your plot maybe that’s the blue data. And you have some noisy measures of a*f()+b the red data.

I would choose some small ranges of data, and say for each range that a line fits through the data… (let’s call this a set of of local taylor series of f)

blue ~ Normal((m*(x-xcenter) + c),scale)
red ~ Normal(a * (m*(x-xcenter) + c) + b, scale*a)

where there’s a different m and c for each small range and xcenter is the center of each range. You probably don’t need to deal with all the ranges, but several ranges where m varies is probably preferred because you will get better inference for a I think.

The a and b would be shared across all the ranges, and the m and c would be different for each range.

Sample from this posterior distribution using Turing, and then you’ll see not only inference for a and b but also their uncertainty, and the correlation between them.

1 Like

I can think here of two approaches. One - maybe the data can be described by some relatively simple function with a few (3…6) parameter - a polynom, rational function, gauss, some combination of these, whatever (“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” :slightly_smiling_face:). It is better if there were some physical rational behind, but not necessary. Then, first fit the first dataset, then add the a and b to the parameters and fit the combined dataset.

Another approach - local one. Interpolate both data separately onto a regular grid, then you have your “new” data all with the same x-values, then just do LSQ fit for a and b .

2 Likes

Here is my attempt (please note that statistics is something I would really like to be good at; but I’m not…)

If the sample points are dense enough, a simple interpolation could be sufficient to approximate f.(x2). Once we have an approximation of f.(x2) the rest is linear regression.

using CurveFit, Interpolations

f(x) = exp(-x^2)
x1 = randn(50) .- 1.0
x2 = randn(30) .+ 1.0
y1 = f.(x1)
y2 = 1.5*f.(x2) .+ 0.5

p = sortperm(x1)
permute!(x1, p)
permute!(y1, p)
f_fit = linear_interpolation(x1, y1)

# restrict to range of x1  (otherwise one would need a better extrapolation model)
ind = minimum(x1) .< x2 .< maximum(x1)
b, a = linear_fit(f_fit( x2[ind] ), y2[ind])
# a = 1.5092573052474938
# b = 0.4960624235320712
1 Like

Not if there is significant noise in the measurements, interpolation will be fitting a lot of noise

@SteffenPL basically did what I wanted. So this is just another version, incorporating intuition from original and other posts that close points are key. Luckily we have to estimate just two parameters, and we may get over some noise.
Now the code:

using IterTools

f(x) = exp(-x^2)
x1 = randn(50) .- 1.0
x2 = randn(30) .+ 1.0
y1 = f.(x1)
y2 = 1.5*f.(x2) .+ 0.5

function findclosepairs(x1, x2, thresh)
    res = Tuple{Float64,Int,Int}[]
    for i in eachindex(x1), j in eachindex(x2) 
        abs(x1[i]-x2[j])<thresh && push!(res, (abs(x1[i]-x2[j]),i,j))
    end
    sort!(res)
end

function threshsuggest(x1,x2)
    xs = vcat([(x,0) for x in x1], [(x,1) for x in x2])
    sort!(xs)
    xdeltas = sort(map(i->abs(xs[i][1]-xs[i-1][1]), filter(i->xs[i][2] != xs[i-1][2], 2:length(xs))))
    return xdeltas[clamp(length(xdeltas)÷5, 1, length(xdeltas))]
end

function findaffine(y1, y2, cp)
    xx = y1[nth.(cp, 2)]
    yy = y2[nth.(cp, 3)]
    M = [ones(size(xx)) xx]
    v = M \ yy
end

After this prep, we can:

julia> cp = findclosepairs(x1, x2, threshsuggest(x1, x2))
4-element Vector{Tuple{Float64, Int64, Int64}}:
 (0.0018581113413391748, 46, 13)
 (0.0034200499950651775, 18, 2)
 (0.005578601033654262, 47, 15)
 (0.006237570357774835, 17, 2)

julia> findaffine(y1,y2,cp)
2-element Vector{Float64}:
 0.507535289631714
 1.4827932614125852

The actual estimation is more accurate with just 3 pairs.
But I guess this code is a good part of the automation OP discussed (and some more optimizations possible for sure).

1 Like

I’m working up a Turing example based on your MWE and including it as one of my Vignettes for a series on “how to use Julia for data analysis”

Ok, here’s my example. The Turing sampling took 2-3 seconds.


using StatsPlots, Turing, LinearAlgebra,Random

Random.set_global_seed!(1)

f(x) = exp(-x^2)
x1 = randn(50) .- 1.0
x2 = randn(30) .+ 1.0
y1 = f.(x1)
y2 = 1.5*f.(x2) .+ 0.5 ## true values of a = 1.5 and b = 0.5

p1 = plot([x1, x2], [y1, y2]; seriestype=:scatter) # don't line up


display(p1)

## how to determine a,b if we don't know f()?

@model function stitchdata(xbase,ybase,xtform,ytform)
    a ~ Gamma(5,1.0/4) # a is of order 1
    b ~ Gamma(5,1.0/4) # b is of order 1
    m ~ MvNormal(repeat([0.0],4),100.0^2*I(4))
    c ~ MvNormal(repeat([0.0],4),100.0^2*I(4))
    xbneighs = Vector{eltype(xbase)}[]
    ybneighs = Vector{eltype(ybase)}[]
    xtneighs = Vector{eltype(xbase)}[]
    ytneighs = Vector{eltype(ybase)}[]
    centers = (-1.0,-0.5,.5,1.0)
    for (i,dx) in enumerate(centers)
        f = x -> x > dx - .25 && x < dx + 0.25
        xneighindex = findall(f, xbase)
        #@show xneighindex
        push!(xbneighs, xbase[xneighindex])
        push!(ybneighs, ybase[xneighindex])
        tneighindex = findall(f,xtform)
        push!(xtneighs, xtform[tneighindex])
        push!(ytneighs, ytform[tneighindex])
    end
    for (i,(x,y)) in enumerate(zip(xbneighs,ybneighs))
        Turing.@addlogprob!(logpdf(MvNormal(m[i].*(x .- centers[i]) .+ c[i], 0.05^2*I(length(y))), y))
    end
    for (i,(x,y)) in enumerate(zip(xtneighs,ytneighs))
        Turing.@addlogprob!(logpdf(MvNormal(a.*(m[i].*(x .- centers[i]) .+ c[i]) .+ b, 0.05^2*I(length(y))), y))
    end
end

mm = stitchdata(x1,y1,x2,y2)

s = sample(mm,NUTS(400,0.8),200)

plot(s[:,[:a,:b],1])

This produces the data:

And the inference for a and b are:

I just ad-hoc decided on the “neighborhoods” to use.

I had to use the @addlogprob! macro because I’m subsetting the data and Turing is confused about that if I don’t.

Also, it might make more sense to actually have some measurement noise and/or fit the scale of the measurement noise…

when I added Normal(0.0,0.05) noise to the data, I get:

and

5 Likes

Nice! Guess I’m learning Turing… As an addition, one could run a final

y3 = (y2 .- mean(s[:b]))/mean(s[:a])
plot([x1, x2], [y1, y3]; seriestype=:scatter)
plot!(f; seriestype=:path)

to get it back in the domain of data stitching.

Turing certainly gives the prettiest solution. Here is another Turing model for this problem. I’ve tried to simplify a bit the model for people getting anxious from @addlogprob!:

using StatsPlots, Turing, LinearAlgebra,Random, IterTools
Random.set_global_seed!(1)

# prep example data
f(x) = exp(-x^2)
x1 = randn(50) .- 1.0
x2 = randn(30) .+ 1.0
y1 = f.(x1)
y2 = 1.5*f.(x2) .+ 0.5 ## true values of a = 1.5 and b = 0.5

# helper functions for close pairs
function findclosepairs(x1, x2, thresh)
    res = Tuple{Float64,Int,Int}[]
    for i in eachindex(x1), j in eachindex(x2) 
        abs(x1[i]-x2[j])<=thresh && push!(res, (abs(x1[i]-x2[j]),i,j))
    end
    sort!(res)
end

function threshsuggest(x1,x2)
    xs = vcat([(x,0) for x in x1], [(x,1) for x in x2])
    sort!(xs)
    xdeltas = sort(map(i->abs(xs[i][1]-xs[i-1][1]), filter(i->xs[i][2] != xs[i-1][2], 2:length(xs))))
    return xdeltas[clamp(length(xdeltas)÷5, 1, length(xdeltas))]
end

# find close pairs for Bayesian model
cp = findclosepairs(x1, x2, 4*threshsuggest(x1, x2))
xbase = x1[nth.(cp, 2)]
ybase = y1[nth.(cp, 2)]
xtform = x2[nth.(cp, 3)]
ytform = y2[nth.(cp, 3)]

# Bayesian model or "how likely is the data given some parameters"
@model function bayes_stitch(xbase,ybase,xtform,ytform)
    a ~ Gamma(5,1.0/4) # a is of order 1
    b ~ Gamma(5,1.0/4) # b is of order 1
    σ ~ InverseGamma(1.0,1.0)
    # using Exponential(0.1) also good, especially for wackier f(.)
    L = length(xbase)
    for i=1:L
        ytform[i] ~ Normal(a*ybase[i]+b, abs(xtform[i]-xbase[i])*σ)
        # instead of Normal this could be Cauchy for noisier data
    end
end

# create and sample model from closepoint pairs data
bayesmm = bayes_stitch(xbase, ybase, xtform, ytform);
s = sample(bayesmm,NUTS(400,0.8),200)

plot(s[:,[:a,:b,:σ],1])

This gives:

Notes:

  • The resulting variance posterior distribution is not too large, meaning the rest of the data fits the model decently (otherwise model would claim variance is high and everything is noise).
  • The goal of this version is to remove some manual parameters from other model and simplify the model description.
  • Having said that, this model loses a bit of power that @dlakelan 's model gets from large clusters (more than 2 points).
  • There are still a few parameters which should depend on nature of samples and especially nature of f(.):
    • choosing close points threshold (or list size)
    • choosing prior distributions for variance and error
2 Likes

A beginner’s question for probability experts:

For the above models that are free of noise and using a trivial case of dense overlapping input data (x1 = -4:0.1:4; x2 = -4:0.1:4), why are the Turing solutions (highest probability peak, mean or median) always shifted from the true solution (a = 1.5; b = 0.5), as if there is a bias…

Any reference or link would be great.

Did you try this? When there are 2 x points in both x1 and x2, the distributions should very concentrated (and your ranges have tons of x’s in common).

For example, in the model I’ve just ran, there is a varince of abs(x1-x2)*sigma which would be zero in this case (forcing the approx equality to be equality).

This comment applies to the model I’ve shown. dlakelan’s model is a bit more lineant.
The priors introduce a bit of bias (which decreases as more data is given).

Plugging-in such x1 and x2 in the first dlakelan’s model, the results plot like:

Because essentially the sampler is choosing a best-fit line, there is some ambiguity as to the slope, and when you under-estimate the slope (a), you need to compensate with over-estimating the intercept (b).

Thus the uncertainty is in an ellipse-ish shape in a-b space but not an axis-parallel ellipse, and the distributions we are getting are the marginals of this rotated ellipse-ish.

Non-axis ellipses projection are still centered, but in this case, there is a non-linearity in f(.) which introduces heteroscedasticity (changing variance of b as function of a), and this distorts the ellipses and produces this bias.

This can be tested using a linear f such as f(x) = kx+h.

Another issue is the convergence quality / chain correlation and length of the Bayesian sampler chains.

Perhaps someone will give a good link in a reply.

UPDATE:

As suggested, I’ve tested dlakelan model with -4:0.1:4 and linear function f(x). The resulting distributions:

3 Likes

Indeed the priors basically say that both a and b are most likely to be 1.0 so it is pulling the slope down a little and the offset up a little. It would be possible to make a prior which was a little flatter through this region if desired.

Also the line is fitting over a region of width 0.5 but over this region there is some mild nonlinearity. It would be possible to introduce an extra Taylor series term and handle this nonlinearity a little better.

I’ve got a plan to produce a bunch of vignettes about doing common tasks in Julia and put them on YouTube. In addition to that I have some plans for larger multi-part projects. One of my plans is to talk about Bayesian inference in general, and Turing in particular. The thing is that my shortest coverage of these ideas was at least 1 hr long, which I think is too much to plop into my YouTube stream before people have come up to speed.

in any case, I hope you’ll tune in when I get around to doing the Bayesian + Turing version. I’ll post the YouTube playlist probably next week. Right now I have only longer form content and I decided the vignettes are really needed.

3 Likes

This problem appears to be a nice fit for Gaussian processes, i.e., using the following generative model:

\begin{align*} f &\sim GP(\cdot) \\ y_1 &\sim \mathcal{N}(f(x_1), \sigma) \\ y_2 &\sim \mathcal{N}(a \cdot f(x_1) + b, \sigma) \end{align*}

Thus, the Gaussian process describes the latent/unobserved function f and depending on the kernel takes automatically care of smoothing, i.e., drawing information from neighbouring data points. I’m not quite sure which kernel would correspond to your Lorentzian weighting though. Where did you get that formula from? Just a modelling assumptions or is there a more fundamental reason for choosing 1 / \left(1 + \left(\frac{\Delta x}{w}\right)^2\right) to weight neighbouring data points?

Just intuition. Didn’t want stuff far away to factor in, lorentzians are fairly simple. Could just as well have been a gaussian or a tent.

Ok, then you might want to try Gaussian processes which allow you to compare different smoothness assumptions on f as well as learn the effective neighbourhood range via kernel parameters.
When I find some time, I might try to put together a MWE Turing example …

2 Likes