Fitting multiple datasets simultaneously

I want to fit a model to multiple datasets simultaneously, with some fitted parameters being shared across datasets. Is it possible to do this using Turing.jl?

Would it work for you to make the model a submodel, and then have a single top-level model that declares whatever parameters are shared, defines one submodel for each data subset, and then conditions on all the data?

Seth, that sounds reasonable. The question is how to implement it in Turing.jl; I’m not sure where to begin.

Here’s an example I made up for myself recently when I was trying to figure out how to do this. There are n = 5 datasets here, each a quadratic regression. The linear coefficient b and residual noise σ are fitted independently for each dataset inside the Regression submodel, while the intercept a and quadratic coefficient c are defined at the top level in MultiRegression and passed as data to the submodels.

using Turing
using LinearAlgebra
using StatsPlots

n = 5
a = 1.0 + 2randn()
bb = 2.0 .+ 2randn(n)
c = 5randn()

xx = [2rand(100) for _ in 1:n]
yy = [a .+ b.*x .+ c.*x.^2 .+ randn.() for (b, x) in zip(bb, xx)]
scatter(xx, yy)

@model function Regression(x, y, pars)
    b ~ Normal()
    σ ~ Gamma()
    μ = pars.a .+ b .* x .+ pars.c .* x.^2
    y ~ MvNormal(μ, σ^2 * I)
end

@model function MultiRegression(xx, yy)
    n = length(xx)
    a ~ Normal()
    c ~ Normal()

    stocks = Vector(undef, n)
    for i in 1:n
        pars = (; a, c)  # shared parameters passed to each submodel as data
        stocks[i] ~ to_submodel(Regression(xx[i], yy[i], pars))
    end
end

model = MultiRegression(xx, yy)

chain = sample(model, NUTS(), 1000)
plot(chain)

This seems to work pretty well, though there may be better ways to do it that I don’t know about.
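
As a quick sanity check, you can compare the posterior means of the shared parameters against the true values used to simulate the data; something like

using Statistics

println("a: true = $a, posterior mean = $(mean(chain[:a]))")
println("c: true = $c, posterior mean = $(mean(chain[:c]))")

should show them landing close to each other.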

If the model is linear/polynomial, you can set the whole thing up as a single design matrix. Using ElOceanografo’s example data:

using LinearAlgebra

Xb = cat(xx..., dims=(1,2))  # block-diagonal: one column of x values per dataset, for the per-dataset b's
Xc = vcat(xx...)             # all x values stacked, for the shared quadratic term
Xa = ones(size(Xc, 1))       # intercept column
X = hcat(Xa, Xc.^2, Xb)      # design matrix with columns ordered [a, c, b_1, ..., b_n]
Y = vcat(yy...)              # responses stacked in the same dataset order

and solving with least squares:

# phat contains the estimate of [a;c;bb] 
phat = (X'X) \ (X'Y)

or, I imagine, as a Bayesian regression in Turing.jl.
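
Something along these lines should work as the Turing version (an untested sketch; the model name DesignMatrixRegression, the coefficient vector coefs, and the priors are just placeholders I’ve made up):

@model function DesignMatrixRegression(X, Y)
    k = size(X, 2)
    coefs ~ MvNormal(zeros(k), 4.0 * I)   # joint prior on [a, c, b_1, ..., b_n]
    σ ~ Gamma()
    Y ~ MvNormal(X * coefs, σ^2 * I)
end

chain_lm = sample(DesignMatrixRegression(X, Y), NUTS(), 1000)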

Sorry for the delay; I was away on leave.

Thanks for the advice; I’m pretty confident this is the way forward. However, I’m having some difficulty getting the code to run. I’m guessing the to_submodel call may be the key here, as this is where things are going wrong for me.

I’ve whipped up an example that might help explain my problem:

using Turing
using DifferentialEquations
using StatsPlots

function SIR(du, u, p, t)

    β, γ = p
    Sn, In, Rn = u

    Ntot = (Sn+In+Rn)
    du[1] = - β * Sn*In/Ntot            # S
    du[2] = + β * Sn*In/Ntot - γ*In     # I
    du[3] = + γ*In                      # R

    return nothing
end

@model function fitSIRtoR(data, timesteps, fwdSim)

    σ ~ InverseGamma(2, 3)
    β ~ truncated(Normal(1.2, 0.5); lower=0.1, upper=2)
    γ ~ truncated(Normal(3.0, 0.5); lower=0.0, upper=2)
    
    p = [β,γ]
    predicted = solve(fwdSim, Tsit5(); p=p, saveat=timesteps)

    for i in eachindex(timesteps)
        data[i] ~ Normal(predicted[3,i], σ^2)
    end

    return nothing
end

# Generate data
u0 = [99.0, 1.0, 0.0]
p =  [1.0, 0.1]
tspan = (0.0, 50.0)
fwdSim = ODEProblem(SIR, u0, tspan, p)
sol = solve(fwdSim, Tsit5(); saveat=1)

odedata = sol[3,:] 
ind = trunc.(Int64, collect(range(1, stop = length(sol.t), length = 8)))
samples = odedata[ind]
timesteps = sol.t[ind]


# Fit model
model = fitSIRtoR(samples, timesteps, fwdSim)
chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=true)
plot(chain)
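
As a quick check, the posterior means of β and γ can be compared with the true values p = [1.0, 0.1] used to simulate the data:

using Statistics

mean(chain[:β]), mean(chain[:γ])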

And here’s a different version where I try to incorporate your solution for multiple datasets:

@model function Regression(data, timesteps, fwdSim, pars)

    # Fitted independently for each dataset
    σ ~ InverseGamma(2, 3)
    β ~ truncated(Normal(1.2, 0.5); lower=0.1, upper=2)
  
    p = vcat(β, pars)
    predicted = solve(fwdSim, Tsit5(); p=p, saveat=timesteps)

    for i in eachindex(timesteps)
        data[i] ~ Normal(predicted[3,i], σ^2)
    end

    return data
end


@model function MultiRegression(datasets, timesteps, fwdSim)

    γ ~ truncated(Normal(3.0, 0.5); lower=0.0, upper=2)
    pars = [γ]

    n = size(datasets)[2]

    y = Vector(undef, n)
    for i in 1:n
        y[i] ~ Regression(datasets[:,i], timesteps, fwdSim, pars)
    end
    
    return nothing
end

# Generate data
u0 = [99.0, 1.0, 0.0]
p =  [1.0, 0.1]
tspan = (0.0, 50.0)
fwdSim = ODEProblem(SIR, u0, tspan, p)
sol = solve(fwdSim, Tsit5(); saveat=1)

odedata = sol[3,:]
ind = trunc.(Int64, collect(range(1, stop = length(sol.t), length = 8)))
samples = odedata[ind]
timesteps = sol.t[ind]

noise = 0.5
numDatasets = 4
d = Normal.(samples, noise) 
td = truncated.(d, 0.0, Inf)   # keep the noisy observations non-negative
rtd = rand.(td, numDatasets)
datasets = mapreduce(permutedims, vcat, rtd)

# Fit data to model
model = MultiRegression(datasets, timesteps, fwdSim)
chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=true)
plot(chain)

I get the following error:

ArgumentError: the right-hand side of a ~ must be a Distribution or an array of Distribution's

Any idea what I might be doing wrong?

Thanks again to everyone for past and future help.

Inside MultiRegression, you just need to wrap Regression in a to_submodel call. I.e. change the line inside the loop to

y[i] ~ to_submodel(Regression(datasets[:,i], timesteps, fwdSim, pars))

and it will run. See Turing’s submodel docs: Submodels – Turing.jl

For some reason, for me, to_submodel isn’t recognised as a function. I’m using Turing v0.40.3, so it should be there. And I’m assuming DynamicPPL is a dependency of Turing so I don’t need to add that. I’m sure it’s something stupid I’ve missed. Any ideas?

Thanks again.

It’s got to be a package/environment issue; your code ran for me in a temp environment. My first advice would be to update your packages and restart your Julia session. What’s the output of pkg> st?

I needed a restart. Rookie mistake. Thanks so much!! You’ve been a star :star_struck: