I want to fit a model to multiple datasets simultaneously, with some fitted parameters being shared across datasets. Is it possible to do this using Turing.jl?
Would it work for you to make the model a sub model and then have a single model that declared whatever parameters are shared across models, defines one sub model for each data subset, and then conditions on all data?
Seth, that sounds reasonable. The question is how to implement it in Turing.jl. I’m not sure where to begin..
Here’s an example I made up for myself recently when I was trying to figure out how to do this. There are n = 5
datasets here that are quadratic regressions. The linear coefficient b
and residual nosise σ
are fitted independently for each dataset inside the Regression
submodel, while the intercept a
and quadratic coefficient c
are defined at the top level in MultiRegression
and passed as data to the submodels.
using Turing
using StatsPlots
n = 5
a = 1.0 + 2randn()
bb = 2.0 .+ 2randn(n)
c = 5randn()
xx = [2rand(100) for _ in 1:n]
yy = [a .+ b.*x .+ c.*x.^2 .+ randn.() for (b, x) in zip(bb, xx)]
scatter(xx, yy)
@model function Regression(x, y, pars)
b ~ Normal()
σ ~ Gamma()
μ = pars.a .+ b .* x .+ pars.c .* x.^2
y ~ MvNormal(μ, σ)
end
@model function MultiRegression(xx, yy)
n = length(xx)
a ~ Normal()
c ~ Normal()
stocks = Vector(undef, n)
for i in 1:n
pars = (; a, c)
stocks[i] ~ to_submodel(Regression(xx[1], yy[i], pars))
end
end
model = MultiRegression(xx, yy)
chain = sample(model, NUTS(), 1000)
plot(chain)
This seems to work pretty well, though there may be better ways to do it I don’t know about.
If the model is linear/polynomial then just set it up like so. From ElOceanografo’s example:
using LinearAlgebra
Xb = cat(xx..., dims=(1,2))
Xc = vcat(xx...)
Xa = ones(size(Xc, 1))
X = hcat(Xa, Xc.^2, Xb)
Y = vcat(yy...)
and solving with least squares:
# phat contains the estimate of [a;c;bb]
phat = (X'X) \ (X'Y)
or I imagine as Bayesian regression in Turing.jl.
Sorry for the delay – I was away on leave..
Thanks for the advice, I’m pretty confident this is the way forward. However, I’m having some difficulties getting the code running. I’m guessing your to_submodel
function may be the key here, as this is where things are going wrong for me.
I’ve whipped up an example that might help explain my problem:
using Turing
using DifferentialEquations
using StatsPlots
function SIR(du, u, p, t)
β, γ = p
Sn, In, Rn = u
Ntot = (Sn+In+Rn)
du[1] = - β * Sn*In/Ntot # S
du[2] = + β * Sn*In/Ntot - γ*In # I
du[3] = + γ*In # R
return nothing
end
@model function fitSIRtoR(data, timesteps, fwdSim)
σ ~ InverseGamma(2, 3)
β ~ truncated(Normal(1.2, 0.5); lower=0.1,upper=2)
γ ~ truncated(Normal(3.0, 0.5); lower=0.0, upper=2)
p = [β,γ]
predicted = solve(fwdSim, Tsit5(); p=p, saveat=timesteps)
for i in eachindex(timesteps)
data[i] ~ Normal(predicted[3,i], σ^2)
end
return nothing
end
# Generate data
u0 = [99.0, 1.0, 0.0]
p = [1.0, 0.1]
tspan = (0.0, 50.0)
fwdSim = ODEProblem(SIR, u0, tspan, p)
sol = solve(fwdSim, Tsit5(); saveat=1)
odedata = sol[3,:]
ind = trunc.(Int64, collect(range(1, stop = length(sol.t), length = 8)))
samples = odedata[ind]
timesteps = sol.t[ind]
# Fit model
model = fitSIRtoR(odedata, timesteps, fwdSim)
chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=true)
plot(chain)
And here’s a different version where I try to incorporate your solution for multiple datasets:
@model function Regression(data, timesteps, fwdSim, pars)
# Fitted independently for each dataset
σ ~ InverseGamma(2, 3)
β ~ truncated(Normal(1.2, 0.5); lower=0.1,upper=2)
p = vcat(β, pars)
predicted = solve(fwdSim, Tsit5(); p=p, saveat=timesteps)
for i in eachindex(timesteps)
data[i] ~ Normal(predicted[3,i], σ^2)
end
data
end
@model function MultiRegression(datasets, timesteps, fwdSim)
γ ~ truncated(Normal(3.0, 0.5); lower=0.0, upper=2)
pars = [γ]
n = size(datasets)[2]
y = Vector(undef, n)
for i in 1:n
y[i] ~ Regression(datasets[:,i], timesteps, fwdSim, pars)
end
return nothing
end
# Generate data
u0 = [99.0, 1.0, 0.0]
p = [1.0, 0.1]
tspan = (0.0, 50.0)
fwdSim = ODEProblem(SIR, u0, tspan, p)
sol = solve(fwdSim, Tsit5(); saveat=1)
odedata = sol[3,:]
ind = trunc.(Int64, collect(range(1, stop = length(sol.t), length = 8)))
samples = odedata[ind]
timesteps = sol.t[ind]
noise = 0.5
numDatasets = 4
d = Normal.(samples, noise)
td = truncated.(d, 0.0, 1)
rtd = rand.(td, numDatasets)
datasets = mapreduce(permutedims, vcat, rtd)
# Fit data to model
model = MultiRegression(datasets, timesteps, fwdSim)
chain = sample(model, NUTS(), MCMCSerial(), 1000, 3; progress=true)
plot(chain)
I get the following error:
ArgumentError: the right-hand side of a ~ must be a Distribution or an array of Distribution's
Any idea what I might be doing wrong?
Thanks again to everyone for past and future help.
Inside MultiRegression
, you just need to wrap Regression
in a to_submodel
call and it will run. I.e. change the line inside the loop to
y[i] ~ to_submodel(Regression(datasets[:,i], timesteps, fwdSim, pars))
and it will run. See Turing’s submodel docs: Submodels – Turing.jl
For some reason, for me, to_submodel
isn’t recognised as a function. I’m using Turing v0.40.3
, so it should be there. And I’m assuming DynamicPPL
is a dependency of Turing
so I don’t need to add that. I’m sure it’s something stupid I’ve missed. Any ideas?
Thanks again.
It’s got to be a package/environment issue…your code ran for me in a temp environment. First advice would be to update your packages and restart your Julia session. What’s the output of pkg> st
?
I needed a restart. Rookie mistake. Thanks so much!! You’ve been a star