Hidden Markov Model with fixed parameters

I am trying to model a sequence of behaviours with two end points as a hidden markov model. To do this, I tried fixing some parameters in the transition and emission matrices, but I ran into the following error:

MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{DynamicPPL.DynamicPPLTag, Float64}, Float64, 9})

I’ve tried finding related issues and believe that I’d have to be working with Dual numbers as explained by @gdalle in this post: Autodifferentiation of model with fixed and mutable parameters - #3 by gdalle, but I’m not sure how to apply this in the context of Turing.

Some context about the setting:
In the experiment, ravens solved puzzles and we want to understand if they used either a “trial-and-error” or a “goal-oriented” strategy (the hidden states). The observations indicate whether the raven manipulated a “goal-relevant” or “goal-irrelevant” part of the puzzle. The trials are nested in experimental conditions but these aren’t relevant for the MWE below. Finally, the end states represent whether the raven succeeded or failed at solving a given puzzle.

Here’s what I’ve tried so far:

using HiddenMarkovModels
using Turing
using NNlib
using LinearAlgebra

# Simulation for two hidden states and two end states

trans = [0.92 0.05 0.025 0.005;
         0.03 0.94 0.005 0.025;
          0.0  0.0   1.0   0.0;
          0.0  0.0   0.0   1.0]

emiss = [0.9 0.1 0.0 0.0;
         0.1 0.9 0.0 0.0;
         0.0 0.0 1.0 0.0;
         0.0 0.0 0.0 1.0]

init = [0.6, 0.4, 0.0, 0.0]

dists = HiddenMarkovModels.LightCategorical.(eachrow(emiss))

hmm = HMM(init, trans, dists)

obs_seqs = [last(rand(hmm, 100)) for _ in 1:10]

function truncate_at_end(x; endstate=[3, 4])
    i = findfirst(in(endstate), x)
    return isnothing(i) ? x : x[1:i]
end

trunc_obs_seqs = truncate_at_end.(obs_seqs)

# Helper functions for the model

function with_endstates(v, size)
    a = zeros(size)
    for (i, x) in enumerate(v)
        a[i] = x
    end
    return a
end

function with_endstates(M::AbstractMatrix, size; fld=size)
    A = Matrix(Diagonal(ones(size)))
    for (i, m) in enumerate(M)
        A[fldmod1(i, fld)...] = m
    end
    return A
end

forward_loglikelihood(hmm, obs_seq) = only(last(forward(hmm, obs_seq)))

# Turing model

# K = number of hidden states
# E = number of end states

@model function unsupervised_hmm(K, E, obs_seqs)
    π1 ~ Dirichlet(ones(K)) # can't start in the end states
    Θ ~ filldist(Dirichlet(ones(K + E)), K)
    Φ ~ filldist(Dirichlet(ones(K)), K) # only end states can emit end observations

    init = with_endstates(π1, K + E)
    trans = with_endstates(Θ, K + E)
    emiss = with_endstates(permutedims(Φ), K + E; fld=2)
    dists = [Categorical(emiss[:, i]) for i in axes(emiss, 2)]

    for obs_seq in obs_seqs
        hmm = HMM(init, trans, dists)
        Turing.@addlogprob! forward_loglikelihood(hmm, obs_seq)
    end

    return nothing
end

model = unsupervised_hmm(2, 2, trunc_obs_seqs)
chn = sample(model, NUTS(), 500)

Thank you for your help! :slight_smile:

Hi, thanks for using my package!
I think you’re right and this is related to creating arrays with the proper element type for differentiation. These changes might be enough to fix the initial error:


function with_endstates(v, size)
    a = zeros(eltype(v), size)  # FIX: added eltype(v) here
    for (i, x) in enumerate(v)
        a[i] = x
    end
    return a
end

function with_endstates(M::AbstractMatrix, size; fld=size)
    A = Matrix(Diagonal(ones(eltype(M), size)))  # FIX: added eltype(M) here
    for (i, m) in enumerate(M)
        A[fldmod1(i, fld)...] = m
    end
    return A
end

However I get a new error now, not sure where it comes from:

julia> chn = sample(model, NUTS(), 500)
┌ Warning: failed to find valid initial parameters in 10 tries; consider providing explicit initial parameters using the `initial_params` keyword
└ @ Turing.Inference ~/.julia/packages/Turing/oFGEb/src/mcmc/hmc.jl:188
ERROR: failed to find valid initial parameters in 1000 tries. This may indicate an error with the model or AD backend; please open an issue at https://github.com/TuringLang/Turing.jl/issues
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] initialstep(rng::Random.TaskLocalRNG, model::DynamicPPL.Model{…}, spl::DynamicPPL.Sampler{…}, vi_original::DynamicPPL.TypedVarInfo{…}; initial_params::Nothing, nadapts::Int64, kwargs::@Kwargs{})
    @ Turing.Inference ~/.julia/packages/Turing/oFGEb/src/mcmc/hmc.jl:191
  [3] step(rng::Random.TaskLocalRNG, model::DynamicPPL.Model{…}, spl::DynamicPPL.Sampler{…}; initial_params::Nothing, kwargs::@Kwargs{…})
    @ DynamicPPL ~/.julia/packages/DynamicPPL/senfM/src/sampler.jl:130
  [4] step
    @ ~/.julia/packages/DynamicPPL/senfM/src/sampler.jl:113 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/AbstractMCMC/FSyVk/src/sample.jl:159 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/ProgressLogging/6KXlp/src/ProgressLogging.jl:328 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/AbstractMCMC/FSyVk/src/logging.jl:9 [inlined]
  [8] mcmcsample(rng::Random.TaskLocalRNG, model::DynamicPPL.Model{…}, sampler::DynamicPPL.Sampler{…}, N::Int64; progress::Bool, progressname::String, callback::Nothing, num_warmup::Int64, discard_initial::Int64, thinning::Int64, chain_type::Type, initial_state::Nothing, kwargs::@Kwargs{…})
    @ AbstractMCMC ~/.julia/packages/AbstractMCMC/FSyVk/src/sample.jl:142
  [9] sample(rng::Random.TaskLocalRNG, model::DynamicPPL.Model{…}, sampler::DynamicPPL.Sampler{…}, N::Int64; chain_type::Type, resume_from::Nothing, initial_state::Nothing, progress::Bool, nadapts::Int64, discard_adapt::Bool, discard_initial::Int64, kwargs::@Kwargs{})
    @ Turing.Inference ~/.julia/packages/Turing/oFGEb/src/mcmc/hmc.jl:119
 [10] sample
    @ ~/.julia/packages/Turing/oFGEb/src/mcmc/hmc.jl:88 [inlined]
 [11] #sample#6
    @ ~/.julia/packages/Turing/oFGEb/src/mcmc/Inference.jl:321 [inlined]
 [12] sample
    @ ~/.julia/packages/Turing/oFGEb/src/mcmc/Inference.jl:312 [inlined]
 [13] #sample#5
    @ ~/.julia/packages/Turing/oFGEb/src/mcmc/Inference.jl:309 [inlined]
 [14] sample(model::DynamicPPL.Model{…}, alg::NUTS{…}, N::Int64)
    @ Turing.Inference ~/.julia/packages/Turing/oFGEb/src/mcmc/Inference.jl:306
 [15] top-level scope
    @ ~/Documents/GitHub/Julia/Scratchpad/mwe.jl:77
Some type information was truncated. Use `show(err)` to see complete types.

As a side note, why do you need to use Turing.jl? To insert the HMM into some bigger model?

1 Like

Thank you for the great package @gdalle! It’s been a lot of fun to use and reading the source code has been very educational for me, too.

I’ll mark this as solved since your reply fixes what I was initially struggling with. :slight_smile: Thank you!

As a side note, why do you need to use Turing.jl? To insert the HMM into some bigger model?

I decided to use Turing.jl after coming across an example in their new paper where HiddenMarkovModels.jl is used inside of a GARCH model, so the honest answer is probably “for fun” in the first place.

I initially worked on this model in Stan because I wasn’t sure how to account for hierarchical structures using HiddenMarkovModels.jl; the observation sequences here are nested within different ravens and different puzzles.

The first and most intuitive way to handle this is to have one HMM, and thus one set of parameters, per couple (raven, puzzle). That’s rather easy to achieve but you won’t benefit from the whole training set because it will be split up.

Now let’s assume you want to somehow mutualize the models for all ravens / puzzles. For instance, have a part of the model which is common to all birds, and one which depends on the individual. This sounds like a job for control variables, introduced in this tutorial. My package allows you to make the transition matrix and emission distributions at each time step t depend on an external control variable u_t. Here, your control variable could be a tuple (r, p) where r is the index of the raven and p is the index of the puzzle (this control would be constant across each trajectory). Thus you train a single model, but its behavior is influenced by exogenous data.

Does this make sense to you?

1 Like

Thus you train a single model, but its behavior is influenced by exogenous data.

So, the latter model would be different from the former “split up scenario” in that the latter model learns about each raven r_i from all observation sequences where that raven i was involved, and the same for puzzles p_i, instead of the specific (r_i, p_i) pairing only – have I understood that correctly?

If so, this would differ from the model I wrote in Stan by not partially pooling information across ravens. There are some ravens with very few observation sequences, so I wanted to introduce a population parameter to regularise estimation for these ravens.

The same would apply to puzzles, although there are no puzzles with particularly few observation sequences in the data I have.

You could also pool information across ravens or across puzzles! Something like this for example:

struct Control
    raven::Int
    puzzle::Int
end

struct RavenHMM
    # initialization, transitions
    common_emission_means::Vector{Float64}
    individual_emission_means::Matrix{Float64}
end

function obs_distributions(hmm::RavenHMM, control::Control)
    return [
        Normal(hmm.common_emission_means[i] + hmm.individual_emission_means[i, control.raven], 1)
        for i in 1:length(hmm)
    ]
end

Then, depending on the priors you use for common and individual parameters, those will get more or less weight in the final model I guess? There are probably other ways to weigh individual vs collective but I don’t remember my courses on population models ^^

1 Like

The weighting is what I’m struggling with, too. I’d want the model to place more weight on the individual means for those individuals with a lot of available data – if there’s evidence that they deviate from the common means, they should “be allowed to” – while the means of individuals with few data points should mostly be informed by the common means.

This kind of weighting would happen “by default” (maybe better: probability theory) if I introduce population parameters in Turing.jl or Stan, but my knowledge in statistics is unfortunately too poor to understand how to implement this here.

That being said, learning is fun – and even more so in a helpful community like the one here – so I hope I’ll be able to implement this one day. I really appreciate the time you’ve put into replying! :slight_smile:

1 Like

Can you point me to the part of the Turing docs that explains how one would do this? I may be able to translate it.

I’m not aware of any examples for hierarchical models in the Turing docs. There is a section on this in the Stan User’s Guide, but these aren’t the implementation details I believe you’d be looking for anyway?

I dug up some old slides (that I unfortunately wasn’t that interested in when I was still at uni :melting_face: ) on a model for Normal data, and tried to extend these to a hierarchical setting. If I did that correctly, the following should be the posterior for each individual mean in a Turing model:

\begin{aligned} p(\theta_i | \text{data}) &\propto p(\text{data}|\theta_i) p(\theta_i|\theta_{pop}) \\ &\propto \mathcal{N}(\theta_i | \mu_i, \tau^2_{post}) \end{aligned}

where

\frac{1}{\tau^2_{post}} = \frac{n_i}{\sigma^2} + \frac{1}{\tau_{pop}^2}, \\ \mu_i = w_i\bar{x}_i + (1-w_i) \theta_{pop},

and

w_i = \frac{\frac{n_i}{\sigma^2}}{\frac{n_i}{\sigma^2} + \frac{1}{\tau^2_{pop}}}.

\bar{x}_i is the mean of the observed data for each individual.

For \theta_{pop} we’d then have to weight the means of each individual by the number of observations that they each contribute.

Yes, that looks about right (except that using \tau^2 for a variance is a bit uncommon as \tau is often used to refer to the precision, i.e., inverse variance).
This model is not hierarchical though, i.e., your population level parameters \theta_{pop} are fixed, and corresponds to (conditionally) independent models, i.e., you would get the same posterior by only using data from group i. The shrinkage effect, i.e., keeping estimates of small groups close to the prior, is automatic in Bayesian analysis and simply reflects the strength of the prior information (the smaller \tau_{pop}^2 the stronger the prior).
In a hierarchical model also the population parameters are given a prior and can therefore be inferred from the data, i.e., the model is able to “learn” the spread of the data and adjusts the weighting accordingly – similar group level estimates \mu_i \implies strong “prior” around population parameters whereas heterogeneous groups \implies wider spread in estimates (and less shrinkage).

p(data | \theta_i) p(\theta_i | \theta_{pop}) p(\theta_{pop} | \theta_{hyper})

where p(\theta_{pop} | \theta_{hyper}) is the new hyper prior on \mu_{pop}, \tau^2_{pop}. The Stan manual you linked has also some discussion on how to choose hierarchical priors for regression models.

1 Like

I can’t dive into it right now but I just wanna note that everything hierarchical can be done inside Turing itself. The only think HiddenMarkovModels is responsible for will be merging the individual and population parameters, using the control variable trick I gave you.

Thank you @bertschi for the correction and better explanation than I could provide!

I’ll investigate the Turing error that came up after the differentiation issue was fixed and post an update here once I have the full model with hierarchical effects and the control variables up and running.

1 Like

If you struggle with Turing and autodiff, also note that you can incorporate priors directly in HiddenMarkovModels, to perform MAP estimation instead of MLE. All it takes is to overload logdensityof(hmm::MyCustomHMM). Take a look at this tutorial to learn more.

1 Like