Binary latent variable for occupancy modelling using Turing

I am trying to implement a simple species occupancy model, which makes use of a latent binary variable. I’ve tried to get inspiration from this repository, but got confused by the approach and tried to start from even simpler grounds.

The code is as follows. I start with simulating data:

Data Simulation
using Turing
using MCMCChains, StatsPlots
using Random

# Simulate the data
N_visits = 100
N_sites = 10

# True parameters
ψ = 0.8 # probability of occupancy
p = 0.4 # probability of detection

# Occupancy data
pres_vec = [rand(Bernoulli(ψ), N_visits) for i in 1:N_sites]

# Detection data
dets_vec = [rand.(Bernoulli.(pres*p)) for pres in pres_vec]

# Sites
sites = 1:N_sites

The model code follows:

# Model
@model occupancy(D, S) = begin
    # Priors
    ψ ~ Beta(1,1)
    p ~ Beta(1,1)

    # Initialize Z
    Z = Vector{Int}(undef, length(S))
    # Z ~ filldist(Bernoulli(ψ), length(S))

    for site in S
    	Z[site] ~ Bernoulli(ψ)
    	μ = Z[site] * p
	    for visit in 1:length(D[site])
	        D[site][visit] ~ Bernoulli(μ)


Which I sample with HMC.

sampler =  HMC(0.01, 10)
nsamples = 1000
chain = sample(occupancy(dets_vec, sites), sampler, nsamples);

plot(chain[["p", "ψ"]]; colordim=:parameter, legend=true)

This samples terribly.

julia> chain |> summarystats
Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   ess_per_sec 
      Symbol   Float64   Float64   Float64    Float64    Float64   Float64       Float64 

           ψ    0.8393    0.0000    0.0000        NaN        NaN       NaN           NaN
           p    0.8693    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[1]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[2]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[3]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[4]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[5]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[6]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[7]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[8]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
        Z[9]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN
       Z[10]    1.0000    0.0000       NaN        NaN        NaN       NaN           NaN

I’d like some help in how to properly specify the latent Z variable. I have tried to use a vector of undef, as well as using filldist, but I am not getting anywhere.

What am I doing wrong in my specification of this binary latent variable? Is Turing like stan in that you have to integrate out categorical latent variables?

I have tried to implement the marginalization, to no avail.

using StatsFuns

@model occupancy(D) = begin

    # Q sums 
    Q = sum.(D)
    # Priors
    ψ ~ Beta(1,1)
    p ~ Beta(1,1)

    for (i, q) in enumerate(Q)
        if q > 0
            Turing.@addlogprob! log(ψ)
            for d in D[i]
                Turing.@addlogprob! logpdf(Bernoulli(p), d)
        elseif q == 0
            visit_sum = log(ψ)
            for _ in D[i]
                visit_sum += log(1-p)
            Turing.@addlogprob! logsumexp([visit_sum, log(1-ψ)])

I’d say the first thing you’d want to do is maybe figure out where you can vectorize some operations (I don’t think this will do anything to the results, but. might help with speed) to then be able to write vectorized sampling statements e.g. Z .~ Bernoulli.(phi) one way I’ve accomplished that is by introducing indexing arrays that you feed into your model, Z[idxs] .~ Bernoulli.(phi) for example.

But actually it looks like the index of Z does not matter, so you can do

Z ~ filldist(dist, S) # this will both create Z and sample

But aside from that, it might be helpful to sample in the space of Gaussians, and then transform to the appropriate range:

# instead of
p ~ Beta(1,1);

# try this
lam_p ~ Normal(-1, 1);
p = logistic(lam_p);

sampling from Gaussians in my experience, across almost all problems leads to faster models, better rhat, and larger ESS. Finally, try NUTS() , maybe this will work?

@model function occupancy(D, S) 
    lam_psi ~ Normal(-1, 0.5) # will have to play around with these
    lam_p ~ Normal(-1, 0.5) # to get the correct setting, or set them as hyper priors

    psi = logistic(lam_psi)
    p = logistic(lam_p)

    Z ~ filldist(Bernoulli(psi), S)

    u = Z * p

    for site in 1:S
        D[site] .~ Bernoulli.(u)

sampler = NUTS()
nsamples = 1000
chain = sample(occupancy(dets_vec, sites), sampler, nsamples);
1 Like

Your marginalized model is working for me. However, I don’t think your data generation matches your models.

This produces a 10 x 100 array of true occupation data (pres_vec) and a 10 x 100 array of detection data (dets_vec). Based on the occupation data, it would seem that the animal is truly moving in and out of each site, as if you have one sensor at each site and you’re collecting a time series.

However, your models (and, I think, the two resources you linked) suggest that they’re either in the whole time or out, as if you’re collecting data from 100 sensors per site at one time point. For example

specifies only one true occupancy datum per site.

Assuming that model is correct (and I apologize, as I’m not familiar with these models so I might be making some naïve assumptions), then you probably want something more like this, where you first simulate a single occupancy observation per site and then simulate your detection data per visit based on the occupancy datum.

pres_vec = rand(Bernoulli(ψ), N_sites)
dets_vec = [rand.(Bernoulli.(pres * p), N_visits) for pres in pres_vec]
1 Like

Yes, I think that is the solution! Thanks for catching this @simonbrauer , it was driving me crazy.