That’s an interesting suggestion @Christopher_Fisher, thanks. The issue I have with the Normal() is that all of these parameters have “natural” bounds, i.e. they cannot (in the biological sense) be negative, and those that are probabilities (such as eta) cannot be larger than 1. Would you suggest a truncated Normal for these?
The transformation maps the whole real line to (0, 1), so there is no need to truncate. Just be clear that the actual parameters are the ones after transformation.
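For concreteness, a minimal Turing sketch of that idea (the model and parameter names here are hypothetical, not from this thread):

```julia
using Turing, Distributions

# Sample an unconstrained parameter on the real line, then map it into
# (0, 1) with the standard normal CDF (the probit transform), so no
# truncation is needed.
@model function probit_demo(y)
    eta_raw ~ Normal(0, 1)          # unconstrained, lives on the real line
    eta = cdf(Normal(), eta_raw)    # probit-transformed, lives in (0, 1)
    y ~ Bernoulli(eta)
end
```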
I think we should look more carefully at the likelihood. I have rarely used the negative binomial, so I’d have to look up its properties, but I think one of the issues is that when your model predicts higher counts, the uncertainty is large enough that zero is still an acceptable observation. This lets the model predict waves that don’t appear in the data and get away with it.
Try using something like a gamma likelihood. Set up the shape so that when the model predicts a nontrivial number of cases, the density near 0 goes to zero… This strongly penalizes predictions that don’t come true.
For example, near time 170 there is a predicted wave, but in reality nothing was going on.
Also, when predicting nothing, like near time 300, we should have a tight likelihood that strongly penalizes the bad prediction, because the actual data was in the thousands. A hedged sketch of the gamma idea is below.
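All values in this sketch are placeholders, not from the actual model:

```julia
using Distributions

# Gamma parameterized so the mean equals the model prediction.
# With shape k > 1 the density at zero is exactly 0, so an observed
# count near 0 strongly penalizes a predicted wave.
k = 5.0                            # shape; any k > 1 kills the density at 0
prediction = 500.0                 # placeholder predicted case count
lik = Gamma(k, prediction / k)     # mean = shape * scale = prediction
pdf(lik, 0.0)                      # 0.0: observed zeros are heavily penalized
```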
My somewhat limited understanding is that the negative binomial is used for count data in which the variance differs from the mean (a Poisson distribution requires them to be equal). A Poisson might be a simpler alternative to explore, but I’m not sure that the negative binomial is introducing a problem.
Edit: A zero-inflated variant might be worth considering if zeros are problematic.
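If it helps, one hedged way to sketch a zero-inflated negative binomial with stock Distributions.jl tools (all parameter values are placeholders):

```julia
using Distributions

w = 0.3                          # probability of an "extra" structural zero (assumption)
zinb = MixtureModel([Dirac(0), NegativeBinomial(10, 0.5)], [w, 1 - w])

logpdf(zinb, 0)                  # zeros get mass from both components
logpdf(zinb, 5)                  # nonzero counts come only from the NegBin part
```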
Yes, @Christopher_Fisher is right: these are discrete count data, so we commonly use a Poisson or negative binomial for the likelihood function. There is typically overdispersion in the type of data I use for my models, so the NegBin is very common. I also don’t think this causes the problem, although if the dispersion parameter is very large, it might favour solutions for which the trajectory fit is poor, as @dlakelan mentioned above. So a tight prior is needed on the dispersion parameter, one which favours smaller values of psi.
Yes, it’s count data, but it’s well away from the small-count regime and can instead be treated as continuous.
Just for experimentation, try using a Normal(prediction, 0.1*prediction + 10) likelihood to see what I mean.
Edit: if you’re predicting 0, it should still have a finite standard deviation, so inflate it by a couple of counts as I did.
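In Turing that suggestion might look roughly like this (the model name and `pred`, the ODE-based predicted counts, are hypothetical):

```julia
using Turing

@model function counts_normal(data, pred)
    for i in eachindex(data)
        # sd grows with the prediction; the +10 keeps it finite when pred is 0
        data[i] ~ Normal(pred[i], 0.1 * pred[i] + 10)
    end
end
```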
Thanks for the suggestion @dlakelan. Unfortunately this did not change anything regarding convergence. This is the traceplot for a ~ Normal(prediction, 0.1*prediction + 10) likelihood. Again, 2/6 chains converged. This is with the priors at the original scale though, no rescaling.
What are the priors you’re using? Still uniform for beta and eta? I don’t understand how beta could possibly converge to 1 if you had a reasonable prior like Beta(5, 20) or something. And maybe Beta(2, 20) for psi and Beta(1, 20) for eta.
If it still converges there even then, it must be because the trajectories contain so many zeros… anything that predicts 0 for a long portion of the time series will fit very well in those regions.
It’s particularly problematic when near-zero is also predicted tightly, because you have hundreds of observations that are near 0. Perhaps a likelihood of something like
Normal(prediction, 0.1*prediction + 200)
so that predictions near 0 are not super sharp.
Also, sometimes a model just won’t sample from a randomly chosen initialization position. Often I’ll either do an optimization to get an initial condition, or I’ll guess a decent initial condition, or try sampling from the informed prior.
What does the lp value look like in these different chains? I’m guessing some of these are really far from the converged lp value; they just can’t move because they’re on an island where things look worse in all directions. If that’s the case, initial conditions probably need to be set.
Yes, this is the main issue here. I have also had this with other models. I ended up discarding the chains that didn’t converge, but ultimately this isn’t the best approach to the problem. As mentioned before, parallel tempering is probably worth a try, but it isn’t implemented in Turing yet.
If there are meaningful solutions that are isolated, then parallel tempering is needed. If, however, there is just one main solution and you’re simply not getting into its region because of initialization… then it’s really about setting initial conditions.
Yes, this is with Beta priors for all parameters. For some reason, the chains can still go to a very unlikely region of the prior, or, more likely, they were initialized in an unlikely region and could not leave it.
If you have informed priors on everything, try sampling from the prior and then using those samples to initialize your chains. This is likely enough to get you a converged solution.
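A rough sketch of that, assuming the usual Turing API and a model object `model` (hypothetical here); it also assumes the column order of `Array(chain)` matches the sampler’s parameter order:

```julia
using Turing

# Draw a handful of samples from the prior, one starting point per chain.
prior_draws = Array(sample(model, Prior(), 6))   # 6 draws × n_params

# Hand one parameter vector to each of the 6 chains via `init_params`.
inits = [prior_draws[i, :] for i in 1:6]
chains = sample(model, NUTS(), MCMCThreads(), 1_000, 6; init_params = inits)
```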
What I have done in the past is to optim() the initial parameter guesses, but this obviously works only for one chain. In order to assess convergence, I need several chains, but I can’t initialize them with the same values since the chain is deterministic.
I’ve used Optim and then perturbed the result with normally distributed noise in the past… but I think just sampling from your informed priors should be sufficient.
Yes, that might work; I’ll try it out later. Maybe it would make sense to implement it as an option in the Turing package, since others might have the same issue with their models.
Just to clarify, parallel tempering would only help if your posterior distribution is multimodal. I thought it was because the log likelihood appeared to be periodic with respect to phi. However, it turned out that was an error from the mutating parameter vector.
Turing provides an example of using maximum a posteriori optimization here: https://turing.ml/v0.21/docs/using-turing/guide
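Following that guide, a minimal sketch (the model object is hypothetical), combined with the noise-perturbation idea from above to get distinct starts per chain:

```julia
using Turing, Optim

# Maximum a posteriori estimate via Turing's Optim integration.
map_est = optimize(model, MAP())

# Underlying parameter vector at the mode (assumed field layout),
# jittered with normal noise so the 6 chains start at distinct points
# (the 0.1 noise scale is an assumption).
theta_hat = vec(map_est.values.array)
inits = [theta_hat .+ 0.1 .* randn(length(theta_hat)) for _ in 1:6]
```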
The log likelihood surfaces above were generated with all other parameters fixed to their true values, so they look smooth. But when the other parameters vary as well (which happened by accident in the phi plot, for example), you get a difficult posterior geometry. I think this is what is happening during inference for this model.
Ah. Yes. Good point. I didn’t consider the possibility that periodicity might not be a general property of the model. In that case, the probit transformation would not help with that problem, but it could improve sampling for other models with highly constrained scales.
So I tried the probit approach as you suggested @Christopher_Fisher, but to no avail. Only 1/6 chains converged to the true values (the yellow one). I think this result is even worse than with the untransformed parameters.
I wonder if there is a way to demonstrate that the problem is multimodality rather than other factors that would cause the parameters to get stuck. Perhaps the log likelihood distributions would look similar across chains if they each land on a different mode, whereas chains stuck in areas of low posterior probability would show quite different log likelihood distributions. I think this is the code you would need:
using StatsPlots  # provides `histogram` for MCMCChains objects
histogram(chains[:log_density], alpha=.5, legend=:outerright)